Debugging Machine Learning on the Edge with MLExray - Michelle Nguyen, Stanford
All right, awesome. Hi everyone, nice to meet you all. I hope you all had a good lunch. So today I'm going to be telling you about MLExray, and MLExray is essentially an end-to-end debugging platform for your models that are deployed on the edge.
So just a little bit about myself: I'm Michelle, and as I said before, I'm a principal engineer at New Relic working on the Pixie open source project. Pixie is a CNCF sandbox project, an observability tool for Kubernetes, and before my time at New Relic I was Pixie Labs' first engineer; Pixie Labs is where the Pixie project was born out of.
So why should we even talk about debugging deployments of machine learning on the edge? We see today that a lot of products and software are moving their machine learning models to the edge. For example, we have Cruise, which is self-driving; that's a very popular hot topic these days. Essentially your car is going around and picking up a bunch of sensor information: it's using a camera to figure out, am I driving correctly in the lane, and it's using lidar for object detection, to see if there's an obstacle in the way so you don't accidentally hit somebody.

Or we've had the Amazon Echo around for a while, and that listens to you. You're going about your day, just talking normally, and it listens and picks up cues whenever you say "Alexa." All of that is done on the edge: it's picking up sensor information and then basically running a model and figuring out some inference, and what action to take, based on the information it's gathered.

Another example is that you want to deploy your applications onto different phones. On these phones you're running machine learning models to do different things, such as image classification, or, in the case of the Pixel 6, one of the recent things they came out with is the Magic Eraser. All these models are running inside your phone itself. To kind of expand on that:
We have this idea of the traditional model, which is on your left. In this case the sensor is on a separate device, and it is picking up a ton of input data. Let's say, for a Nest thermostat, it's figuring out what the temperature is at this time in this house, and it might want to do something with that data, figure out what it should do with it, and run some inferences on it. So it sends the data to the cloud, where the model is running; the model basically goes and does some inference and then returns a result.

When you move your computation to the edge, what actually happens is that you now have these models running directly on the devices. For the Amazon Echo example from before, you're having the model run directly on the Echo itself, rather than going and running the model in the cloud. And so here, what actually happens is that now you have a bunch of different environments: you can deploy to many different edge devices that are built on different hardware and have different memory and compute resource requirements and whatnot.
So what are the benefits of actually doing this? You can see here from this picture that you are no longer egressing any data out to the cloud. Before, you had a constant stream of data coming in, and then you were sending it out to ask: okay, what should I do with this information, what is the inference that I want to make? But when you move to edge compute, all the information stays within the device itself, and a lot of the time it's just stored in memory. That helps a lot, because now you're not sending something out and waiting on the latency of that network request to come back and tell you, okay, this is what I should do. So that helps a lot with latency and with egress overall. And then you also have security and privacy benefits.
You feel more comfortable when this device is in your home and everything is kind of stored in memory, and then probably at some point eventually expired out, because the device no longer needs that information to make an inference. So there are a lot of security and privacy benefits to moving to the edge. And then lastly, you have scalability. Let's say you have millions of connected devices: in the traditional model you send all that data to your cloud, whereas here, in this case, you're actually handling it per device.
So you run your training data sets, and your model looks great: you're able to accurately detect dogs, from the example earlier today. Then you go and deploy these models to your edge devices. In this image I've labeled these boxes with different colors, because I want to make it very clear that these may not be the same architecture, they may not have the same environment, they may have completely different hardware. You're just deploying these models to these heterogeneous environments, and what can go wrong?
You had this thing running on the cloud and running inferences, and it was able to classify this dog correctly, but now you've deployed it to your iPhone, for example, and it's starting to have some problems. Or in the other case, the bottom one, you deploy it to your Android, and oh man, this model that ran really quickly when you were training it in the cloud now takes 10 seconds, and you have no idea what's going on. So you're running into all these issues, and that wasn't the case when you were running in your single cloud environment.
You're wondering: what exactly is going wrong with my model, which usually works well in the other places I've deployed it? What MLExray essentially does is give you an API that you can use to instrument your models. At the top you see an example of the Python API, and all you really need to do is tell the MLExray library to start on inference start, run your interpreter, and then mark on inference end.
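To make that concrete, here is a minimal sketch of what that kind of instrumentation could look like in Python. The mlxray module, the MLMonitor class, its constructor, and the on_inference_start/on_inference_end hooks are assumptions made for illustration, not MLExray's confirmed API; run_inference stands in for whatever model call you already have.

```python
import numpy as np
from mlxray import MLMonitor  # assumed import; real module/class names may differ

def run_inference(x):
    # stand-in for your real model call, e.g. a TFLite interpreter invocation
    return x.mean(axis=(1, 2, 3))

monitor = MLMonitor(log_path="/tmp/mlxray_log.txt")   # assumed constructor

input_tensor = np.random.rand(1, 224, 224, 3).astype(np.float32)
monitor.on_inference_start()                          # assumed hook: begin timing/logging
output = run_inference(input_tensor)
monitor.on_inference_end(input_tensor, output)        # assumed hook: record input, output, latency
```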
Some of the information that it collects is the original input of the model, the output of the model (you know, is the result correct?), and also, per layer, the input and output. It collects the end-to-end latency, so you actually know how long the whole inference took, and also, within the layers themselves, how long those individual layers took, plus things like memory, which, especially as you're moving to an edge device with lower memory and compute resources, is something you might want to hone in on. And then, in the case of the Android example, the Android API also collects other information, such as peripheral sensor information, like the orientation of the phone and the lighting detected in the room, which just helps provide more context around the model that is being run.
So now you have all this data coming in: you've instrumented your model, and all this data is coming out as you're running it, but you don't really know what to do with this information. It's like, okay, cool, this layer takes this many milliseconds and this one takes this many milliseconds; how do I actually use this information to figure out what is going wrong with the model that I've deployed on this edge device? The idea behind MLExray is that there's a set of reference pipelines.
The reference pipeline gives you the logs: this is how long each layer took, this is the approximate output and input of each layer. Then you run the same thing on your development pipeline, which gives you the same information, and then you basically do a diff between those to create a debug report, to help you figure out: okay, this is what's going on with my system, this is what's different when I've deployed to this environment versus the other one.
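As a rough illustration of that diffing step (not MLExray's actual report format), suppose each pipeline logged one record per layer with a name, a latency in milliseconds, and an output tensor; the comparison could then look something like this:

```python
import numpy as np

# One record per layer, as both pipelines might log them (assumed structure).
reference = [
    {"layer": "conv2d_1", "latency_ms": 2.1, "output": np.array([0.10, 0.90])},
    {"layer": "conv2d_2", "latency_ms": 3.0, "output": np.array([0.40, 0.60])},
]
edge = [
    {"layer": "conv2d_1", "latency_ms": 2.3, "output": np.array([0.11, 0.89])},
    {"layer": "conv2d_2", "latency_ms": 41.7, "output": np.array([0.75, 0.25])},
]

def debug_report(ref_layers, edge_layers, mse_threshold=0.01, slowdown=2.0):
    """Flag layers whose output or latency diverges from the reference run."""
    report = []
    for ref, dev in zip(ref_layers, edge_layers):
        mse = float(np.mean((ref["output"] - dev["output"]) ** 2))
        ratio = dev["latency_ms"] / ref["latency_ms"]
        if mse > mse_threshold:
            report.append(f"{ref['layer']}: output diverges (MSE={mse:.4f})")
        if ratio > slowdown:
            report.append(f"{ref['layer']}: {ratio:.1f}x slower on the edge device")
    return report

print("\n".join(debug_report(reference, edge)))
```

In this toy example the second convolution layer would be flagged for both a diverging output and a large slowdown.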
You're looking at the output and asking: is there a layer where the output is very, very different from the output received from my reference pipeline? And in the case where things are slow, you want to compare latency: well, this layer took a lot longer than the same layer in my development pipeline. You run through that, and it helps you hone in on which layer is having problems. And then finally, like I mentioned before, there are some assertion checks.
You can specify custom assertions in your code that check that inputs and outputs are what you expect. Let's say you have the self-driving case that I mentioned before, and you know that when you're running your camera, whenever you make an inference, the width of the street should always be the same. Then this assertion check would be: check that the width in the input of the model is always five feet or something, or check whatever is detected at the end.
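A custom assertion along those lines can be as simple as the following sketch; the five-foot lane width comes from the example above, while estimate_lane_width and the way the check gets wired into the pipeline are placeholders for your own code.

```python
LANE_WIDTH_FEET = 5.0    # expected lane width from the talk's example
TOLERANCE_FEET = 0.1

def estimate_lane_width(model_input):
    # stand-in: derive the lane width from the input frame however your pipeline does
    return 5.02

def assert_lane_width(model_input):
    """Fail loudly if the measured lane width drifts from what the model expects."""
    width = estimate_lane_width(model_input)
    assert abs(width - LANE_WIDTH_FEET) <= TOLERANCE_FEET, (
        f"lane width {width:.2f} ft differs from expected {LANE_WIDTH_FEET} ft; "
        "check camera calibration or the pre-processing step"
    )

assert_lane_width(model_input=None)  # passes with the stub value above
```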
So what kinds of issues can this pipeline actually help you debug? There are three that I'm going to step into in a little more detail. The first one is pre-processing errors, the next one is quantization inaccuracies, and then there are kernel optimization differences amongst heterogeneous environments; that's the case I mentioned before, where you have a bunch of different hardware and just completely different environments that your models are running on.
So the first is pre-processing errors, and I think even in a case where you're not deploying to an edge device you're going to run into this: you have something collecting information that you're using to structure the input to your model, and that's going to be different from whatever the model is expecting. This happens even more in the edge device case because, since these are all running on different environments and different hardware, your sensor might be picking up information in different ways. Or, you know, in the case where you have...
This goes back to the assertions that I mentioned before: essentially, whenever MLExray is running on your pipeline, it's going to go and run these assertions to make sure each check passes. So here in this example, this is using the Python API, and it's checking that your input is in RGB format, as expected.
If the input is accidentally coming in as BGR format, it's going to let you know: hey, your deployment pipeline is broken, you're going to need to go and add a pre-processing step to convert to RGB format. Just stepping through exactly what this code is doing: it's taking in the input from your deployment, which is called edge out, and the input from your reference pipeline, and asking, do these look the same? And if they do, okay, that's great.
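The slide's code isn't reproduced here, but a check in that spirit might compare the edge pipeline's input against the reference pipeline's input and flag a channel-order swap. The edge_out name follows the description above; everything else is an assumed, self-contained sketch:

```python
import numpy as np

def channels_match(edge_out, reference_in, tol=1e-3):
    """Return True if the edge input matches the reference input's channel order."""
    if np.allclose(edge_out, reference_in, atol=tol):
        return True
    # If reversing the channel axis makes them match, the image is likely BGR, not RGB.
    if np.allclose(edge_out[..., ::-1], reference_in, atol=tol):
        print("Input appears to be BGR; add a pre-processing step to convert to RGB.")
    return False

reference_in = np.random.rand(224, 224, 3).astype(np.float32)  # RGB reference input
edge_out = reference_in[..., ::-1]                             # simulated BGR input
print(channels_match(edge_out, reference_in))                  # False, with a BGR hint
```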
So you can see how your development pipeline is doing with respect to the reference one. Here we have two examples. The orange line is a model where we know all the weights and all the biases have been quantized and they work, and we compare that to the baseline, which is the perfect baseline model that has been trained in the cloud. We can see that the mean squared error, right there at the bottom, is pretty low, and that's doing great. And then you have this other model that you've trained and quantized, and you see that, comparing it to the baseline, the error is much higher, and so therefore I should go in and try to figure out what I need to do: do I need better training data to fix this, or what?
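That comparison boils down to a mean squared error between the quantized model's outputs and the full-precision baseline's outputs over the same inputs. A minimal sketch with stand-in model functions:

```python
import numpy as np

def float_baseline(x):
    # stand-in for the full-precision model trained in the cloud
    return x @ np.array([[0.25], [0.75]])

def quantized_edge(x):
    # stand-in for the quantized model deployed on the edge device
    return np.round(float_baseline(x) * 128) / 128  # crude simulated quantization

inputs = np.random.rand(100, 2)
mse = np.mean((float_baseline(inputs) - quantized_edge(inputs)) ** 2)
print(f"quantization MSE vs. baseline: {mse:.6f}")  # higher values suggest retraining or calibration
```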
And then the last one is very unique to edge compute, because now you're deploying to a bunch of different devices. These have different hardware requirements, and at the core of it, the kernels optimize different operations in different ways, and so this can lead to a huge latency or performance difference between devices.
You look at how long it takes to run each layer, and some of the results are pretty surprising. The quantized version of the pipeline that we used before is actually pretty slow in that second convolution step, and MLExray helps you figure out: okay, in this layer there's something wrong, and that's why it's slow, and maybe I need to deploy a special model to this particular hardware.
So I'm going to walk through a little bit of what using MLExray actually looks like. First, this is a nifty Colab that we have that just shows an example model using MLExray. The first thing you need to do is install the MLExray library, and then you want to go and create your model runner class. This is just using TensorFlow Lite, and the important thing to pick up on here is essentially this ML monitor object: you're initializing MLExray to go ahead and start logging information from each output layer, and the inputs and outputs, and then finally you go and invoke the model.
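For context, a bare-bones model runner in that shape might look like the following. The tf.lite.Interpreter calls are standard TensorFlow Lite APIs; the monitor hooks are assumed stand-ins for wherever MLExray plugs in, and the model path in the usage comment is just an example.

```python
import tensorflow as tf

class ModelRunner:
    """Minimal TFLite runner; the monitor hooks are assumed stand-ins for MLExray."""

    def __init__(self, model_path, monitor=None):
        self.interpreter = tf.lite.Interpreter(model_path=model_path)
        self.interpreter.allocate_tensors()
        self.input_details = self.interpreter.get_input_details()
        self.output_details = self.interpreter.get_output_details()
        self.monitor = monitor  # e.g. the MLExray monitor initialized earlier

    def run(self, image):
        if self.monitor:
            self.monitor.on_inference_start()             # assumed hook
        self.interpreter.set_tensor(self.input_details[0]["index"], image)
        self.interpreter.invoke()
        output = self.interpreter.get_tensor(self.output_details[0]["index"])
        if self.monitor:
            self.monitor.on_inference_end(image, output)  # assumed hook
        return output

# Usage: runner = ModelRunner("mobilenet_v2.tflite"); runner.run(preprocessed_image)
```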
The model essentially runs, and in the background MLExray has picked up a bunch of logs about how each layer is running, about the latency of each layer, all of that information. So what does that actually look like? Here's an example of an MLExray log, and there's a ton of information in here: you have the start time, you have the overall latency of how long your inference took, you have the memory usage, and you have, for each layer, all the outputs. I'm not going to keep scrolling.
This first function here goes and reads the logs in and parses them: you can see it's reading the logs, getting the keys and the values, and then, in the end, it can plot the results.
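If the log were, say, newline-delimited key/value pairs, that parse-and-plot flow could be approximated like this; the log format and key naming here are assumptions for illustration, not MLExray's actual schema or the Colab's code.

```python
import matplotlib.pyplot as plt

# Assumed log format: one "key=value" pair per line, e.g. "conv2d_1_latency_ms=2.3"
def parse_log(path):
    entries = {}
    with open(path) as f:
        for line in f:
            if "=" in line:
                key, value = line.strip().split("=", 1)
                entries[key] = float(value)
    return entries

def plot_layer_latencies(entries):
    layers = {k.replace("_latency_ms", ""): v
              for k, v in entries.items() if k.endswith("_latency_ms")}
    plt.bar(list(layers.keys()), list(layers.values()))
    plt.ylabel("latency (ms)")
    plt.title("Per-layer latency from MLExray log")
    plt.show()

plot_layer_latencies(parse_log("/tmp/mlxray_log.txt"))
```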
We used the code in the Colab to plot the results I showed earlier, back on the slide where I was comparing the differences between the output layers. So you can essentially get started with MLExray very quickly. Okay, and then jumping back to my slides... oops.
You can see that MLExray has some limitations. The first one is that you need code changes to go and enable instrumentation on your debug pipeline, and that can be annoying, because you might go deploy and then realize: oh, I forgot to add that line to go and invoke MLExray, and you have to go back in and do that. Generally, when we're doing observability, we like low-touch instrumentation. There's also a slight performance impact when you're using MLExray; obviously it's more noticeable on GPU. You're writing tons of things to logs, so that also has a memory impact, because you're storing all this data somewhere. And then, I think, we could kind of see towards the end that it was like:
Okay, I have all this data; now I need to use this Python API to go and parse it, and I can use that API to create a graph, but it kind of limits how you can actually go and visualize this information. What if you want to do more interesting things with it? Because it's not in some standard output format that you can stick into any tool you want, it's kind of hard to go and build more interesting visualizations with it.
So, how I got involved in MLExray: I worked on Pixie, as I mentioned before, and there were a lot of correlations between how we do things in Pixie that I thought could help the MLExray project. Just as a brief summary again, Pixie is an open source CNCF sandbox project for observability on Kubernetes, and there are three pillars that I think can help in the MLExray case.
The first is auto-telemetry. Pixie picks up information using tools like eBPF, without you having to go and instrument things in your application, so it just automatically starts collecting information as soon as it's deployed. That really helps in the MLExray case, where right now you have to go and add that line to say: I want to invoke MLExray and start seeing information.
This also helps in the case where you don't want this thing running on your pipeline all the time. Maybe you want it when you're debugging, but in the future, when you know it's running well, you don't want it anymore, so you're going to have to go and take that line that invokes MLExray out of your code.
We're actually going to go into this in more detail tomorrow at Kubernetes on Edge Day, so if you'd like to come by and learn some more, it would be great to see you all again. But here are some resources for MLExray. First, of course, all of this is open source: MLExray is open source, Pixie is open source. Check out the repo, check out the code, try running stuff yourself.
Audience member: Hi, thank you for the presentation, really great work. I have a question: why was the decision made to use logs to diff the layer outputs between the cloud and the edge model? For example, why not probe the actual layers, because I'm assuming you own both the edge model and the cloud model, right? Logs can run into issues, for example with formatting, and also with being really large: if your model is large, you're going to be storing large text files, and also the parsing is pretty expensive and can be error-prone.
Audience member: So I understand the Pixie telemetry model in general for, like, service monitoring. I was curious about ML model performance, and whether those data might also be interesting to aggregate and look at in a place where people usually look at ML performance comparisons, like in Weights & Biases. Do you have a picture of where those data could somehow intersect, or how you could bring them together?
Michelle: Yeah, so I guess, relating to Pixie: we use eBPF, like I said, and you can use eBPF to hook onto certain uprobes, on certain user-defined functions, and that can collect a bunch of information. You can get the arguments of that function, you can get the outputs of that function, and you can send all that data to Pixie to visualize it. I hope that answers your question; I'm not sure I got it right, but we'll be talking more about it tomorrow.