From YouTube: The MLOps Roadmap - Terry Cox, Boostrap Ltd.

Description: This talk is an introduction to the CDF MLOps Roadmap.
I want to start by actually asking the question: what is MLOps? Because it's clear from the work we've been doing today that there is a lot of confusion about what that means and what the methodology really encompasses.
For many people, MLOps is focused on the idea of taking machine learning assets and putting them into production environments, but we would argue that the real challenge is in allowing people to manage machine learning assets as part of a wider product or solution, which is the thing that they really care about in terms of controlling the lifecycle and the release process.
So we clearly have a significant problem with our general ability to convert our machine learning experiments into practical products.
Now, we are also operating in an intrinsically high-risk environment. Machine learning and AI are almost always working with large, aggregated, potentially sensitive data sets, so we are always having to contend with scenarios where the challenges of the data involved are significant and the penalties for failure are potentially very high. In fact, it's clear that we need to anticipate very high levels of regulation being introduced into this environment.
If you step back and take the bigger view, it is clear that Jupyter notebooks are not production-ready assets. They're very easy to write and modify off the cuff, but they're also very difficult to manage effectively and to put a good governance process around, and extremely challenging to scale and secure, along with all of the other non-functional aspects that we need to manage when we're dealing with technology in production environments.
So the roadmap came about as an attempt to capture in one place all the requirements for what MLOps and MLOps tooling need to be able to do in the future: to get us past this phase, where we're still in the early infancy of a new discipline, and to move us into a situation where we can actually work efficiently and safely with these types of assets.
The document has three primary sections. The challenges chapter takes us through each individual, fundamental problem or issue in the machine learning space and tries to spell out what those challenges look like in terms of the impact that they have on a team or an organization.
So, for example, a typical challenge within the roadmap would be the idea that we need to be quite platform-agnostic in the way that we work. Historically, machine learning has been very much a Python-based activity, with lots and lots of tooling and libraries developed within Python, but that brings with it certain challenges.
For example, because Python is an interpreted scripting language, the source code of a Python system is always available in a production environment, and if you have access to that production environment, you are effectively able to modify the source code and change the behavior of the running system. Clearly that can be a very high risk from a security perspective: it would be very easy to inject malicious code into Python environments.
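As a rough illustration of one way to reduce that risk, a release process could record a digest of every deployed Python file and check it later to detect tampering. This is only a sketch under stated assumptions; the `hash_tree` and `verify` helpers are hypothetical, not part of the roadmap or any particular tool:

```python
import hashlib
from pathlib import Path


def hash_tree(root: str) -> dict:
    """Compute a SHA-256 digest for every .py file under root."""
    digests = {}
    for path in sorted(Path(root).rglob("*.py")):
        digests[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def verify(manifest: dict, current: dict) -> list:
    """Return the files whose contents no longer match the release manifest."""
    return [f for f in manifest if manifest.get(f) != current.get(f)]
```

At release time you would store the output of `hash_tree` alongside the artifact, and re-run `verify` in production to spot modified source files.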
The other challenge in this space is that we are looking at a situation where you need to be able to build your code in order to train your systems.
Now, the reality is that the technology that is most convenient for writing training scripts is often different from the technology that's most efficient for running the training, and again the technology which is optimal for operating those models may be different again. So we fully expect that customers need to be able to easily define their models so that they can train them, which may imply the use of, say, GPU resources in a cloud environment to give you short-term access to a large amount of compute resource. But then the model you've trained is quite likely to be some sort of decision-making system that needs to operate in near real time in human-facing environments.
So then you have the challenge that the deployment of your model needs to be able to operate in the real world, possibly on an edge device, and so the inferencing that you're trying to do needs to be able to operate at very low latencies, close to the source of the data that it's collecting.
Clearly, a significant challenge exists in how we provide an overarching CI/CD process, if you like, which allows you to define models using one level of abstraction, train them using a different level of technology, and then translate those models so they can be deployed onto a third level of technology.
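To make those three levels concrete, here is a toy sketch of such a pipeline in Python. The stage names, the dict-based "model", and the stand-in training arithmetic are all illustrative assumptions, not part of any real MLOps toolchain:

```python
def define_model(spec: dict) -> dict:
    """Level 1: a declarative, framework-neutral model definition."""
    return {"spec": spec, "status": "defined"}


def train(model: dict, data: list) -> dict:
    """Level 2: 'training' on whatever backend is cheapest (stand-in arithmetic)."""
    return dict(model, weights=sum(data) / len(data), status="trained")


def export_for_target(model: dict, target: str) -> dict:
    """Level 3: translate the trained model for the deployment runtime."""
    return {"target": target, "artifact": model["weights"], "status": "deployed"}


# One pass through all three levels of abstraction:
pipeline = export_for_target(
    train(define_model({"layers": 2}), [1.0, 2.0, 3.0]), "edge"
)
```

The point of the sketch is the separation of concerns: each stage only consumes the previous stage's output, so each can run on a different technology.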
Clearly, this is a really big challenge; it's not something that we're going to fix overnight, and it's not something that we can do on our own. So how do you find the roadmap, and how do you contribute to it? Well, the roadmap document itself is publicly available, and you can see it here.
There are many ways in which you can contribute to the work that we're doing. We're more than happy to accept pull requests on the document itself, so if you feel there's a challenge that's missing or a technology requirement that we should detail, then you're welcome to edit the draft version of the document, and we'll review and incorporate any feedback. You're also more than welcome to join the MLOps SIG; we have regular meetings where we discuss the progress of the roadmap, and we also have a mailing list and a Slack group where you can contribute and contact the various members of the team.
Right, I hope that was helpful. I'd be really interested to know who's involved in doing MLOps-related activities at the moment and who's planning to do so shortly, so feel free to let me know in the chat if you're already on this journey and if you've got any specific questions that I can help with.
So the plan really is to try and evolve this incrementally over time. We've done quite a bit of work this year to put the first draft together, and we hope that we've got a starter for ten in terms of highlighting what a lot of the common problems are, and some of the bigger challenges.
Clearly, what we really need to do here is to extend DevOps so that it takes into account the needs of machine learning and AI projects. So really, the challenge here, James, is to make sure that the focus is always on reducing lead time. What the practice is doing is setting you up to minimize the time it takes you to get from a decision about a feature to implementing that feature in a production environment.
There are some significant challenges in some areas, and clearly one of the big differences is that, when you're dealing with MLOps, you have to manage your data sets as well as managing your code base.
So there are some significant gaps in our current tooling and capability in terms of being able to treat a large set of data as a coherent asset that we need to manage over time. To give you a feel for that: a large data set in machine learning terms probably starts at about 10 terabytes of data and may go up to tens of petabytes.
So it's not a case of being able to take a snapshot of something and just throw it around in the data center. You're talking about very large amounts of data that take considerable amounts of time even to move from place to place in some circumstances.
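One common way to treat data at that scale as a versionable asset, without ever copying the bytes, is to version a small manifest of content digests instead of the data itself (tools such as DVC take broadly this approach). A minimal sketch, with hypothetical helper names:

```python
import hashlib
import json


def manifest_entry(name: str, size_bytes: int, digest: str) -> dict:
    """Record a file by name, size, and content digest, not by copying it."""
    return {"name": name, "bytes": size_bytes, "sha256": digest}


def dataset_version(entries: list) -> str:
    """A dataset version is the digest of its sorted manifest.

    The manifest is canonicalized (sorted entries, sorted keys) so the same
    files always yield the same version, regardless of listing order.
    """
    canonical = json.dumps(sorted(entries, key=lambda e: e["name"]), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```

The manifest is a few kilobytes even for a petabyte-scale data set, so it can live in ordinary source control while the data stays put.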
These systems need to be surrounding you in your smart cities, and in many cases they need to operate at very low latency. You need the system to react to what's going on and respond in milliseconds, because you may be dealing with a safety-related system like an automated braking system on a vehicle, or a network of systems that's interpreting traffic and trying to optimize the operation of a set of traffic lights.
So one of the challenges for MLOps as a process is that you need to gather a large amount of training data from the edge, and then you need to move that somewhere where you can process it to actually train a reliable model.
You may well need to take your model and convert it into a hardware description language like Verilog, and then take that and use it to make a physical product, such as an FPGA, which you can then embed in another piece of technology. You then have a high-speed model that can inference very quickly but is encapsulated in a piece of hardware.
So what are people engaging with at the moment? What problems have people come across today? Serverless? Yeah, that's an interesting question.
An individual GPU card might have, you know, 40 gig of RAM available, so to be able to train a model in a reasonable timescale, say less than a week, you typically need to have a lot of GPUs and then a very high-speed network, so that you can continuously stream the data in 40-gig chunks through the network, performing a lot of calculations on it and then synthesizing a model on the back of that.
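That chunked streaming can be sketched as a simple generator. The 40 GB chunk size comes from the talk; everything else here is an illustrative assumption:

```python
def stream_chunks(dataset_size_gb: float, chunk_gb: float = 40.0):
    """Yield (offset_gb, size_gb) pieces sized to fit one GPU card's memory."""
    offset = 0.0
    while offset < dataset_size_gb:
        size = min(chunk_gb, dataset_size_gb - offset)
        yield (offset, size)
        offset += size
```

In a real training loop each chunk would be fetched over the high-speed network and fed to a GPU while the next chunk is already in flight.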
Serverless is an abstraction that we can use at one level. From a data scientist's perspective, you can say, yes, this looks like it's serverless, in that I don't have to think too much about the infrastructure; I just tell it what I want it to do. But under the hood, the MLOps implementation actually needs to physically allocate GPU hardware to nodes, and then individual containers have to mount individual GPU instances, and we have to help to distribute the workload across those.
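A toy sketch of that hidden step: behind the serverless facade, something still has to bind each piece of work to a concrete GPU. The round-robin allocator and the job and GPU names below are purely illustrative, not how any particular scheduler works:

```python
def allocate(jobs: list, gpus: list) -> dict:
    """Bind each job to a concrete GPU instance, round-robin.

    The serverless abstraction hides this mapping from the data scientist,
    but it still has to happen somewhere in the platform.
    """
    if not gpus:
        raise RuntimeError("no GPU hardware available to back the abstraction")
    return {job: gpus[i % len(gpus)] for i, job in enumerate(jobs)}
```

Real platforms layer much more on top (bin-packing by memory, affinity, preemption), but the point stands: the abstraction leaks as soon as the physical hardware runs out.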
So it's certainly a domain in which the tooling itself needs to be much more aware of the hardware and the physical constraints than with conventional CI/CD environments.
I know we've had some interesting problems just managing the governance processes around these sorts of issues. I think, probably, if anyone's got any further questions, it might be best to take them to the MLOps Birds of a Feather event that's coming up next, and then we can start to get our heads together and dig into some of the detail of this stuff. But if there are no further questions here, then thanks, everyone, for your time and attention, and yeah, please reach out, get involved with the roadmap, and I look forward to collaborating with you in the future.