Description
Panel: Building Kubernetes Operators for AI and ML Workloads.
Filmed October 28th, 2019 in San Francisco.
A
So we really want to thank everybody for the time and effort it takes to build an operator, and congratulate those of you who are about to get yours on OperatorHub. But if anybody out there in the audience is thinking about building an operator and needs some help, please see me during the break and I will make sure you get that help. Without further ado, over to you all.
B
Right, we have microphones, we're all good to go. Everyone has a chair except me, all right, so thank you. We'll just pass the microphones around. Yes, we'll do that. I think it's going to be a really fun discussion here. We'll talk about operators and, you know, the process that everyone has gone through with operators. So with that said, let's just kick things off with a round of introductions.
D
My name is Gopal Krishna and I'm the VP of engineering and customer support for CognitiveScale. We make our customers, primarily in financial services and health care, successful with AI with our product Cortex, which allows them to build solutions in a predictable way, as well as to make sure that those solutions are trusted. Earlier we heard about ethics; we embed a lot of concepts to make sure the systems are trusted and can go through some level of compliance. So that's our product offering.
E
Let's try this one, shall we? Hey, I'm Ryan, I'm a developer at Seldon. We make it easy for data scientists to deploy machine learning models to Kubernetes and serve predictions as HTTP requests, and we provide out-of-the-box observability features, so monitoring and tracing metrics, but also features like making it easy to define rollout strategies with traffic splitting, and to define inference graphs. So you can have reusable steps within the graph and do more complex kinds of deployments, like multi-armed bandits.
H
Hey everyone, my name is Pramod. I'm a senior product manager in our data center business unit at NVIDIA, so we make GPUs. I product-manage CUDA, which is our GPU computing platform, and we're building both hardware and software that enables AI, ML, and data analytics workloads, and many different types of use cases that will be accelerated using GPUs.
B
Thanks
everyone,
we
talked
a
lot
about
kubernetes.
We
talked
a
lot
about
operators,
kubernetes
is
kind
of
taking
the
world
by
storm
and
operators,
in
my
opinion,
kind
of
you
know
very
close
behind
that.
At
what
stage
of
your
your
your
companies,
each
one
of
you
have
different
products.
Some
of
you
may
have
started
off
more
and
they
the
kubernetes
or
container
space.
Some
of
you
may
be
traditional
applications,
and
what
stage
did
you
start
to
think
about?
G
Yes, so we'd obviously been tracking Kubernetes for some time, but about a year ago, I would say, we realized we needed it to scale Presto deployments to more and more different environments. You know, traditionally you'd do some on-prem and maybe AWS, but now you have to go to Azure, to Google Cloud, to different environments.
D
It's been a long journey for us; our company has been at it for five, six years now. Early on, in the genesis of the company, we were trying to leverage container technologies as well as the orchestration mechanisms. Several times during that journey we evaluated Kubernetes and it fell short for one reason or another; the timing was not right.
D
So last fall is when we made a concerted effort to embrace Kubernetes, and we worked with Microsoft, actually, to render our product on Azure, and that was a very positive experience for us. We quickly learned that the investment we made in Kubernetes was rewarding us by allowing us to go to the next set of cloud vendors, like EKS and Red Hat OpenShift, as well as GKE. So we are able to evolve and rapidly satisfy our customers' needs to go on-prem, or on their own cloud, or on our cloud to support them.
D
You know, in terms of organization and selection of technologies to move forward, we are somewhat held back: being a small company, we can't just be experimental about everything, so we are driven by our customer priorities. We are very closely monitoring operators and pushing our customers towards them, and we find them a very welcome addition to Kubernetes because of the promise they hold for maintenance and software upgrades, as well as for moving towards at least phase two, if not all the way to phase five, at this point. Sunny?
C
Our solution is built to help customers do cost and resource optimization, so it's quite natural for us to use the Operator Framework. When a customer asks how to install our solution to help them do cost optimization, we tell them it's an operator — we are actually a fully certified Level 5 operator on OpenShift — and then there are no further questions asked. Okay.
H
So we started working on an operator for GPU deployment, essentially, with Red Hat a couple of months ago. We've been building with Red Hat a GPU operator that basically helps us provision GPUs. GPUs are special resources within Kubernetes, so bringing up a GPU can be fairly complex, because we need to install user-mode drivers and kernel modules, and then we need to install device plugins to have these resources exposed to the Kubernetes control plane.
H
So setting up Kubernetes and deploying GPUs is fairly complex, and even once you deploy it, GPUs are throughput machines, which means you need to constantly feed the GPU with data so that it's operating at full throughput capacity. So the GPU operator that we've been building with Red Hat essentially automates the management and deployment of all of these GPU components, and the device plugin and monitoring stacks, automatically.
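To make that concrete: once the device plugin exposes GPUs to the control plane, workloads request them as an extended resource. Here is a minimal Go sketch using the standard Kubernetes API types — `nvidia.com/gpu` is the conventional resource name, and the image tag is purely illustrative:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A pod that asks the scheduler for one GPU. The device plugin
	// installed by the GPU operator is what makes "nvidia.com/gpu"
	// schedulable in the first place.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cuda-job"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "trainer",
				Image: "nvidia/cuda:10.1-base", // illustrative image tag
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						// Extended resources are requested via limits.
						"nvidia.com/gpu": resource.MustParse("1"),
					},
				},
			}},
		},
	}
	fmt.Printf("%+v\n", pod.Spec.Containers[0].Resources.Limits)
}
```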
H
So it's been pretty cool technology for us, because it lets users deploy these GPU worker nodes very easily. We've also been looking into using operators to do other things now. Specifically, MPI jobs can also be fairly complex, because you need to have all of these worker nodes participating together. We're also looking at using GPUDirect technology, which is our technology that allows GPUs to communicate with NICs and storage directly, without having to go through the CPU.
F
Okay, so basically, for us, because we work with customers, each customer has different needs and wants. So in terms of adoption of AI — if I understood your question correctly, how they used it before — we started last December with one of the biggest shopping center and supermarket chains in Australia, which wanted to have their training pipeline on Kubernetes. So we can see the trend from our customers: they like to have operators and to use Kubernetes in the AI space.
E
Certainly. We chose to use an operator because it allowed us to condense the serving specification in a very neat and expressive way. It's important that we can express it in a way that the data scientist can understand and control, so that the data scientist is empowered to put together that serving specification and run it without having to wait for the ops team to figure out how to get something up and running.
G
So in our case, I guess, we didn't do Ansible or Helm charts before; however, we had developed a custom framework to deploy Presto on a big cluster, and it was a lot of scripting, a lot of Python, running things in parallel. You have to maintain that code and deal with all the failure cases — you know, what if some machine doesn't update properly — and it's just really, really difficult to do all of this manually.
G
So then, when we decided to move to Kubernetes, most of those things are provided by the framework, right? You still have to instrument the framework and orchestrate all of this, but you don't have to deal with an individual machine in a cluster coming up or down, or being misconfigured, or all those different problems. So that was a huge simplification of how much code we have to develop and provide to our customers, versus what's guaranteed by the framework.
F
So the way I see it, Helm is for packaging the desired state, and then the operator is responsible for making it happen. In my opinion, they are two separate concerns: Helm is for packaging, and the operator is following that and making it happen, or just monitoring and making sure that the desired state is always there. Okay.
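That split — a declared desired state, and a loop that drives the actual state toward it — is the core of the operator pattern. A minimal, framework-free Go sketch of the idea (the `DesiredState`, `observe`, and `converge` names are illustrative, not from any real framework):

```go
package main

import (
	"fmt"
	"time"
)

// DesiredState is what the user declared (e.g. in a custom resource);
// ActualState is what is currently running in the cluster.
type DesiredState struct{ Replicas int }
type ActualState struct{ Replicas int }

// observe would normally query the Kubernetes API; stubbed here.
func observe() ActualState { return ActualState{Replicas: 2} }

// converge takes one step toward the desired state, e.g. by creating
// or deleting pods. Stubbed here to just report the difference.
func converge(want DesiredState, have ActualState) {
	if have.Replicas < want.Replicas {
		fmt.Println("scaling up:", want.Replicas-have.Replicas, "more replicas")
	} else if have.Replicas > want.Replicas {
		fmt.Println("scaling down:", have.Replicas-want.Replicas, "replicas")
	}
}

func main() {
	want := DesiredState{Replicas: 3} // the packaged, declared state
	// The operator's control loop: observe, compare, act, repeat.
	for i := 0; i < 3; i++ {
		converge(want, observe())
		time.Sleep(100 * time.Millisecond)
	}
}
```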
E
Yeah, so I guess in our particular case, because we have the option to create a kind of complex graph of steps — where you have pre-processing, then model prediction, and you might have some sort of post-processing step, or there might be more complex graphs than that — to capture that, it made sense to be able to do it in a custom resource and then have the operator fulfill it.
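As a rough sketch of what an inference graph captured in a custom resource can look like, here are illustrative Go spec types in the style Kubebuilder generates CRDs from; the type and field names are hypothetical, not Seldon's actual API:

```go
package v1alpha1

// InferenceGraphSpec is the user-declared part of a hypothetical
// inference-graph custom resource: a tree of processing steps that
// the operator turns into running services.
type InferenceGraphSpec struct {
	// Root of the graph, e.g. a pre-processing step whose children
	// include the model and a post-processing step.
	Graph GraphNode `json:"graph"`
}

// GraphNode is one step in the graph.
type GraphNode struct {
	Name     string      `json:"name"`
	Image    string      `json:"image"`              // container serving this step
	Type     string      `json:"type"`               // e.g. "transformer" or "model"
	Children []GraphNode `json:"children,omitempty"` // downstream steps
}

// InferenceGraphStatus is what the operator reports back.
type InferenceGraphStatus struct {
	Ready bool   `json:"ready"`
	URL   string `json:"url,omitempty"` // endpoint for HTTP predictions
}
```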
D
One of the things we are finding with large customers is that, however much we want them to just absorb changes and put them into production right away, that's not happening. There is always change management and windows for maintenance — all of them come into play — and we operate with a very low tolerance for anything going wrong. And specifically in the AI/ML world, you've got to change the algorithm probably more frequently, as it learns and updates your model.
D
So you may be deploying new versions of the model every couple of weeks, and sometimes maybe every week, depending on the sensitivity of the problem. When you try to do that without any formalism and simplification like the Operator Framework, it's tough, and you do a lot of, you know, dotting the i's and crossing the t's in order to get through the change management windows. This, we believe, will minimize the risk.
B
That reminds me: I just recently went to a doctor's office and they're still on Windows XP. You get this change management situation where there's not enough value to move off of it, but in the meantime your company has to support the older version.
B
Especially with the rapid iteration of these changes, there are things coming out all the time. And I guess another added benefit is that if you do have a platform like OpenShift and the customer has many different environments, they can upgrade all of those environments, maybe in a single pass. Sunny, maybe you want to go into what that means for your customers as well?
C
Yes, because, as with any machine learning model, you have to keep updating it, sometimes as frequently as on a weekly basis. So once you have these operators in, it will be much easier for customers to understand what it means to get an update, because a customer doesn't want a lot of updates too frequently, and operators make it so much easier.
B
We
talked
a
great
deal
about
the
benefits
of
operators.
Maybe
someone
wants
to
volunteer
and
talk
about
the
challenges
you
faced.
Building
an
operator,
it's
an
I.
Imagine
just
like
any
new
technology.
There's
gonna
be
some
some
herbs
and
flows
and
ups
and
downs.
So
does
anyone
want
to
take
a
first
stab
at
some
of
the
the
complexities
they've
run
into.
E
Actually, I was going to pick up on a point that you made earlier about the variety of tools that are out there. It's a space that's moving so quickly — actually, not so long ago Seldon's operator was written in Java, and we realized the Java support really wasn't keeping up with the changes and had to switch to the Kubebuilder stuff.
F
So — I could talk about challenges for an hour; it's a different talk. But one of the high-level challenges is the build and test pipeline. So imagine that our operator basically creates resources on Azure. Now we are open source, and you want to run an integration test on the PRs.
F
So you want to make sure that it's secure, so that someone can't just provision a cluster of ten nodes through the pipeline, because our build pipeline is part of our GitHub repos. So one challenge is around that. Another challenge is the versioning of the CRDs and the specs that you are saving, between different versions of the operator — if you change the mapping or the model that you're saving (I mean your data model, not your machine learning model), you still have to handle it.
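To make the versioning problem concrete, here is an illustration with hypothetical v1alpha1/v1beta1 spec shapes; a change to the saved data model typically needs a conversion step (in practice hung off a CRD conversion webhook) so specs stored under the old schema still load:

```go
package main

import "fmt"

// Hypothetical old and new shapes of the same saved spec. In v1alpha1
// the model URI and its format lived in one string; v1beta1 splits them.
type SpecV1Alpha1 struct {
	ModelURI string // e.g. "s3://bucket/model:tensorflow"
}

type SpecV1Beta1 struct {
	ModelURI string
	Format   string
}

// convert upgrades a stored v1alpha1 spec to v1beta1 by splitting at
// the last colon. Real operators hang logic like this off CRD
// conversion webhooks so both versions stay servable.
func convert(old SpecV1Alpha1) SpecV1Beta1 {
	uri, format := old.ModelURI, "unknown"
	for i := len(uri) - 1; i >= 0; i-- {
		if uri[i] == ':' {
			uri, format = uri[:i], uri[i+1:]
			break
		}
	}
	return SpecV1Beta1{ModelURI: uri, Format: format}
}

func main() {
	fmt.Printf("%+v\n", convert(SpecV1Alpha1{ModelURI: "s3://bucket/model:tensorflow"}))
}
```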
F
We used to be there, but then we looked into the Operator Framework, and it was quite similar. We have a team of SMEs that we really chatted with before we picked, and we chose the framework based on the skill set of our customers and our team. We picked Kubebuilder because we found it easier to use, but functionality-wise it's exactly the same.
B
And the great thing is there are tools out there. We also have resources in the OpenShift community to help out with getting operators into OperatorHub and, you know, with using the Operator SDK and the framework. So that's great. What about those of you — I think some of the panel members have certified operators — does anyone want to speak about the process? First off, why did you want to go the route of having a certified operator, and if you're currently going through the process, what is it like?
E
I'm
happy
to
say
something
about
that:
yeah,
so
I
think
it
for
us.
It
was
it's
just
it's
really
cool
that
people
thought
you
were
using.
Openshift
can
just
click
the
button
and
they
get
the
operator
installed
it.
It's
like
it's
very
obvious,
win
for
us
to
make
it
the
install
that
easy,
but
also
that
gives
a
lot
of
confidence
to
the
customers
that
are
running
on
openshift,
that
they
know.
G
I would second that. I think the best experience was the help from the team, you know, in terms of testing, troubleshooting any issues, and providing infrastructure to certify our Kubernetes integration on OpenShift. That was the great part. I think the process is still fairly new, so it has some rough edges, but with that extra help it was a really smooth experience. And yeah, there are definitely some notes I have to share for improving the process for the future generation of certifications, but I think it's really good.
G
I
mean
we're
actually
successfully
there
on
open
operator
hub
right
now
and
that's
really
important
for
some
of
our
customers.
Obviously
right
if
you
go
to
a
bank
and
break
Enterprise
and
the
fact
that
it's
certified
and
open
ship,
the
password
we
are
actually
using
for
kubernetes,
is
it's
a
win.
You
know
immediately
sunny.
C
I want to reiterate what the other members said: first of all, confidence for the customers, and the certification process was made so much easier because of the support from Red Hat. We got it done, I think, in a very, very short time period. Of course my team has done it, but they told me Red Hat has been a great partner.
H
I have a different perspective: we haven't been certified yet. I mean, it is kind of encouraging that, you know, the certification process is in place and Red Hat plays a very supportive role for us. I think having a certified operator is very important as we, you know, scale with Red Hat into enterprises, because we have a lot of components that are managed by the operators. So having a certified operator will give customers assurance that, you know, there's support behind it.
B
Those
not
familiar
with
the
certification
process
and
what
it
means.
The
great
thing
about
certification
process
is.
It
gets
the
the
all
of
the
components
of
the
operator
on
the
same
type
of
operating
system.
In
this
case
we
use
ubi,
which
is
you
know
our
way
red
hats
way
of
saying
this
is
a
nice
certified
operator
in
terms
of
wheels
support
it,
but
then
also
you
get
the
support
from
the
partners
as
well.
Then
you
have
that
dual
level
of
support
both
from
red
head
and
the
partners
alright.
E
Well, I suppose I should say we do something similar. We have a way to enable sort of data science functionality in the custom resource. So you can, for example, turn on an outlier detector component, and that will then run alongside your model, and then, if a particular data point is marked as an outlier, you know that's potentially a case where you might get a lower-quality prediction for those particular records.
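Continuing the hypothetical spec sketch from earlier (repeated here in self-contained form), toggling a component like an outlier detector is usually just an optional field on the custom resource that the operator acts on when present:

```go
package v1alpha1

// GraphNode is as in the earlier sketch, trimmed to one field here.
type GraphNode struct {
	Name string `json:"name"`
}

// OutlierDetectorSpec is a hypothetical optional component: when the
// field below is set, the operator runs the detector alongside the
// model and flags predictions whose inputs look anomalous.
type OutlierDetectorSpec struct {
	Enabled bool `json:"enabled"`
}

type InferenceGraphSpec struct {
	Graph           GraphNode            `json:"graph"`
	OutlierDetector *OutlierDetectorSpec `json:"outlierDetector,omitempty"`
}
```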
D
A couple of dimensions to this one. I think, when I heard your question — you're talking about how, in the operator framework, you are doing a lot of data collection and a lot of performance telemetry, etc., so you can use that. Definitely: where there's data, there's always opportunity to learn and make it easier for people to operationalize some of their decisions. At CognitiveScale we have just released a product called Certifai — with "AI" at the end instead of a "Y."
D
So our intention is to migrate towards chaining that to the operator framework, so that when models change — either during change management, or while it's in the process of change management, or while it's functioning — there is a way to measure its dimensions, report the outliers, and be able to initiate, you know, remediation activities. That's what we are hoping; we'll get there as soon as we can.
D
For us, it's more about aligning with our customers — large banks and healthcare companies — and making sure we are going hand in glove with Red Hat, Kubernetes, and OpenShift, and whatever they ask us to do to be, you know, credible as well as operable in their environment. So that's our next goal.
E
Another
thing
we're
working
on
more
closely
in
the
operator
space
actually
is
that
a
collaboration
with
the
coop
flow
project
to
do
serv
model
serving
to
serve
HTTP
predictions,
but
in
a
server
'lest
way
based
on
key
native.
So
you
can
scale
to
zero
and
make
more
use
of
the
underlying
infrastructure
resources.
G
So I think, after the successful release of the initial version of the operator, we want to expand and take more advantage of the native Presto features that we are now developing in conjunction with the Operator Framework capabilities. For example, in a Presto environment you have multiple nodes running, and sometimes you have spikes in load, sometimes it's slower, and there are many different reasons, potential causes, and approaches to improving the performance of the cluster. Sometimes you just have to add more nodes for some time, and that may be based on CPU, which we support today, but sometimes you simply need more memory — this query requires more memory, and otherwise we won't be able to finish successfully in a short time. So we want to add that capability and tap into those values.
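As a sketch of the kind of decision logic this implies — with made-up metric fields and thresholds, not the actual Presto operator's behavior — scaling on memory pressure as well as CPU might look like:

```go
package main

import "fmt"

// ClusterMetrics holds signals an operator might scrape from the
// cluster; the fields and thresholds here are illustrative only.
type ClusterMetrics struct {
	CPUUtilization    float64 // 0.0 - 1.0, averaged across workers
	MemoryUtilization float64 // 0.0 - 1.0, averaged across workers
	Workers           int
}

// desiredWorkers scales up when either resource is under pressure,
// so a memory-hungry query isn't starved just because CPU looks fine.
func desiredWorkers(m ClusterMetrics) int {
	const high = 0.85
	if m.CPUUtilization > high || m.MemoryUtilization > high {
		return m.Workers + 1 // the reconcile loop adds a worker pod
	}
	return m.Workers
}

func main() {
	m := ClusterMetrics{CPUUtilization: 0.40, MemoryUtilization: 0.92, Workers: 4}
	fmt.Println("desired workers:", desiredWorkers(m)) // 5: memory-bound
}
```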
H
So
for
our
from
from
invidious
perspective
there
there
are
two
things
that
I
can
think
of
one
is
we
operate
a
fairly
large
kubernetes
cluster
for
I
mean
internally
to
serve.
You
know
the
companies
compute
needs
for
their.
We
offer
I
know
our
users
and
MPI
operator
just
to
manage
their
MPI
batch
workloads
for
for
for
training,
and
you
know
quickly.
We
are
trying
to
use
kubernetes
to
do
multi,
node,
deep
learning
training
because,
as
the
models
get
supremely
complex,
you
have
billions
of
parameters
to
train
models.
H
So
we
are
looking
to
use
communities
to
do
multi,
node,
training
and
I'm.
Pretty
sure
that
you
know,
there's
going
to
be
more
operator
work
there
as
we
try
to
get
all
these
nodes
cooperating
together
to
do
deep
learning
training,
then,
on
the
other
side
of
the
spectrum,
we're
also
working
on
edge,
you
know
edge
use
cases.
So
there's
going
to
be
a
lot
of
you
know
tiny
little
gpu-accelerated
devices
and
I'm
trying
to
do
inferencing
at
the
edge
related
to
video,
analytics,
smart
cities
and
so
on.
B
Fantastic. Although each one of you has vastly different products, the nice thing is there's a single thread of making things easier and enabling customers, and I think that's really, when you think about OpenShift and the foundation of OpenShift and why it's so popular, you know, at the core of it. And I think operators are following right along there, just making it much easier for customers. So I want to thank you all for participating.