Sponsored Keynote: Challenges and Opportunities in Making AI Easy and Efficient with Kubernetes - Maulin Patel, Google
The third reason is productivity. Kubernetes makes data scientists and AI practitioners more productive by freeing them from having to manage their own workstations or servers. It lets them focus on their business-critical mission, which is to build and train models, without having to worry about the underlying infrastructure and compatibility issues.
This assumption does not suit many distributed computing frameworks well. The majority of distributed computing frameworks, especially those used for AI/ML, are very sensitive to disruptions. They are intolerant of disruptions such as preemptions, failures, or maintenance events, and the problem becomes really acute when you do very large-scale training with thousands of nodes.
What this community needs is framework-agnostic elastic training, so our goal should be to support any framework without any code changes. This gives two main benefits. First, with elastic training you can run your training on spot VMs, which are a lot cheaper than on-demand VMs, so it saves a lot of cost. It also addresses another problem, which is obtainability.
As most of you know, GPUs are a scarce resource, and spot VMs are also a scarce resource. It is very hard to find, say, thousands of GPUs up front to start your training. With elastic training support, you can start training with however many spot GPUs are available to you, scale it up when more GPUs become available, and scale it down when you lose them. That also addresses the obtainability challenge.
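One way to picture the elastic behavior described here is a re-sharding step that re-divides the training data whenever the set of available spot GPUs changes. The following is a minimal illustrative sketch in Python; `shard_indices` is a hypothetical helper, not part of Kubernetes or of any real training framework.

```python
def shard_indices(num_samples: int, world_size: int) -> list[list[int]]:
    """Split sample indices as evenly as possible across world_size workers."""
    if world_size < 1:
        raise ValueError("need at least one worker")
    shards = [[] for _ in range(world_size)]
    for i in range(num_samples):
        shards[i % world_size].append(i)  # round-robin keeps shards balanced
    return shards

# Start with however many spot GPUs are obtainable at launch...
shards = shard_indices(1000, 2)
# ...scale up when more GPUs become available...
shards = shard_indices(1000, 8)
# ...and scale down when some are preempted, without losing the job.
shards = shard_indices(1000, 5)
```

A real elastic system would also rebalance optimizer state and re-form the collective communication group on each resize; the sketch only shows the data-partitioning idea.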
Another opportunity for this community is to enable native support for checkpoint migration and restoration in Kubernetes. The way it can work is that whenever the underlying infrastructure surfaces an impending maintenance event, or a preemption is coming, Kubernetes can transparently and gracefully take a snapshot or checkpoint and store it. This makes the system work-conserving: with current checkpointing mechanisms you lose the work done since the last epoch, but with transparent, on-demand checkpointing, all the training work that has happened is conserved.
So in my opinion, it is very hard to get this kind of information from existing Kubernetes primitives like Prometheus metrics or logging. To give a concrete example: in typical Kubernetes observability, you rarely have to deal with events that are months or even years apart. In the AI world, on the other hand, this is a very common occurrence. Let me give you some examples. Say you have a model that predicts customer churn.
A customer is acquired now, and at some point in the future that customer may churn. To study the accuracy of this model, you have to combine these two events, which may be spaced months or even years apart. Here is another example: say you have a model that predicts loan defaults.
A loan is issued now, and a default may happen sometime in the future. To understand the accuracy of this model, you again have to combine these events. So to compute accuracy, precision, recall, and many other metrics, the observability solution needs to join disparate events that may be spaced far apart.
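The join described here can be sketched in a few lines: predictions logged at acquisition time are matched with outcome events that arrive much later, and precision and recall are computed over the joined pairs. The data shapes and names below are assumptions for illustration, not from any real observability system.

```python
def score(predictions: dict[str, bool], outcomes: dict[str, bool]) -> tuple[float, float]:
    """Join per-customer predictions with later outcomes; return (precision, recall)."""
    tp = fp = fn = 0
    for customer, predicted in predictions.items():
        actual = outcomes.get(customer, False)  # the outcome event may arrive years later
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Churn predictions logged when each customer was acquired...
preds = {"alice": True, "bob": False, "carol": True}
# ...joined with churn outcomes observed long afterward.
outs = {"alice": True, "bob": True, "carol": False}
precision, recall = score(preds, outs)  # → (0.5, 0.5)
```

The hard part in practice is not the arithmetic but retaining and correlating events across such long time spans, which is exactly what typical Kubernetes observability stacks are not built for.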
This is one example of where we as a community have an opportunity to extend the existing observability solutions for Kubernetes to make them suitable for the AI/ML community.