Description
Data science, machine learning (ML), and artificial intelligence have exploded in popularity in the last few years, with companies building out dedicated ML teams. Kubeflow is an open-source ML toolkit for Kubernetes that provides useful components for solving problems in multiple areas. In this hands-on demo we will discuss how you can build a scalable architecture for ML training and inference at scale, using distributed storage as a backend (Amazon EFS) and Kubeflow (on Amazon EKS).
Hello everyone, welcome to this session on machine learning at scale using Kubeflow with Amazon EKS and Amazon EFS. My name is Suman Debnath, I'm a developer advocate on the EFS product team, and I'm super excited to share a few insights about how you can run your machine learning workflow using an open-source tool called Kubeflow.
So what are we going to do in the next 30 minutes or so? We are going to talk about why machine learning on containers, because that might sound a little odd to many of you if you are new to the container space. We will then dive a little bit into Kubeflow, and then we will jump straight into the demo, which is more interesting than boring slides. So first things first: why machine learning with containers?
If you think about it, it's not only about machine learning: we get the flexibility and all the benefits that any application gets with containers. If you look at the whole machine learning stack, there are different tools like TensorFlow and PyTorch, and we need different kinds of infrastructure to run training. Things get very complicated when you think about all the different packages, dependencies, and configurations.
What containers help us do is package our training code, along with all its dependencies, in a much more modular way. That way our ML environment becomes lightweight and very portable, so you can run your machine learning training jobs or other tasks independent of the platform. And one of the reasons it is even better to run it on Kubernetes is composability.
You define your training jobs, or your machine learning tasks, in a granular way, so that you can run them in different places; and if you want to make changes, it doesn't affect the other jobs in your pipeline. The other thing is, you can start today on-prem in your own Kubernetes environment, and you don't have to change anything later if you want to migrate to AWS and run the same training job in the cloud.
Since the training runs as a container on Kubernetes, it is very portable, as we just discussed, and you don't have to think about scale, because Kubernetes takes care of the infrastructure. Whether you need two instances to run the training job, or three, or ten, since the training job runs as a container you don't have to worry about the backend infrastructure, which is very valuable for any machine learning engineer, or even any infrastructure engineer. And the best part about this
is that you don't have to worry about managing the Kubernetes cluster yourself. You may like to use Amazon EKS, our managed Kubernetes service on AWS, which gives you the control plane to which you can attach your data plane, or compute nodes. You natively get the upstream Kubernetes experience, and you can decide which version of Kubernetes you want.
It will feel exactly the same as if you had installed and configured Kubernetes on-prem or on your own EC2 instances, and it integrates with a lot of other AWS services, which we are going to see in a while, when we run a machine learning job using Kubeflow on EKS and save our training data sets on EFS. And one other thing, irrespective of Kubernetes,
is that you don't have to build all those training container images from scratch. We offer a lot of pre-packaged Docker container images which are fully configured, validated, and rigorously tested, so you always get the best configuration and image to make use of.
So what you can do is create your own training script, or use the training scripts that we provide as templates and make the relevant changes based on your needs and requirements. You can always customize those container images, and we support different frameworks like TensorFlow, MXNet, PyTorch, and so on. And the best part is that you can use these deep learning containers not only with EKS, but also with ECS, Amazon SageMaker, and EC2 instances.
So let's talk a little bit about Kubeflow before we jump into the demo. You can think of Kubeflow as a machine learning toolkit for Kubernetes. It comprises various projects like Jupyter notebooks, pipelines, training services, and inference (or serving). Basically, if you have ever seen Amazon SageMaker, it's a similar kind of platform.
It may not have all the fancy features that SageMaker offers, but if you want to run your machine learning workflow on Kubernetes, and you want control over your workflow at a more granular level, then you can make use of Kubeflow. It's an open-source project, so you can always contribute to it. And we're going to see in the demo how you can create a Jupyter notebook and how you can start a training job on Kubeflow.
Now, one important advantage of running Kubeflow on AWS is that you get a lot of flexibility and can leverage the integrations that AWS has with other services. When you are running Kubeflow on EKS, you get all the goodness of the service integrations that we have with EKS. In this case we are going to use EFS with Kubeflow.
Since we are going to use EFS in our workflow, let's talk about EFS a little bit; we learned a little about EKS, which is our managed service for Kubernetes, so let's spend a couple of minutes on EFS. EFS is a simple, serverless, set-and-forget kind of file system. You don't have to specify the size of the file system, and you can use it from almost anywhere: you can access the file system from your on-premises
machines, from an EC2 instance, from a Lambda function, or from a Kubernetes cluster; and we are going to use EFS to save the training data set for our machine learning workflow. It's also very elastic and performant: we recently announced sub-millisecond read latency, which means in general you will get a read latency of around 600 microseconds.
As we discussed, EFS integrates with various compute services. One is on-prem or EC2, but it doesn't stop there: you can always use EFS with any of your containers running on ECS, our managed container service, or on EKS, which we are going to use in our demo. So how do you get started with EFS on Kubernetes?
We are not going to talk about Kubeflow in general here, because we are going to do that in the demo, but I just want to give you an overview of how you can get started with EFS on Kubernetes. The first thing you need is a Kubernetes cluster. In this case we are creating an Amazon EKS cluster, but this could very well be your own installation of Kubernetes on a bunch of EC2 instances where you are managing the cluster yourself.
We are just taking this example where we are creating a managed EKS cluster. What that means is that you don't have to manage the cluster yourself: all the patching, version control, and updates are taken care of by Amazon.
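Creating such a managed cluster is typically a single eksctl command; a sketch, where the cluster name, region, node count, and instance type are illustrative choices, not values from this demo:

```shell
# Create a managed EKS cluster; eksctl provisions the control plane and a node group
eksctl create cluster \
  --name my-eks-cluster \
  --region us-east-1 \
  --nodes 5 \
  --node-type m5.large
```

eksctl also writes the kubeconfig entry for you, so `kubectl get nodes` works as soon as the cluster is up.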
So first you need to create the EKS cluster. Second, you need to create a security group so that the EFS file system can be accessed from the EKS cluster, and then you have to create the EFS file system itself. Now, we are going to do this through code in a while, but this is what the workflow looks like. And the most important thing is that you need to install the EFS CSI driver; this storage driver doesn't come out of the box.
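The security-group and file-system steps above could be done with the AWS CLI along these lines (the VPC ID, CIDR range, and names here are hypothetical):

```shell
# Security group allowing NFS (port 2049) from the cluster VPC
SG_ID=$(aws ec2 create-security-group --group-name efs-sg \
  --description "EFS access from EKS" --vpc-id vpc-0123456789abcdef0 \
  --query GroupId --output text)
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
  --protocol tcp --port 2049 --cidr 192.168.0.0/16

# The EFS file system itself; note that no size is specified anywhere
aws efs create-file-system --tags Key=Name,Value=my-efs1
```

You would then create mount targets in the cluster's subnets with `aws efs create-mount-target`, attaching the security group above.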
So once you have your EKS cluster, if you want to attach EFS storage, you need to install this CSI driver. This is also an open-source project, if you would like to contribute; we have made a lot of improvements to the CSI driver in the recent past, so please have a look. So: you have created the EKS cluster, you have created the file system, and you have installed the CSI driver on the EKS cluster.
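The driver install itself is usually a short Helm sequence; a sketch based on the driver's public chart repository:

```shell
# Install the Amazon EFS CSI driver from its Helm chart
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm repo update
helm upgrade --install aws-efs-csi-driver \
  aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system
```

After this, `efs.csi.aws.com` is available as a provisioner for storage classes on the cluster.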
What you can do next is define a storage class. In the storage class definition you provide the file system ID which we just created, and that's all. After this you can run your application by creating a persistent volume claim that references that same storage class.
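A minimal storage class for the driver might look like the following; the file system ID is a hypothetical placeholder, while the name `efs-sc` and the dynamic (access-point) provisioning mode match what this demo uses:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap            # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0  # hypothetical: your file system ID
  directoryPerms: "700"
```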
Here, the storage class name we have given is efs-sc; you just refer to the same storage class in your PVC definition, and once that is done, you can mount that PVC in your application, or in your pod. In this case we are just using a persistent volume claim named efs-claim, and this is the same claim name that we used to create the PVC. So that's all!
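Putting the PVC and pod together, a sketch (the pod image and mount path are illustrative; `efs-claim` and `efs-sc` are the names from the slides):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany          # EFS supports many pods reading and writing at once
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi           # required by Kubernetes; EFS itself ignores the size
---
apiVersion: v1
kind: Pod
metadata:
  name: efs-app
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data   # the EFS file system appears here inside the pod
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: efs-claim
```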
Next time, if you want to run another application, you just create another PVC and use it in your pod; you don't have to go back to the storage side over and over again, because we are using dynamic provisioning, and the CSI driver takes care of creating the access points, which is the technology behind provisioning these PVCs. So that's a little bit about how you can make use of EFS with Kubernetes. Now, before we jump into the demo,
this is the architecture of our demo. We are going to use EFS to store our training data set. In our demo we are going to download the training data set, and we are going to run a training job on Kubeflow; that training job is going to access the storage from EFS using the CSI driver. And for our training job,
we are going to build an image and push it to ECR, our container registry, which is kind of like Docker Hub, but within AWS. Then we will start the training job on our EKS cluster using Kubeflow, and it is going to pull the image from ECR and run the training job on the training data set saved in EFS. Okay, so let's jump into the demo and see it all in action.
Okay, I have already opened my AWS console, and you can see I am inside Cloud9. We are going to use Cloud9, which is an IDE for writing your code on AWS. It gives you a nice IDE-style environment that runs on an EC2 instance, and it's very easy for you to write code as you go along.
You don't have the dependency of carrying your laptop or workstation; you can write your code from anywhere, as long as you have internet connectivity. If you come here, you can see that I have two environments; I have already opened this one. You can always create your own environment by clicking on Create environment, and it will just ask you a few questions: what type of EC2 instance you need, what operating system, and you are all set.
This is the IDE environment that I have; if you come here and click on Open IDE, you will land on this page. I have a lot of code here; I have cloned the GitHub repo, which I'm going to share in a while, but you will get this kind of interface. Okay, so first things first: I already have a Kubernetes cluster up and running, and we can see that if we run kubectl get nodes; we have an EKS cluster with five nodes.
I have also installed Kubeflow, and if you want to verify that, we can see all the pods running as part of Kubeflow; Kubeflow, as we learned a while back, is a collection of different services, so we have all those pods running which are actually running Kubeflow. Now, the first thing is that we need to create an EFS file system; at this point we don't have any EFS file system created. And if you run kubectl get storageclass, or sc,
you see the default storage class, which is an EBS volume; that is the default when you create a Kubernetes cluster on AWS. Now, the first thing we need to do is create the EFS file system, and then install the driver and create a storage class. But we don't have to do it all manually.
We have created a script, which is located inside this directory, and it has some dependencies. This auto-efs-setup script uses some external libraries, so for those we have a requirements.txt file. First, let's install all the packages; I already have them installed, so it will just skip the installation. Next, we need to run the script.
So let me run the script, and then we can go over what it is doing. Before I hit enter, I just want to show you a few parameters that we are passing along. One is the region, meaning the region in which we are creating this file system, which is obviously the same region where I have my EKS cluster. Then I'm giving the cluster name, which I've saved in an environment variable, and then the file system name. So, before I hit enter:
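A sketch of installing the dependencies and invoking the setup script; the script filename, flag names, and variable names here are assumptions based on how the demo describes its repo, not verbatim from it:

```shell
# Install the script's dependencies
pip install -r requirements.txt

# Hypothetical invocation: region, cluster name, and file system name
export CLUSTER_NAME=my-eks-cluster
python auto-efs-setup.py \
  --region "$AWS_REGION" \
  --cluster "$CLUSTER_NAME" \
  --efs_file_system_name my-efs1
```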
If you come to the EFS console and hit refresh, we see that there is no file system yet. Hopefully, after this script gets executed, we will have a new file system. So while this runs, let me quickly show you what the script is doing.
The script is doing just three basic things. First, it checks a few of the prerequisites and what is needed. Then it creates the IAM roles so that the cluster can access the EFS file system. Then it installs the CSI driver, which we just talked about
in the presentation: you need to install the CSI driver so that the Kubernetes cluster can talk to EFS. Then it creates the file system, and after that it creates a storage class. And this is the storage class which we are going to use in Kubeflow for training our job, and even for the notebooks, to create different data stores for keeping our training data sets.
Okay, so creating a file system and creating a storage class is something that you can do repeatedly, as and when you need. But setting up the CSI driver and all of that is just a one-time activity. It will take a couple of minutes, so let's wait for it to complete.
Okay, as we can see, it has created the file system, created the mount targets, and provisioned a storage class. To see that, we can just run kubectl get sc, and we can see the storage class which has been created. It is still not the default one, so if we create anything on Kubeflow, say a Jupyter notebook, it will use the default storage; but if you explicitly mention this storage class, it will carve out storage from here. Okay.
So now, before we go ahead: if we go to EFS and click on refresh, you will see that the file system got created, my-efs1, and this is exactly the name that we passed as the EFS file system name in the script.
So we are all good now. And if you come inside this file system and go to Access points, which are the entry points for an application into EFS, you see that there is no access point created yet, because we have just created the storage class; no PVC has been claimed or created.
So this is the dashboard service, basically the istio service, and now I just go to preview and open the app in a separate tab. I'm just closing this off and going back here. Now the dashboard is up and running, and we can see it here: we have notebooks, TensorBoard, volumes, pipelines, everything.
If you have seen SageMaker on AWS, this is kind of the same, although SageMaker has a lot of additional flexibility and features. But this is a nice environment for you to manage everything inside out, so you have more granularity, and it's all running on Kubernetes, which is great. Now, if you see here, we don't have any volume, so let's create one; this is the volume which we are going to use for keeping our training dataset.
Let's give it some size, say 100 GB, and here we can select the storage class, and we can set the access mode to ReadWriteMany, which EFS supports; that means you can access this volume from multiple pods.
So let's create it. It is going to be in a pending state, because we have not yet attached this volume to any Jupyter notebook or any other training job. So let's go ahead and create a Jupyter notebook.
Let's give it a name, say notebook-1. Here we can select the image, but let's keep the default. It is going to create a volume for its own use, basically the home directory for this notebook, and since our default storage class is EBS, it is going to create that PVC from the gp2 storage class. But we are also going to attach an external volume, the EFS dataset volume which we have just created.
It's already created, and if I come to Volumes now, you will see that the dataset volume's status is now Bound, and we also have another volume for the home directory, which is coming from gp2. And now, if I come to EFS and click on refresh here, you will see one access point which got created dynamically by the CSI driver.
So let's go back to our notebook and connect to it. What we are going to do now is run a training job, but before that we need some data. So we are going to use the Jupyter notebook to download a data set. I already have the location of the data set; it is basically a simple
data set which contains images of different flowers, and we will run a CNN job, basically a deep learning training job, which will identify the type of flower given an image of one. This is a very tiny training data set, and the focus is not on the machine learning part; the idea is really to show you how you can make use of Kubeflow to run your training jobs.
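Downloading a flowers data set into the mounted EFS volume might look like this from the notebook terminal; the mount path is a hypothetical placeholder, and the URL is TensorFlow's public flowers example data set, used here as a stand-in for whatever location the demo pulls from:

```shell
# cd into the directory where the EFS dataset PVC is mounted (path is hypothetical)
cd /home/jovyan/dataset
curl -LO https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
tar -xzf flower_photos.tgz   # extracts per-class folders: roses/, sunflowers/, tulips/, ...
```

Because the directory is EFS-backed, everything extracted here is immediately visible to any other pod that mounts the same PVC.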
If you get inside this data set, which is coming from the EFS storage, you will see different types of flowers: roses, sunflowers, and so on. So let's wait for this to get downloaded, and once it is done, we will go to Kubeflow and start a training job.
As you can see, the training data set has been downloaded, and we have the images saved inside this EFS data set share, so we are all good to start the training job. Let me close this Jupyter notebook, because we don't need it anymore; the only thing we wanted to do was download the dataset, which is now stored in this EFS file system via the access point. So now let's go back to our console and start the training job.
If we open this architecture: we have saved the training data set on EFS, and all we need to do is run the training job. For this I already created a deep learning image which contains the code for running the training job; we built it locally on the Cloud9 instance, and then I pushed it to an ECR repository, which I'm going to show you. Then we can simply go ahead and run a training job on Kubeflow, where I have specified the training data set location as the dataset we created, and the image to use as the same image which is in ECR.
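The build-and-push step follows the standard ECR workflow; a sketch where the account ID, region, and repository name are hypothetical:

```shell
# Hypothetical account ID, region, and repository name
ACCOUNT=123456789012
REGION=us-east-1
REPO=flower-training

# Authenticate Docker against the private ECR registry
aws ecr get-login-password --region "$REGION" | \
  docker login --username AWS --password-stdin "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com"

# Build locally on Cloud9, tag with the registry path, and push
docker build -t "$REPO" .
docker tag "$REPO:latest" "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"
docker push "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"
```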
So let me show you the image first. If we run docker image ls, you will see that we have one image, or a repository with this image saved. All of this code, including the Dockerfile, is in the GitHub repo, which I'm going to share with you towards the end. But if you want a quick look at the Dockerfile: it simply pulls a TensorFlow base image,
copies the training script, which is located here, and sets it as the entry point. That's all, nothing fancy; inside this training script we are running the machine learning training.
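The Dockerfile described above might look roughly like this; the base image tag and the script's in-container path are assumptions, while `training.py` is the script name used in the demo:

```dockerfile
# Hypothetical TensorFlow base image tag
FROM tensorflow/tensorflow:2.9.1

# Copy the training script from the repo into the image
COPY training.py /opt/training.py

# Run the training as the container's entry point
ENTRYPOINT ["python", "/opt/training.py"]
```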
So let's run it. The training job we are going to run on Kubeflow is also defined as a YAML file; it's inside this training-samples directory, in tf-job.yaml. If I open this up, you will see it is a TensorFlow job (TFJob); this is the name of the job, and we are going to create two replicas.
That means when we execute this, you will see two pods getting created for this training job. Here you can see the image we are using, and the most important part, the training data set: this training will run on some data, and this is the same data set which we downloaded a while back into the dataset PVC. So if we run kubectl get pvc, let me just grab it:
If you see this, this is the dataset PVC, and it's the same PVC which we are mounting on this training job. The training job is going to create the pods, and those pods will have our EFS storage attached and mounted inside this train directory; and in our training script we have said: go and read this directory for your training data.
Let me open this training.py, and you can see here we specify where our training data set is located. Okay, so let me go back to the CLI and run this job. We are inside this ml folder, and the training job is inside the training-samples directory, so we can simply run kubectl apply with the location of our definition file.
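A minimal sketch of a TFJob like the one described; the image URI, mount path, and PVC name are hypothetical placeholders, while the job name and replica count follow the demo:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: image-classification-pvc
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2                      # two worker pods, as in the demo
      template:
        spec:
          containers:
            - name: tensorflow
              # hypothetical ECR image URI
              image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/flower-training:latest
              volumeMounts:
                - name: dataset
                  mountPath: /train    # the training script reads its data here
          volumes:
            - name: dataset
              persistentVolumeClaim:
                claimName: dataset     # the EFS-backed PVC created earlier
```

Applying this with `kubectl apply -f tf-job.yaml` creates the worker pods, each mounting the same EFS-backed data set.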
So now, if we look at the training job, it is in the pod-creation phase; it is yet to start the training, but we see that there are two pods which it created. We can even run kubectl get pods
with the namespace, and you will see these are the two pods which are running, and they match the name that we gave, image-classification-pvc: worker-0 and worker-1, because the replica count is 2.
So let's run this, and you can see the training job has already completed, because it was a tiny data set and we only ran it for two epochs. And you can see the accuracy is not at all great, but that's okay; the idea is basically to show you how you can make use of Kubeflow to run your training job without any hindrance. So our training job is done.
And now, if you look at the pods, you see they are in a not-ready state, meaning they are not running anymore; the job is already over, and what we can do is delete this whole deployment: to delete the job, we can just copy this and run kubectl delete.
So what Kubeflow allows us to do is scale our machine learning workflow dynamically, so we don't have to worry about the infrastructure that is needed for ML training; and with EFS you get the flexibility to attach or share the storage across your team, for different data scientists, or maybe for different users,
saving your training data set in one central location which can be accessed by different people. If you look here in EFS, this is the place where I have my training data set. You can access this not only from your Kubeflow users; you may also, say for troubleshooting, want to attach this to an EC2 instance and explore something.
Maybe you want to look at the training data set for some ad hoc task; you can always click on Attach, copy the command, and mount this file system as NFS storage onto your EC2 instance, provided you have all the permissions granted.
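The mount command the console hands you looks roughly like this (the file system ID and region are hypothetical; the NFS options are the ones EFS documents as its recommended defaults):

```shell
# Mount the file system over NFS 4.1 from an EC2 instance
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 \
  -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
  fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/ /mnt/efs
```

Once mounted, the same files your Kubeflow pods write appear under /mnt/efs on the instance.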
So the idea here is that when you use EFS, you can access the same storage from containers, EC2 instances, Lambda functions, and so on, which gives you a lot of flexibility; and you don't have to provision it beforehand. If you look, nowhere did we mention the size of the file system; it will scale up and scale down automatically. Also, when we created the volume,
we mentioned a size just to make Kubernetes happy, because for Kubeflow, or Kubernetes in general, you need to specify the size of a PVC; but from the EFS standpoint,
it is simply ignored, because there is no size requirement. So that's all about it. If you want to go through this whole demo, this is the location: you can go to the Amazon EFS Developer Zone, and inside that we have a Machine Learning with Kubeflow on EKS with EFS section. This tutorial will guide you through setting up the whole environment on Cloud9, as well as the training job and a few other things.
So if you want to try it out, feel free to go over there and give it a try. To get to that page, you can go to the landing page, Amazon EFS Developer Zone; if you scroll down, you will get some information about EFS, what it is and how it works in a little bit of detail, and further down you will see a section on the different integrations. This is Amazon EFS with containers, and here you can see Machine learning at scale using Kubeflow.
You can always click there and you will go to that page which we have just seen, and try it out in your own account. Okay, so that's about it; let's go back to our slides. All right, so now you have learned a little bit about how you can make use of EFS with EKS for Kubeflow.
There are plenty of other Kubernetes- or container-specific tutorials available on the Amazon EFS Developer Zone, which we saw a while back during the demo. Feel free to access that page and share your experience, and if you would like to contribute, you can send a PR with your demo and we will add it to the repository.
So thank you so much for your time. I hope you learned a little bit about Kubeflow, EKS, and EFS, and I look forward to hearing your feedback once you try this in your own account and share your experience. Thank you so much once again, and have a wonderful day ahead.