From YouTube: OpenShift Commons ML Briefing: Distributed Training Performance on K8S with Lachlan Evenson (MSFT)
Description
Lachlan Evenson (Microsoft) discusses the results of his team's research on Distributed Training Performance on Kubernetes with the Machine Learning on OpenShift SIG of OpenShift Commons.
Learn more at https://github.com/joyq-github/TensorFlowonK8s
Join OpenShift Commons https://commons.openshift.org#join and join the conversation
B: Can you, can you give me...
A: Okay, okay, fantastic. So if you have any questions, feel free to interrupt, because I can't see everybody's faces while I'm giving the presentation, so don't be shy. What I'm going to share is some performance testing that we did with distributed training on Kubernetes with TensorFlow. This is a piece of a presentation that I gave at KubeCon, but I think you'll find it very interesting.
A: So, as David mentioned, a lot of folks are looking to run ML workloads on Kubernetes, and the SIG is really about doing that on OpenShift: how do we get ML onto Kubernetes? This work came out of something about eight months ago, when the Microsoft Research team came to the container team at Microsoft Azure and asked whether we could lend some of our expertise in running Kubernetes and help them understand why they were having performance issues when they tried to scale distributed training models on Kubernetes. So I went over and worked with the MSR team for several weeks, and we went through all the scenarios; this is just the output of that work, so hopefully you'll find some of it interesting. At the end there's also some future work that we're doing in this space. But before we get into it, most people ask: what were you testing on?
A
So
you
can
taste
this
and
test
it
on
any
cloud
and
any
any
bare
metal
setup
that
you
have.
What
we
do
here
is
we've
just
documented
the
exact
VMS
specs
that
we
have
for
the
workers
and
for
the
parameter
servers
the
type
of
GPUs.
We
have
the
type
of
the
diversion
of
kubernetes,
all
the
parameters
that
we
have
for
the
actual
test
environment
and
we're
just
using
I'm,
not
sure,
if
everybody's
aware
we're
actually
just
using
the
upstream
tensorflow
benchmark
scripts
to
actually
run
these
test
frameworks
against
different
models.
That's
all
there
for
you.
A: If you want to go and replicate this and test it yourself, the data set we're using is ImageNet, which is a real data set, not synthetic. So those are the parameters of the test environment.
The first thing we did was no distribution at all. We just ran a single node with multiple GPUs to get a baseline of how a model could be trained and how many images per second we could process out of that ImageNet data set. You can see here, at 1, 2 and 4 GPUs, we get pretty much linear scalability, and we made sure that all the GPUs were indeed saturated during those tests. That gave us a fairly good baseline of what we could do with ResNet-50, so 64 images per second there. So that was the standard: we get linear scaling on a single node. What we wanted to see next was whether we still got linear scaling in a distributed setup, or whether it broke down.
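As a point of reference, a minimal sketch of driving the upstream tf_cnn_benchmarks script for that single-node, multi-GPU baseline might look like the following. The paths, batch size and model flag are illustrative assumptions rather than the team's exact commands, and flag names can differ across script versions.

```python
# Hypothetical invocation of the upstream TensorFlow benchmark script
# (tf_cnn_benchmarks.py from tensorflow/benchmarks) for the single-node baseline.
# Paths and flag values are assumptions for illustration only.
import subprocess

for num_gpus in (1, 2, 4):                      # the 1, 2 and 4 GPU baseline runs
    subprocess.run([
        "python", "tf_cnn_benchmarks.py",
        "--model=resnet50",                     # also googlenet, inception3, vgg16, ...
        f"--num_gpus={num_gpus}",
        "--batch_size=64",
        "--data_name=imagenet",
        "--data_dir=/data/imagenet",            # real ImageNet data, not synthetic
    ], check=True)
```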
A: When we moved to distributed, we started with one parameter server and two workers. We used asynchronous variable updates; these are all parameters you can tune inside TensorFlow when you set things up, and we used the CPU as the local parameter device. Each worker pod had its own dedicated host, so we didn't have any contention between parameter servers and worker nodes: it was essentially three VMs, one parameter server and two workers, each with a pod for the respective service it was hosting. We had the variable update mode set to parameter server, and we were using gRPC as the network protocol. What you see here is single-pod training with 4 GPUs versus distributed training. The blue bars are the numbers we actually saw with a single pod on a single node, and then what we saw when we scaled it out.
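As a rough illustration of that configuration (one parameter server, two workers, parameter-server variable updates, CPU as the local parameter device, gRPC), the same benchmark script can be launched per role along the lines of the sketch below. Host names, ports and flag values are assumptions, not the exact test setup.

```python
# Hedged sketch of the distributed tf_cnn_benchmarks setup described above.
# Each command runs in its own pod/VM; names, ports and paths are illustrative.
import subprocess

def benchmark_cmd(job_name, task_index):
    return [
        "python", "tf_cnn_benchmarks.py",
        "--model=resnet50",
        "--batch_size=64",
        "--num_gpus=4",
        "--data_name=imagenet",
        "--data_dir=/data/imagenet",
        "--variable_update=parameter_server",    # gradients flow through the PS
        "--local_parameter_device=cpu",          # CPU as the local parameter device
        "--server_protocol=grpc",                # gRPC transport
        "--ps_hosts=ps0:2222",
        "--worker_hosts=worker0:2222,worker1:2222",
        f"--job_name={job_name}",
        f"--task_index={task_index}",
    ]

subprocess.run(benchmark_cmd("ps", 0))          # on the parameter-server VM
# subprocess.run(benchmark_cmd("worker", 0))    # on worker VM 0
# subprocess.run(benchmark_cmd("worker", 1))    # on worker VM 1
```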
These are across the different models: GoogLeNet, Inception v3, ResNet-50, ResNet-152 and VGG-16. You can see that GoogLeNet actually gets fairly good scaling. At four GPUs you still get fairly linear scaling, but here at eight GPUs on two machines you can see the scaling factor drops a little depending on the model, and for VGG-16 in particular it actually gets worse: distributed training with the VGG model is slower than using a single machine. So this was interesting to us, and we wanted to dig in, figure out what was actually causing those bottlenecks, and understand them.
A
So
what
we
can
came
to
find
was
we
took
a
look
at
Network.
We
took
a
look
at
disk.
We
took
a
look
at
different
network
configurations
that
we
could
do,
but
really
it
came
down
to
the
compute
versus
been
with
ratio
and
really
when
it
goes
to
setting
up
a
parameter
server.
The
bottleneck
of
the
whole
system
is
how
fast
that
parameter,
server
can
process
the
gradient
changes
and
redistribute
them
to
the
nodes,
and
that
is
actually
basically
on
the
network.
Although
different
models
have
different
networks,
so
computation
cost
readily
alex
net.
A
How
much
computation
costs
we
have
for
these
different
models
and
across
training
speed
up
of
two
nodes
versus
a
single
node,
so
Google
getnet
down
in
the
bottom
right.
Here
you
get
one
point,
eight
six,
obviously
as
a
public
cloud
provider
or
anybody
who's
concerned
with
costs.
If
people
are
paying
for
two
nodes
and
they're
using
distributed
tensorflow,
we
want
them
to
get.
You
know
pretty
much
linear
scaling
as
they
pay
for
more
compute
and
with
Google
net.
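To make the compute-versus-bandwidth point concrete, here is a back-of-envelope sketch (my own illustration, not from the talk) of how long one full fp32 gradient exchange occupies a 10 Gb/s NIC for models with very different parameter counts. The parameter counts are approximate published figures, and the link is assumed to be fully available.

```python
# Back-of-envelope: time to ship one full fp32 gradient/parameter exchange
# over a 10 Gb/s NIC. Parameter counts are approximate; link assumed idle.
PARAMS = {
    "googlenet": 7e6,      # roughly 5-7M parameters
    "resnet50": 25.6e6,    # roughly 25.6M parameters
    "vgg16": 138e6,        # roughly 138M parameters
}
BYTES_PER_PARAM = 4            # fp32
NIC_BYTES_PER_SEC = 10e9 / 8   # 10 Gb/s ~= 1.25 GB/s

for model, n in PARAMS.items():
    size_mb = n * BYTES_PER_PARAM / 1e6
    seconds = n * BYTES_PER_PARAM / NIC_BYTES_PER_SEC
    print(f"{model}: ~{size_mb:.0f} MB per exchange, ~{seconds * 1000:.0f} ms on the wire")

# VGG-16's ~550 MB exchange keeps the parameter server's NIC busy for roughly
# 0.4 s in each direction per step, which is why it scales so much worse than
# GoogLeNet's ~28 MB exchange.
```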
A: So these are just some things to keep in mind as you look at distributed training, because what surprised me, as an operator of Kubernetes for many years, was that "just throw more hardware at the problem, add more nodes, and surely your results will be better" is only true to a certain extent, and not to the extent you would think. Some observations we jotted down: it really depends on the model, and on network bandwidth in particular, so we tested both non-specialized and specialized network hardware.
A: What that boiled down to was 10-gig NICs on the VMs versus InfiniBand on the VMs, and how much bandwidth we could consume. The GPUs were not fully saturated on the worker nodes: even when we scaled out, we couldn't reach full GPU saturation, and it's likely because of how the gradients and batches are being redistributed that the GPUs could never get there. We could reach it with those models on single-node systems. So it was suboptimal performance, with the GPUs starved most of the time and nothing for them to do.
A
To
do
we,
we
actually
tried
doing
a
number
of
different
PS
server
configurations
so
running
on
the
same
note
as
the
pods
actually
seemed
to
create
worse
performance
and
that
we
tried
to
do
multiple
PS
servers
and
understand
whether,
if
we
had
multiple
parameter
servers,
whether
we
would
actually
get
better
distribution
to
those
workloads.
We
also
tried
synchronous
and
asynchronous
variable
updates,
but
what
we
actually
published
in
those
previous
slides
was
best
case
that
we
actually
found
so
really
they've
got
us
asking
what
we
can
do
better.
A
So
we
went
out
and
we
re
running
a
lot
of
these
tests
today
results
yet
to
be
published.
But
before
we
went
in
to
cube
con
about
a
week
before
goober
released
hor
Avadh,
which
is
an
open
source,
distributed
learning
framework
for
tensorflow.
So
if
you've
already
got
your
models,
written
intensive
flow,
there's
no
extra
work
that
you
had
to
have
to
do:
hor,
Avadh,
simply
shims.
In
now
here's
this
is
taken
from
the
horrible
dock
specifically,
but
really
what
you
can
do
is
enabling
different
batch
sizes
here
really
get
linear
scaling
distributed.
A
Tensorflow
gets
this
horrified
will
get
that
with
TCP
and
obviously
with
our
DMA.
You
can
get
a
little
bit
more
and
they
also
publish
the
ideal
linear
scalability
here.
So
really,
what
you
can
see
is
with
large
about
sizes
in
the
distributed
training
here
on
kubernetes,
you
can
actually
get
closer
to
linear
scaling
with
horev
on
now.
It's
a
standard
load
Python
package,
if
you're
not
familiar
with
it
and
it's
seamlessly
installed.
Now
it
uses
nickel
and
nickel
is
intra.
Node
GPU.
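To show how little an existing TensorFlow model changes, here is a minimal sketch based on Horovod's documented TensorFlow (1.x era) usage; the model, optimizer and learning rate are placeholders, not anything from the benchmarks in the talk.

```python
# Minimal Horovod "shim" around an existing TensorFlow (1.x) training loop,
# adapted from Horovod's documented example; model/loss here are placeholders.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                           # one process per GPU

# Pin each process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# ... build your existing model and loss here (placeholder shown) ...
loss = tf.reduce_mean(tf.square(tf.get_variable("w", [1]) - 1.0))
global_step = tf.train.get_or_create_global_step()

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged with ring-allreduce (NCCL within a node).
opt = hvd.DistributedOptimizer(tf.train.GradientDescentOptimizer(0.01 * hvd.size()))
train_op = opt.minimize(loss, global_step=global_step)

# Broadcast initial variables from rank 0 so all workers start in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=1000)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

A script like this is typically launched with MPI, one process per GPU across the worker pods, which is part of what makes it straightforward to run as a plain job on Kubernetes.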
A
So
this
is
just
a
way
that
we
can
start
to
look
at
using
tensor
flow
and
there
are
other
tools
out
there
that
we
might
be
able
to
achieve
linear
scaling,
because
when
we
go
to
these
distributed
models
on
distributed
compute
and
in
the
cloud
we
really
want
to
get
as
close
to
linear
scaling
as
possible.
That's
what
we're
going
for
here!
So
there's
a
another
paper
out
there,
that's
also
of
interest.
So
it's
deep
grading
compression,
so
there's
been
analysis
on
those
gradients.
A
So
that's
as
the
models
being
changed,
the
Delta
in
the
parameters
that
get
distributed
back
up
to
the
parameter
server
they're
actually
have
to
be
fairly
redundant.
99.9
percent
of
the
gradient
is
strange,
is
actually
redundant,
but
the
purpose
of
this
gradient
progression
is
really
to
get
your
parameters,
your
gradients,
actually
down
to
reasonable
sizes
before
shooting
them
down
the
wire
right.
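Deep Gradient Compression itself does more than this (momentum correction, local gradient clipping, and so on), but the core idea of sending only the largest entries and accumulating the rest locally can be sketched roughly as follows; this is a simplified illustration of the idea, not code from the paper.

```python
# Simplified sketch of top-k gradient sparsification with local residual
# accumulation, the core idea behind Deep Gradient Compression.
import numpy as np

def compress(grad, residual, keep_ratio=0.001):
    """Send only the largest ~0.1% of entries; carry the rest forward locally."""
    acc = grad + residual                       # add back what was withheld earlier
    k = max(1, int(acc.size * keep_ratio))
    threshold = np.partition(np.abs(acc).ravel(), -k)[-k]
    mask = np.abs(acc) >= threshold
    sparse_update = np.where(mask, acc, 0.0)    # what actually goes on the wire
    new_residual = np.where(mask, 0.0, acc)     # withheld locally for the next step
    return sparse_update, new_residual

# Example: a 1M-element gradient shrinks to ~1,000 non-zero values per step.
g = np.random.randn(1_000_000).astype(np.float32)
update, residual = compress(g, np.zeros_like(g))
print(np.count_nonzero(update))                 # roughly 1000
```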
A
So
you're,
going
from
a
hundred,
Meg
or
500,
meg
and
or
less
then
I'll
make
for
most
of
this
so
another
way
that
we
can
actually
not
take
care
of
scaling
and
as
we
go
obviously
we
only
had
two
nodes,
but
people
made
one
hundreds
or
thousands
of
worker
nodes,
but
this
really
allows
you
to
scale
in
our
testing
on
one
gig
Ethernet
VM.
So
you
don't
need
specialist
hardware
and
you
don't
need
large
NICs
in
your
VM.
So
another
interesting
fact
there
now
just
just
some
other
other
tools
that
we
actually
have
out
there.
A
So
there
is
a
deep
learning
open
source,
deep
learning
workspace
that
we
publish
now.
This
is
akin
to
a
very
point-and-click
experience
and
where
we're
also
working
with
with
cue
blow
as
David
mentioned,
but
the
deal
workspace
predates
Qubo
a
little
bit.
So
if
you're
interested
you
can
shoot
over
here,
it's
open
source.
A
You
can
take
a
look,
but
that's
very
UI,
driven
supposed
to
give
data
scientist
access
to
a
bunch
of
different
frameworks,
tensorflow
PI,
torch,
spark
and
lightweight
way
to
get
entry,
there's
a
couple
of
other
things
that
were
working
on
so
free
flow,
C&I
plugin.
So
this
may
be
of
interest
in
this
forum,
but
you're
really
leveraging
shared
memory
and
RDMA
to
improve
network
performance.
A
Let
me
just
pop
over
so
these
are
both
open
source
tool
kits
that
you
can
go
and
take
a
look.
So
if
you
want
to
run
spark
in
this
deal,
workspace,
there's
a
document
on
how
to
do
that
and
set
this
up,
and
this
will
run
on
any
kubernetes
cluster
anywhere
on
any
platform
I'm
just
using
kubernetes
abstractions.
A
Finally,
just
this
week,
we
repackaged
and
open
sourced
a
custom
CRI
to
give
us
more
granularity
in
GPU
constraints
and
a
custom
scheduler
for
kubernetes
to
be
able
to
bubble
these
GPU
constraints,
not
only
how
many
GPUs
you
have,
but
different
factors
inside
those
GPUs
so
feel
free
to
go
and
pick
that
up,
take
a
look
at
it
and
and
contribute
at
this
point.
I
will
open
it
up
to
any
questions.
B: Listening to this and wondering: some of the stuff you just presented reminds me a little bit of how the old MapReduce formalism got its traction by being opinionated, as in, we're going to focus purely on stuff that actually scales well. So in terms of the Kubeflow direction and other things like that, I wonder whether this indicates we should be fairly opinionated about what kinds of implementations we support. Sorry, I also have a bit of a flu here.
C: I think that's a really excellent point, Eric. I'm not so hardcore on implementation, and this is my opinion, not as a Kubeflow person, just as an ML person: I'm not so hardcore that I would literally block people from running stuff that's going to have a bad time. But I think it behooves us to really double and triple down and make sure people are aware, to the extent possible, that "hey, do this, or you may not scale linearly." It's exactly the kind of research that Lachlan has provided here, and things like that, that will help people be aware of it. And second, it will help us indicate, "hey, if you want to get linear performance here, you really need to be including a higher-level MPI framework or something along those lines."
A: Yeah, absolutely. And I think even upstream in TensorFlow specifically, now that these performance benchmarks are out there and people are starting to process the numbers, the TensorFlow community as a whole is doing a lot. As you mentioned, it's very akin to MapReduce: once people start to see the numbers, they actually take action to get this closer to linear scaling, because it's a surprise to most people that it currently isn't. So we're continuing that research and expect to see a little bit more, and I can add the slides to the talk. If you're interested in chatting with either myself or the Microsoft Research team that I worked on this with, I'm happy to discuss.