From YouTube: 21 - Scaling Neural Networks Training - Thorsten Kurth
Description
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
A: Hi, good morning, and welcome to the last day of the Deep Learning for Science school. I hope you still have the energy to go through another half day. Earlier in the week we talked about training with gradient descent, and about how we do gradient descent stochastically, using small batches. The small batches introduce some noise, and when we try to go beyond that, we run into optimization issues, because the noise is important. A lot of you asked about large batch training, because you want to train faster, since we deal with very large data sets. So most of today's program will be on this. Thorsten will talk about how we do this training on multiple nodes or multiple workers, going all the way to the HPC scale, and he will also talk about large batch training in general: what kinds of problems we run into and how to get around them. Then later today we'll have our hands-on session, actually doing that in practice on the Cori machine. Thorsten is an application readiness engineer at NERSC; his main work is on optimizing codes to run faster on bigger machines.
B: Okay, let's try that. Thanks for the introduction, Mustafa, and I hope you still have energy for the rest of the day. I will talk about scaling deep learning training, and I will also bother you a bit with some HPC background on communication, the complexity of communication algorithms, and on why we care about matrix multiplications in deep learning. You might know this already, but especially when you distribute things, you need to think about it more carefully.

So I will talk about this a bit. My first part is the motivation; it's more like an introduction to why we want to do deep learning at scale. I think most of you are motivated to do that anyway, but in general I think it's good to point out certain things which might help. So I will cover the communication basics, and I will briefly talk about parallelization strategies as well, especially in light of deep learning. Then I will talk about what most of you mentioned: large batch training, and what you can do to get better convergence. We will also talk about things like accuracy improvements, especially when you do distributed learning and have a small local batch size.
Okay, so why do we need to scale deep learning? This is a survey Mustafa conducted in 2018, where we looked at how long typical models need to train. A lot of people, around 60 percent, train for a couple of hours, but we also have scientists who want to train for days or even weeks. You don't want to train your model for two weeks just to learn that it doesn't work very well. The problem scale can also be quite big: the data sets especially can be quite large. About 20% of the folks have a terabyte or bigger, which is quite big, but even beyond that, around 25 percent have data set sizes of about 100 gigabytes.

It was a survey which specifically targeted machine learning, so we just asked: what do you run, what kinds of models, how big is your data set, how long do you train? For the data set size, we asked what the typical training data set size is that you want to train on. Of course, this can include anything; the HPC experiments can be petabytes. They might not want to train on everything, but this bin captures all the rest. And the data sets themselves can be very, very complex even if the number of samples looks modest, because they can contain very high dimensional data.
That means a single sample can easily be of order 50 megabytes or bigger. So you might not have many samples, but the number of bytes you need to load and process is quite large. Also, as you know, models get bigger and more compute-intensive. This slide shows a somewhat outdated model list, but when you look at BERT and these transformer models, for example, they have billions of parameters, and in the end you want to tackle much more complex tasks with them. So you don't want to necessarily restrict yourself to a single node or a single GPU; you want to think about splitting these models up in the future. And this is a plot by OpenAI, a study of how many petaflop/s-days you need to invest to train a model. When you look, for example, at the DeepMind ones, the reinforcement learning systems like AlphaZero, they need a lot, a lot of flops. So you need to parallelize; you cannot do that on a workstation anymore.
Okay, so let me talk very briefly about matrix multiplications and why they are important; it's very easy. When you look at all the deep learning primitives you have, it's basically exactly that. Fully connected layers are the most obvious example: you connect all the input features with all the output features, so this is just a matrix multiplication. You know that when you code it up in TensorFlow, for example, you literally write down that product. But convolutions can also be cast into matrix multiplications. You might not know that, because it's not that obvious; if you do not work with the underlying kernels, you might never see it or be aware of it. What you can do, and this is one method, I don't say it's the most efficient one, is called convolution lowering, or image-to-column (im2col), or the Toeplitz matrix approach. This was for a long time the most common one. There are more modern algorithms now, but under the hood they still basically rely on matrix multiplications, even if those are small ones instead of these big ones here. One drawback is that this approach is not very memory-efficient at first.
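As a hedged illustration of the im2col idea just described, here is a minimal NumPy sketch for a single-channel image with stride 1 and no padding (the function names are mine, not from the slides):

```python
import numpy as np

def im2col(x, k):
    """Lower a single-channel image x of shape (H, W) into a matrix whose
    columns are the flattened k-by-k patches (stride 1, no padding)."""
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

def conv2d_via_matmul(x, w):
    """2D 'convolution' as used in deep learning (cross-correlation),
    expressed as a single matrix multiplication over the lowered image."""
    k = w.shape[0]
    cols = im2col(x, k)              # (k*k, out_h*out_w)
    out = w.ravel() @ cols           # one GEMM covers all the patches
    return out.reshape(x.shape[0] - k + 1, -1)
```

The memory cost the speaker mentions is visible here: `cols` stores every pixel roughly k*k times, so im2col trades memory for one big, efficient GEMM.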
Lastly, when you look at LSTMs, it's the same thing. You have an input sequence x at time t and you want to produce an output sequence h of t, and you have all these gates here. These are activation functions, and these are element-wise multiplications, but what you essentially have is all these matrix multiplications with the input vector. This x is a feature vector; you act on it with some weights and compute outputs, and you do this a couple of times for a typical LSTM. So that means it's all about matrix multiplications, and in the end you want to make those fast in a distributed setting.
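To make the "gates are matrix multiplications" point concrete, here is a minimal sketch of one LSTM step in NumPy (my own illustrative naming; all four gate pre-activations come from just two GEMMs):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step with input dim d and hidden dim n.
    W: (4n, d), U: (4n, n), b: (4n,). Two matrix multiplications
    produce all four gate pre-activations at once."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:n])          # input gate
    f = sigmoid(z[n:2 * n])      # forget gate
    o = sigmoid(z[2 * n:3 * n])  # output gate
    g = np.tanh(z[3 * n:4 * n])  # candidate cell state
    c_t = f * c_prev + i * g     # element-wise, cheap
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

Everything except the two GEMMs is element-wise and cheap, which is why distributing an LSTM comes down to distributing matrix multiplications.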
One caveat, though: usually the feature vectors here are very small, so it might not make sense to distribute them. It depends on the size: the way you distribute the training depends on the size of the objects you're dealing with, and I will come to that next.

At first, you don't have to care about this yourself, but it's nevertheless good to know in case you become more interested at some point in looking at the layers underneath, at what is actually going on on the system. This is the very HPC part of the talk now, and if you're not really interested in analyzing the communication behavior of an application, you might not need to pay that much attention, but it's interesting to know about certain things. Okay: communication complexity.
Let's talk about this briefly. Usually in a training setting you have a number of workers, or ranks, or processes, which we call P, and they need to communicate data. Communicating data costs you bandwidth: every packet you send takes a chunk of the bandwidth of the network. You have a latency: the message needs some time to arrive at its destination. And you have some overhead. For example, when you want to communicate something from a GPU, you might have to download the data and then send it off to the interconnect, or the interconnect can grab it directly from the GPU, but you still have some overhead to pack it onto the interconnect and then ship it off from there. And you have a message size s. You can care about three different things.

You can care about runtime. I think this is what most people here care about, because in a practitioner setting you just want to make your communication go fast. You can also think about memory: some of these communication primitives need a lot of additional memory to work.
Memory is usually not a big deal for deep learning. For example, if you have GPUs, you can do the training on the GPUs and all the communication on the CPUs concurrently; the CPU has much more memory, and the objects you want to communicate are much smaller than the whole model with its weights and activations. So this is usually not an issue, but of course, if you are in a setting like distributed edge computing, it might actually be one. The last concern is energy: how much energy does my algorithm consume? There is a difference between static and dynamic energy. Static is basically the baseline of the algorithm, but some algorithms have communication patterns which are more intense than others. That means you get spikes in your energy consumption, and that can be important when you join a company like, I don't know, Facebook, and you want to operate a data center at full throughput but under the power envelope; then you might want to look at that stuff.

Okay, so just to explain how this works: you have a process P0 which wants to send a message to process P2. It packs up the message, which is this overhead o.
Then it shoots off one packet, another packet, another packet. If we do multi-threaded communication, you can split the message up, and every small fraction basically costs one unit of inverse bandwidth. The latency is then the time from when a packet enters the interconnect to when it arrives at the NIC at the destination, and then you need some overhead on the receiving side to take the message, unpack it, and use it. This means the communication complexity, the time it takes to send a single message, is the latency plus two times the overhead (because you need to pack and unpack it) plus the size times the inverse bandwidth: T(s) = l + 2o + s*b. Okay, so that's the communication model. This is for sending a single message to a single worker.
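That cost model is simple enough to write down directly (a sketch; the symbol names follow the formula above, and the numbers in the note below are made up for illustration):

```python
def message_time(s, l, o, b):
    """Time to send one message of s bytes under the simple model from
    the talk: T(s) = l + 2*o + s*b, with latency l, per-end pack/unpack
    overhead o, and inverse bandwidth b (seconds per byte)."""
    return l + 2 * o + s * b
```

For a large message, say `message_time(1e6, 1e-6, 5e-7, 1e-9)`, the bandwidth term s*b dominates; for tiny messages the fixed cost l + 2o dominates, which is exactly why the best collective algorithm depends on the message size.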
Then you can think about what kinds of communication primitives you need, and there are rooted ones and rootless ones. What's the difference? Rooted means something like a broadcast or a gather, where one process sends data to every other process, or one process gathers data from all the other processes. That's rooted because you have one node, one worker, which sticks out. That happens, for example, when you use a parameter server in deep learning: then you have these kinds of rooted communication patterns.

And there are things you can do. The very simple one, which I think is what most deep learning frameworks implemented in the beginning, is a flat tree. It's very simple: you send a message to, or receive a message from, every worker individually, and you can think about how that scales with the number of workers. That is the simplest thing you can do, but it's still important, because it's a building block of a lot of other communication algorithms. Then you can also do trees. The next better thing is a tree with one root node, and if you assume you send messages to the left branch first, you can compute the whole communication complexity by just looking at how long it takes to reach the last node.
The root can basically send a message directly toward node six. What you see here is that the latency scales with the depth of the tree, which is bad, but in terms of message sizes it also scales with the depth of the tree and not with the number of workers, so it's definitely better when you have bigger messages. The idea is that you shoot the messages toward nodes one, two, three, four, five, six: you have to wait until node zero has sent its messages to nodes one, two, three, but you don't need to wait for node six to have received one before sending the next. So one term is the overhead of packing up the P minus one messages plus the bandwidth cost of each message, and the other is the overhead of unpacking on the receiving side at node six. Of course, in a real setting there can be congestion, or some of these paths can be longer than others.

So you basically take the longest path, the maximum latency. For example, a supercomputing system like Cori has an Aries interconnect, which is a dragonfly topology with diameter five. That means you can send a message from one node to any other node in five hops at most, but there are closer connections: within the same chassis you might have only one or two hops.
Which means, of course, that when you have a tree spanning the whole machine, the relevant latency is the latency of that five-hop connection. The key point is that you send a message down this tree, and this subtree doesn't need to wait for that subtree to finish: once the message is sent and a child node receives it, that node can start broadcasting further. This branch runs in parallel to that branch, and every subtree runs in parallel to every other subtree. This is a binary tree, but you can do it with K children, a K-ary tree. One more thing: this one is non-personalized, which means it's like a reduction or a broadcast, so every node gets the same message. You can also think about the personalized one, where every node gets its own message; in that case you need to send more packets down the first branch, because every node needs a different message. Then the communication complexity is a bit bigger, but it's possibly still better than the direct send.
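Under the talk's cost model, the flat tree and the binary tree can be compared directly (a sketch under the stated assumptions: equal links, non-personalized broadcast, and a critical path of one message per tree level):

```python
import math

def message_time(s, l, o, b):
    """T(s) = l + 2*o + s*b, the single-message model from the talk."""
    return l + 2 * o + s * b

def flat_tree_broadcast(p, s, l, o, b):
    """The root sends the same s-byte message to each of the other
    p - 1 workers in turn, so the cost grows linearly in p."""
    return (p - 1) * message_time(s, l, o, b)

def binary_tree_broadcast(p, s, l, o, b):
    """Subtrees forward in parallel, so the critical path is roughly
    one message per level, i.e. about log2(p) messages deep."""
    return math.ceil(math.log2(p)) * message_time(s, l, o, b)
```

With, say, p = 256, the flat tree pays 255 message times while the binary tree pays about 8, which is the "scales with the depth, not the number of workers" point from the slide.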
This is what you use when you do a collective communication: all the nodes communicate collectively, and this assumes you need to send the same data to all the nodes. For example, at the beginning of your training you do a model broadcast: you need to copy the weights to all the nodes, and for that you would use something like this. You basically send your model along this tree so that every node gets the same model, and you need to wait until all the nodes are done. Assuming you send through the left branches first (the right side is mirrored), this is the time it takes for the data to arrive at node six, assuming all the communication links are equally fast.

This is technically just an upper bound; in a real setting you would need to run simulations to get the actual number. But it lets you see which kinds of algorithms are better in different settings. The good thing here is that you don't have this many-to-one pattern, because if you have a lot and lot of nodes doing that, you basically flood the buffers of the interconnect.
What happens then is immediate network congestion: every node is sending to one node, so all the links of that node become totally saturated with data. The message queues in the interconnect receive the messages and put them in a queue, and if the queues run full, it will not accept further messages and will tell the sending side to wait before sending more, and then you get back pressure in the network. So if you have a lot and lot of nodes, this thing will actually break down; that's why these trees are better. This is the personalized variant: personalized means every node gets a different message, which is why you need to send three messages here; one is consumed by this root node, and then it sends one down the tree.

There is one more rooted pattern, the pipeline, which is actually important, and I will say why on the next slide. Recently, I would say two years ago, there was a "breakthrough" in distributed deep learning where they implemented an algorithm for distributed training which is actually based on a pipeline, but in the HPC world this is quite old stuff. The idea, non-personalized or personalized, it doesn't matter, is that you broadcast.
You want to broadcast these messages to all the nodes, and assume every node should get the same thing. What you do is inject a message into node one, and this node passes it on down the chain. For the personalized case, you first send the message destined for the last node, then in the next step the one for the second to last, and so on, and the nodes just pass them down the pipeline. In the end you can compute that the cost is basically just two times the length of the pipeline. And when you close it, so node seven feeds back to node zero, you have a ring. There was this paper, by Baidu I think, where they implemented a ring reduction algorithm, and in the deep learning community the reaction was: "Oh wow, that's awesome." But actually that's a pretty old concept, and it's not very efficient at scale anymore, because it scales with the number of nodes you have in the pipe, with twice that number. Still, if you have a few nodes, you can nicely hide a lot of the communication. So in general this is not a bad algorithm, but you should not use it at scale.
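The Baidu-style ring all-reduce the speaker refers to can be simulated in a few lines (a toy sketch, not an MPI/NCCL implementation: sends are modeled as array assignments, and the 2*(p-1) steps are a reduce-scatter phase followed by an all-gather phase):

```python
import numpy as np

def ring_allreduce(chunks):
    """Toy simulation of a ring all-reduce over p workers. Each worker
    starts with its own array and ends with the elementwise sum after
    2*(p - 1) neighbor-to-neighbor steps."""
    p = len(chunks)
    segs = [np.array_split(np.asarray(c, dtype=float), p) for c in chunks]
    # Reduce-scatter: in step t, worker r sends segment (r - t) mod p to
    # its right neighbor, which adds it to its own copy.
    for t in range(p - 1):
        for r in range(p):
            dst, seg = (r + 1) % p, (r - t) % p
            segs[dst][seg] = segs[dst][seg] + segs[r][seg]
    # Now worker r holds the fully reduced segment (r + 1) mod p.
    # All-gather: pass the completed segments once around the ring.
    for t in range(p - 1):
        for r in range(p):
            dst, seg = (r + 1) % p, (r + 1 - t) % p
            segs[dst][seg] = segs[r][seg].copy()
    return [np.concatenate(s) for s in segs]
```

Each step moves only a 1/p fraction of the data per worker, which is why the ring hides communication well at small p, while its 2*(p-1) step count makes it a poor choice at large scale, as the speaker says.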
For example, assume you have a system where your GPUs are connected in a linear fashion or in a ring: you can easily use that to do the reduction within the box. But once you go out to multiple boxes, it might be inefficient. As I said, it's not a bad algorithm, but for scale it's not very good.

Then there are the rootless patterns. This is a rootless example: everybody gets everything. When you have the ring, technically everybody gets every message if you do it right, and this is the direct send. Assume you have an all-to-all connection, for example a DGX box, which is basically a box with two CPUs and eight NVIDIA GPUs, or, in the more modern DGX-2 version, sixteen GPUs in a box connected in an all-to-all fashion. You can do the direct send there because you have a certain amount of bandwidth between all the GPUs, bidirectional, something like 25 gigabytes per second, so you can shoot off messages to everybody at the same time: everybody sends to everybody and everybody receives from everybody. If you have enough bandwidth, that is awesome, and it just scales with the number of processes. So within a box, that's totally fine.
A more clever algorithm is the butterfly. This is actually how a lot of all-style collectives are implemented if you have a power of two in the number of nodes; if not, it becomes very complicated and I don't want to talk about that. The idea is that you start at the beginning and send to your nearest neighbor: node 0 sends to node 1 and node 1 to node 0, pairwise, and that's the first round. In the next round you send to the next-nearest neighbors, and then to the next group, and so on. The cool thing about this is that it scales with the binary log of P. In this case it's non-personalized, so everybody ends up with the same message, which means it performs an all-reduce in log2(P) time. So this is quite efficient.
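A minimal simulation of the butterfly (recursive-doubling) all-reduce just described, for a power-of-two worker count (a sketch; the XOR partner rule is the standard formulation of the pairwise exchange pattern on the slide):

```python
import numpy as np

def butterfly_allreduce(values):
    """Toy butterfly all-reduce for p = 2**k workers. In each round,
    worker r exchanges its partial sum with partner r XOR step; after
    log2(p) rounds every worker holds the global sum."""
    p = len(values)
    assert p > 0 and p & (p - 1) == 0, "power-of-two worker count only"
    vals = [np.asarray(v, dtype=float) for v in values]
    step = 1
    while step < p:
        # All exchanges in a round happen concurrently, so build the new
        # values from the old ones before replacing them.
        vals = [vals[r] + vals[r ^ step] for r in range(p)]
        step *= 2
    return vals
```

This takes log2(p) rounds instead of the ring's 2*(p-1) steps, at the price of needing good bisection bandwidth, which is the interconnectivity caveat the speaker raises.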
But of course, as you see, you need a lot of interconnectivity here; otherwise you will congest the network. This is a very important algorithm to think about; I think most distributed Fourier transforms, which are like big all-reduces, are implemented with this. There is also a personalized version of it, where the message size grows as you go along the tree. Okay, so one thing to remember: the optimal collective communication depends on the use case. If you are, for example, aiming at the best runtime, you should look at the runtime complexity. All these algorithms also have memory complexities, so you need to take that into account too. They have a memory complexity and an energy complexity; I have not put those on the slide, because I don't care that much about them here. But you should really think about these things when you design a cluster which has to operate under power constraints or memory constraints.

The most important thing is: look in the HPC literature, because a lot of stuff is getting reinvented in the deep learning world that has actually been well known in HPC for decades. There are a lot of fancy algorithms, even fancier ones than I presented, where you combine trees with pipelines, or butterflies with pipelines, things like that.
You can do a lot of crazy things, and if you have a certain communication pattern, or somebody implemented a library you use and you get stuck on something, you should look in the HPC literature; there is a lot of material. At the end of this talk you will get a list of suggested reading about these kinds of things. As I said, the deep learning literature sometimes tries to reinvent the wheel; don't fall into these traps. Most of this stuff is old news.

And one thing I want to say, it may sound dated, but MPI is actually very optimized. When your library makes use of MPI, for example, you can be fairly sure that it implements the most efficient algorithms available, because the MPI that ships with a cluster usually already makes the right choices for you. For example, on an HPC system we have a tuned MPI. It makes use of a lot of hardware features, for example hardware atomics, where you can do accumulations of small messages very efficiently in the network hardware, and it respects topology: it can switch between, say, a butterfly and a tree algorithm when you cross long-distance links with lower connectivity in the network topology, and things like that. These libraries are usually very well optimized for the number of processes, the message size, and the topology, so they make the right choices for you, and I recommend using libraries like that. It doesn't necessarily have to be MPI itself; there are things like NVIDIA's NCCL, for example. When you use GPUs, just use that and let the NVIDIA folks make sure it does a good job, or the corresponding Intel library for a commodity cluster. Okay, so now more specific to deep learning.
We have certain parallelization strategies. There is data parallelism, where every process is running the same model and then you reduce the gradients; I will talk about that later. There is model parallelism, where you have a single model which is split across the ranks. And then there is layer pipelining, where you partition by layer: for example, process one does one chunk of the model, process two the next chunk, so the layers are distributed across the ranks or workers. That is basically what was implemented in Google TensorFlow by default: if you use the with-device scope and put different parts of the model on different GPUs, it will essentially use this kind of parallelism. This is the extreme version of it: you have, let's say, a weight matrix and an input vector, and you put them on rank 0. Then you compute the output, pass it on to node 1, where it's multiplied with another weight matrix, and pass it on to node 2. Of course you won't do this most extreme version; you would have a couple of layers on rank 0, a couple of layers on rank 1, and so on. This is just for illustration. And the good thing about it...
For the backward pass, you need to compute the gradient with respect to the activations, because you need the gradient with respect to the output, and the gradient with respect to the weights. The point is that the weight gradient is what you need to update the weights, and the activation gradient is what you need for your backpropagation. So when you go back, you get the activation gradient from the next node, node 2, you multiply it with the transposed weight matrix, and you send that back toward node 0, where it's dotted with the transposed weight of that node. Basically you just run the whole pipeline backwards; that's all you do. That is for computing the gradients of the activations. For the gradients of the weights, in order to update the weights, you do essentially the same: you take this vector, but this time you don't multiply it with the weights but instead with the activation that came in from the other node. So it's the same communication pattern; you just multiply with a different vector, and then you have the gradient and can incorporate it. There is still no collective communication necessary; you just need to pass these activation gradients along the pipeline. There is one thing, though. First, this is a very simple implementation.
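The data flow just described can be sketched in NumPy for a hypothetical two-rank chain of linear layers (illustrative only: the "sends" are plain assignments; in a real pipeline only the activations and activation gradients would cross rank boundaries, while weight gradients stay local):

```python
import numpy as np

rng = np.random.default_rng(0)
W0 = rng.standard_normal((4, 3))   # owned by rank 0
W1 = rng.standard_normal((2, 4))   # owned by rank 1
x = rng.standard_normal((3, 5))    # input batch, lives on rank 0

# Forward: each rank does a local matmul and "sends" the activation on.
a0 = W0 @ x          # rank 0, then send a0 to rank 1
y = W1 @ a0          # rank 1

# Backward, for the toy loss L = sum(y), so dL/dy is all ones.
dy = np.ones_like(y)
dW1 = dy @ a0.T      # weight gradient: local to rank 1
da0 = W1.T @ dy      # activation gradient, "sent" back to rank 0
dW0 = da0 @ x.T      # weight gradient: local to rank 0
```

Note how the backward pass traverses the same chain in reverse with the transposed weights, exactly as on the slide.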
You can just do it. On the other hand, while you pass a batch down this pipeline, you do not want to wait for it to come all the way back through the backprop before integrating the gradient; that would be fully synchronous training. No, you really want a pipeline: you feed in batch zero here and pass the result on to the next node, and while you do that, you feed in batch one, and so on, so this node works on batch zero while that one works on batch one. And when you look at it, by the time batch zero has gone all the way down and all the way back, you have its gradient, but you have already fed batch five into the system. That means that once you incorporate that gradient into the model, all the gradients still in the pipeline are already outdated.

So you have some kind of asynchronous training, and the deeper your pipeline is, the more problematic it becomes, because these gradients get more and more outdated the deeper you make it. Yes, this is an issue if you go to the extreme: a pipeline of depth one or two is usually fine, but if you make it a thousand nodes long, it will not learn anything, because then you incorporate a gradient that is a thousand steps outdated, and that just doesn't make any sense.
That means you need to chunk up your model, the layers, in such a way that the computation time is almost constant between the chunks, and that is quite tricky; the load balancing here is very hard. This is why I do not recommend it: with two or three GPUs it's fine, but if you go to the extreme, it won't work.

Now for model parallelism: you can, for example, split in the feature dimension of a fully connected layer. What does this look like? You have this weight matrix W, and process 0 owns the upper half while process 1 owns the lower half; the input feature vector x is owned by everybody. One dimension is the number of features, the other is the batch size. You multiply them and get an intermediate result, and then, in order to produce an output feature vector which is shared across all the nodes, you need to gather it, because node 0 needs the results of node 1 and node 1 needs the result of node 0; with more nodes you need to gather the results from all of them. In the language from before, this is a rootless, personalized communication.
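This feature-dimension split can be sketched as follows (a hedged illustration with made-up helper names; each "rank" owns a horizontal slice of W, the matmuls are purely local, and the concatenation plays the role of the all-gather):

```python
import numpy as np

def split_rows(W, p):
    """Give each of p ranks a horizontal slice of the weight matrix."""
    return np.array_split(W, p, axis=0)

def model_parallel_forward(W_shards, x):
    """Each rank computes its slice of the output locally; the
    concatenation stands in for the all-gather across ranks."""
    partials = [Ws @ x for Ws in W_shards]   # local, no communication
    return np.concatenate(partials, axis=0)  # the all-gather step
```

The local matmuls are embarrassingly parallel; the cost is entirely in the gather that rebuilds the shared feature vector, which is the point the speaker makes next.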
B
Okay,
so
this
means
is,
all
gather
is
necessary
for
the
forward
pass,
so
the
forward
pass
is
not
local.
So
for
every
step
in
the
forward
pass,
you
need
communication,
ok,
which
is
bad
because
technically
you
need
to
wait
till
all
the
nodes
are
done
with
this
to
cast
or
gather
to
basically
grab
the
results.
B
Backward
pass
is
similar
so
for
computing,
the
gradients
of
the
weights
that
can
be
done,
total
locally.
So
you
just
have
to
you.
Have
this
the
transposed
inputs?
You
have
to
output
right
because
you
get
at
it
and
then
you
just
do
local
metal
and
get
a
gradient
of
the
weights,
and
you
just
take
your
chunk
to
the
chunk.
You
need
fine
you're
done
so
that's
totally
local,
but
the
gradient
for
the
activations.
That
is
order
for
the
input.
B
— that's actually more tricky, because here, when you transpose the weights and multiply them with the incoming gradient, you get an intermediate result of very low rank — a matrix multiplication over a very small fraction of the data — and you need to all-reduce that across a big group. You need that in order to do the backprop, because this quantity is needed by the previous layer to continue the backpropagation.
B
So that means for the backprop through your network you need to communicate: in the forward pass you need an all-gather, and in the backward pass you need an all-reduce. Only the gradient updates can be done locally, so you cannot overlap these things very nicely. That is quite bad.
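To make the feature-split scheme concrete, here is a toy Python sketch (my illustration, not code from the talk): two simulated workers each own half the rows of W, the forward matvec is purely local, and the concatenation at the end plays the role of the all-gather.

```python
# Toy sketch of model parallelism over the feature dimension: y = W @ x
# split across two "workers" along the rows of W. Each worker multiplies
# its chunk locally; the all-gather then gives everyone the full output.

def matvec(W, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4 output features, 2 input features
x = [1, 1]                              # input vector, replicated on every worker

W_parts = [W[:2], W[2:]]                # worker 0 owns rows 0-1, worker 1 rows 2-3
partials = [matvec(Wp, x) for Wp in W_parts]   # purely local matmuls

# "All-gather": concatenate the partial outputs so every worker holds y.
y = [v for part in partials for v in part]
assert y == matvec(W, x)                # identical to the unsplit layer
```

The assertion checks that the distributed result matches the single-node layer exactly; in a real run the concatenation would be an `MPI_Allgather` or equivalent collective.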
B
The one thing is: if you have a very large model, you can split it up. But you don't want to split it up too aggressively, because then you run out of parallelism on the node. If you have a matrix which is, say, 1024 by 1024, and you chunk this dimension down to 1024 by, say, 128 —
B
— that means you don't have a lot of parallelism left on the node, so you cannot use your multi-core CPU or your GPU very efficiently, because the matrix multiplications are small and you get a lot of overhead. So you cannot take this too far: you're limited by the model size in how far you can scale out. The good thing is you don't have this growth of the batch size, because the batch size is still the local batch size, right?
B
So you don't need to tweak your hyperparameters: if it works on a single node and you spread the model across nodes like this, you can still use the same parameters and it will just work. But the forward and backward passes become expensive because of, as I said, rootless collective communication, which is quite bad. And especially for the backprop: when you are backpropagating at layer k, you cannot go on to layer k-1 without waiting for this collective to finish, so it's hard to overlap —
B
— communication with computation in this setting. The other thing is that, because you need the full feature vectors on all the nodes, the input is also shared across the nodes. Everybody gets the same input vector, so either everybody reads the same data from the file system, which is kind of bad for the file system, or just one node reads the data and distributes it, which again is another communication step you might want to avoid.
B
Yes, and you need big models for this to pay off. Actually, you also need to store the full activations per rank, which can be quite expensive, because the activations are usually much bigger than the weights. You save on the number of weights you store per node, but you want to keep the activations around, and these are usually much, much bigger: if you have a sparse network like convolutions, the weights are kilobytes, while the activations can easily be a megabyte or more.
B
You basically have the stencil — the filter — moving over the image, and you compute a weighted sum. When you think about it from an HPC perspective, it's a stencil operation, basically like a differential-equation kernel operating on chunks of a data set. So what you could do, if you have a big image: chunk up the input image into domains and then compute the output per domain.
B
The issue there — well, the good thing is you save on storing the whole input vector on every node, because you chunk it up; but you need nearest-neighbor communication. The problem is that the filter has an extent, and when you hit the boundary of your domain, you technically need data points from your neighbor. So you need to do a nearest-neighbor exchange. This is quite common in HPC, where you basically have partial differential equations.
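As an illustration of that nearest-neighbor (halo) exchange — a toy 1-D sketch of mine, not from the slides — two workers each own a chunk of the signal and swap one boundary element before convolving locally:

```python
# Toy sketch of domain decomposition for a 1-D convolution: the input is
# chunked across two workers, and because the 3-wide filter reaches past
# each chunk's edge, the workers first exchange one boundary element with
# their neighbor (a "halo exchange") before convolving locally.

def conv_valid(signal, filt):
    k = len(filt)
    return [sum(filt[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
filt = [1.0, 0.0, -1.0]

left, right = signal[:3], signal[3:]     # each worker owns one chunk

# halo exchange: each side receives one element from its neighbor
left_halo = left + [right[0]]            # worker 0 gets right's first element
right_halo = [left[-1]] + right          # worker 1 gets left's last element

out = conv_valid(left_halo, filt) + conv_valid(right_halo, filt)
assert out == conv_valid(signal, filt)   # same as the undistributed result
```

For a filter of width k the halo is k // 2 elements per side; on a real machine this exchange would be a pair of point-to-point sends between neighboring ranks.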
B
So that can be quite costly, but you can do it if you want. I think this only makes sense if you have a huge input image, like a gigapixel panorama or something; otherwise this won't help you a lot — just as a note. The other thing is you can split up the filters: the number of output filters — sorry, the number of input filters.
B
You can basically split up the number of input filters, and technically also the number of output filters. In general, this kernel G has the output-filter dimension I mentioned, the input-filter dimension, and the height and width of the kernel. You just split this thing up, and different nodes compute different chunks of it, but here too you need an all-reduce in the end — and an all-gather — because every node wants the whole output.
B
So this is not very efficient either, and you don't save much, because these kernels cost almost nothing to store. So, just in case this comes up: some people have thought about it, but I think it's not really feasible. Data parallelism — that's the most important one, because this is what basically all the frameworks do. Model parallelism is very hard to implement framework-wise, I would say. So this is the way it's done today, and it also causes these issues of large-batch training —
B
— which I will talk about after that. So how does this look? What you do is you just distribute the batches among the workers. So when you have an input vector X — this is the global input vector — processor 0 holds all the features, but only a chunk of the whole batch. Then you multiply it with W, and what you see is a local matrix multiplication. That means you don't need any communication for the forward pass — none — which is quite nice.
B
The thing is that all the weight matrices have to be replicated across the workers. That's fine; usually these are not very big. So there's no communication for the forward pass. The backward pass is a bit more tricky, but it has nice features too. First look at this: when you do the backprop, in order to compute the gradient for the previous layer, you basically need to do a local matrix multiplication, this time with the transposed weights.
B
But since the weights are local, that's fine, and the derivative of the activation is also still process-local here, so you can do a local matrix multiplication. This has the following impact: when you backprop through your network and you are at layer k, you do not need to wait for any communication. You can just do a local backprop of this gradient and then go on to layer k-1.
B
You do not need to communicate anything for that. While this is happening, you can compute the weight updates, which do require communication. As you see here, you have the gradient of the activation and the input features, you dot them together like that, and then you need to all-reduce so that everybody has the whole weight gradient. So the only communication required here is for the weight update, which is basically the reduction of the gradients across all the nodes. That's all it is.
B
So this is a pretty nice scheme, actually: the forward pass is completely local, and the backward pass can proceed locally without any communication, except when you want to update the weights — only those need to be communicated. So you have a lot of possibilities to overlap communication with computation here, and the activations are split across ranks, so it reduces the memory footprint as well.
B
You don't need to store all the activations for the full batch, since you split up your batch. The weights get duplicated, so you need them in full on every rank; depending on the size of your model that might be bad, but usually it's not too bad. The batch size grows, though. You can of course say: I have a batch of 256, and then I go to 256 nodes and have a local batch size of 1.
B
You can do that, but then you run out of parallelism on your node, and what you'll see is that your communication overhead grows dramatically while your local parallelism is very low, so this is not efficient. What you usually do is keep a meaningful local batch size — 8, 16, whatever is reasonable performance-wise — when you scale out, and then the global batch size is just the number of workers times the local batch size.
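A minimal sketch of this synchronous data-parallel loop (illustrative Python, simulating the workers in one process; the 1-D least-squares model and the data are made up):

```python
# Toy sketch of synchronous data parallelism: every worker holds a full
# copy of the weights, computes gradients on its own local batch, and the
# gradients are averaged (an all-reduce) before the identical update.

def grad(w, batch):
    # gradient of mean squared error for the 1-D model y = w * x
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
lr = 0.05
local_batches = [                     # global batch of 4 split across 2 workers
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
]

for _ in range(50):
    local_grads = [grad(w, b) for b in local_batches]   # local backprop, no comm
    g = sum(local_grads) / len(local_grads)             # "all-reduce": average
    w -= lr * g                                         # same update on every rank

print(round(w, 3))  # -> 2.0, the slope of the synthetic data
```

Because every rank applies the same averaged gradient, the replicated weights stay bit-identical, which is exactly why only this one reduction per step is needed.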
B
When you do the backprop, you can hide the backprop through the model behind the computation and reduction of the gradients, but you still have to wait for the last gradients to be reduced; in the synchronous setup you really wait until the reduction is done, so the scaling can be problematic.
B
This one, I think, came up a bit at the beginning: you send your gradients to a parameter server. The parameter server keeps track of the model, so it has the latest weights; the workers send their gradients, it incorporates them into the model and sends the model back. Nobody waits for anybody here: when you're ready, you ship it off, and it's very resilient — if a node dies, it dies.
B
Nobody was waiting for that guy; it will just keep working. However, when you think about it, since this is very asynchronous, the gradients the server receives are from different versions of the model all the time. Assume one worker is always fast and all the others are slow: the fast one will contribute a lot of fresh gradients, but all the others might contribute very old ones.
B
You can mitigate that by spreading it out — say, a different parameter server for every layer — but it's still a bottleneck, and you might waste compute resources, because you might want to use that node for training instead. Then, more recently, there's the stale-synchronous update, also called pipelining, and it works like this. Assume you have two independent systems — for example, an accelerator and a host processor — or you have a very powerful interconnect.
B
Then you can do the following. Say you have a lot of additional free compute on your CPU: on a multi-core or many-core CPU you can run the computation on, I don't know, 64 threads, and with the four remaining threads you can of course do other stuff.
B
So what you can do here is this. The idea is that while the workers compute the fresh gradients for the model independently — you can do this locally — they push them into a local queue. While they do that, you pop gradients from a previous step off the queue, reduce them, and incorporate them back into the model. The good thing is that you can basically overlap the forward with the backward pass that way.
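A toy sketch of that pipelined update (my simplification to a single worker and a one-step-stale queue; the quadratic objective is made up):

```python
# Toy sketch of the stale-synchronous ("pipelined") update: fresh gradients
# are pushed into a queue, while the update applied to the model pops the
# gradient from one step earlier. The staleness is a fixed one step, unlike
# the fully asynchronous parameter-server scheme.

from collections import deque

def grad(w):
    return 2 * (w - 3.0)          # gradient of the toy loss (w - 3)^2

w = 0.0
lr = 0.2
queue = deque([grad(w)])          # prime the pipeline with one gradient

for _ in range(40):
    queue.append(grad(w))         # "forward/backward": push a fresh gradient
    g = queue.popleft()           # pop the gradient from the previous step
    w -= lr * g                   # update with a one-step-stale gradient

print(round(w, 4))                # still converges to 3.0, just delayed
```

In a real system the push would happen on the accelerator while the pop-reduce-update runs concurrently on the host threads, which is where the overlap comes from.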
B
The downside is that you all-reduce gradients which are outdated by a couple of steps — it can be one step, it can be more, and it can also be made dynamic — so you technically don't incorporate the freshest gradient. On the other hand, it's not as random as the asynchronous approach, where you contribute gradients of different ages all the time: here they are outdated by one step, by two steps, by three steps, but always by a fixed amount.
B
Okay, so that is actually much better, but it's not very resilient: if one node drops out here, at least this step will fail — it will basically stall on that. So resiliency-wise it doesn't help, but it can smooth out runtime variability. When you have a lot of fluctuations in your network, you can think about making this dynamic: okay, I cannot communicate right now, my network is totally jammed with communication —
B
— let's store a couple more gradients before we continue. [In response to a question:] Every gradient, once it makes it into the queue, gets incorporated. No — you just push them into the queue; you basically have just two gradient buffers, so you have one old buffer —
B
You use the gradients from that, and then you have a new gradient buffer, and once the old ones are incorporated, you copy the new ones over into the old buffer. Equivalently, you can have a queue where you just line them up, and once it's their turn they get incorporated no matter what. But it's not resilient in the sense that if one of your workers dies or is very slow, you have to wait for that guy —
B
— all the time. But as I said, if you have an HPC system — let's say a mature HPC system, in the sense that it has been operated for a while and you understand it — you can count on almost all the nodes being equally fast, so you don't have these big problems. On a commodity cluster, of course, the performance variation can be much bigger, and that can be a problem there.
B
So, as I said, we just considered the synchronous all-reduce as the easiest case. The other cases — basically the pipelined one with the outdated gradients — are similar in spirit, but I will just discuss this one, because it's the one most people will use. So you have a local batch size of B, and the global batch size is just the number of workers times B. Now, when you think about stochastic gradient descent — I think you heard about that in the last week — we have the weights.
B
You compute the derivative of the loss with respect to the weights, average over the batch, and incorporate it back. Stochastic gradient descent is not like conjugate gradient, which is deterministic, but you basically still go in the steepest-descent direction, right?
B
The idea is now: because you have a larger batch, you average over more samples, so you have a more precise gradient. Since this average on the slide here goes over more samples, you can think: okay, I know much better what my actual gradient is. So instead of doing, for example, three steps — step 1, step 2, step 3 — I might think:
B
okay, since I know my direction better, I might just do one big step with three times the size. So I basically increase the learning rate linearly, and hopefully I end up in a position very similar to where I would have ended up with the three smaller steps. And that works out when you look at it: if you do two consecutive steps with batch size B, you can approximately summarize this as doing one step of batch size 2B at once.
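You can check the argument numerically with a toy example (mine, not from the talk): two small-batch SGD steps versus one combined step with a linearly scaled learning rate, on a simple quadratic loss.

```python
# Toy sketch of the linear learning-rate scaling argument: two SGD steps
# with batch size B and rate lr land close to one step with batch size 2B
# and rate 2*lr, as long as the gradient changes little between the steps.

def grad(w, batch):
    return sum(2 * (w - x) for x in batch) / len(batch)  # d/dw of mean (w-x)^2

b1, b2 = [1.0, 2.0], [3.0, 4.0]   # two consecutive small batches
lr, w0 = 0.01, 0.0

# two small steps, batch size B each
w = w0
w -= lr * grad(w, b1)
w -= lr * grad(w, b2)

# one big step, batch size 2B, learning rate scaled linearly
w_big = w0 - 2 * lr * grad(w0, b1 + b2)

print(w, w_big)                   # nearly identical for a small learning rate
assert abs(w - w_big) < 1e-2
```

The gap between the two results is exactly the second-order term the speaker mentions next: it stays small only while the gradient barely changes between the two small steps.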
B
So you scale the learning rate to do an equivalent step. That of course assumes — and that's the approximation here — that the gradients are basically similar, that they are not varying heavily from batch to batch. Of course, if your batch size becomes huge, this assumption breaks down. So that is the problem.
B
So the other idea here is: the gradient covariance scales basically like the learning rate squared over the batch size, which means that when you scale the batch size by a factor of n, you see that if you scale your learning rate by the square root of n, you might get the same noise level as before. So that's the other idea: you think not of the gradient itself but of the noise — the covariance, the correlation between the gradients.
B
So you can try both — these are different approaches, and it really depends on your model and your training; you have to try it out. I think there's this paper from OpenAI where they basically look at the noise and try to determine an optimal learning rate depending on the batch size, and here you see it scales very nicely and then flattens off. That means you hit a point where the noise in your gradient is technically too low for larger batches to help —
B
— beyond that it basically won't work well. But up to a batch size of, say, a hundred here, you can still get a good speedup by just tweaking the learning rate accordingly. On the other hand, you'd technically have to reproduce this plot for every model you run, which is costly — you don't want to make this plot, because by the time you can make it, you have already trained your model.
B
Alright, so that's kind of a chicken-and-egg problem, but maybe there's a way of deriving more general rules for it. So this is why I say: try linear scaling and square-root scaling — that's about it. There's also the observation that in the initial stages the gradients are very random. So even if you average over a large batch, you might think: oh yeah, I know my gradient very well — but that's not true, because since it's very random, you are averaging over a lot of very random quantities.
B
Basically, the larger the batch is, the sharper your minima are. So — the black curve, for example, is the landscape of the training loss function. You see you have a flat minimum here and a very sharp minimum there.
B
Okay, so assume you train your model and you end up in this flat minimum — say you trained with small batches, so your minimizer lands in the flat basin. Then you have the generalization loss — say, the loss on the test set — which is the red curve, and it's a bit shifted: it's not exactly the same, because it's a different set, just another part of the data. So, for example, the optimal minimum will be there, but you are here. So what happens?
B
You will end up evaluating here, which is not that bad. But when you are in a sharp minimum — you trained your model with a large batch size, so you are in this very, very sharp minimum, and the actual minimum you want to be in is shifted somewhere over there — then you end up evaluating it here, and you get a very bad loss, which means the generalization is completely off. So this is basically —
B
— just a conceptual sketch, but this is actually what happens. If you look at the Hessian — the second derivative — you will see that it's very flat around the minimum when you use small batches, and if you average over more gradients per step, i.e. use bigger batches, you'll see that it's very, very sharp around the minimum. I think that's the intuition.
B
No, no — it stays very complicated. It's technically, of course, with respect to the whole dataset; I mean, if your batch size is the whole dataset, you do deterministic gradient descent — you don't do anything stochastic anymore. And then there's also the question of how much of the dataset is actually useful to you, which depends on the complexity of the underlying features in the data you want to learn. You can have a huge dataset, but technically —
B
— the extra samples don't add more information about what you want to learn, so then it doesn't help you either. But determining that is much trickier, because you don't know what the complexity of your dataset is. So that's the trade-off. Yeah, I think the best is just to try — this is really the problem with it; it would be nice to have a more guided way of doing that.
B
They looked at the Hessians — they selected the dominant eigenvalues and eigenvectors of the Hessian — for batch sizes from 64 up to 2048, and here you see that the minima become sharper and sharper. Then you can imagine that when you go to 30,000, this will be very, very sharp, and then you don't generalize anymore: this one generalizes nicely, maybe even that one, but this might be too bad. It also depends on what accuracy you are aiming for.
B
If you want to beat ImageNet records, you have to beat the accuracy of the other folks. But if you have something like a scientific application, you can go and say: okay, we still beat the existing approach by far — maybe a handcrafted decision tree or whatever — and you're still much better than that, but you can train the model very, very quickly, and then you might be happy with that. Of course, if you need high precision, that's tricky.
B
So if you are a company that wants to make money from, for example, natural language processing, the recognition rate has to be extremely high, because people get annoyed: if you only have 98% accuracy, that's really bad — they want 99-point-something. It sounds like a small difference, but for them it really makes the difference between consumers being happy or not. So it depends.
B
These are basically just the top 20 eigenvalues, and as you see, when you go to larger batches, you converge to a much higher spectrum — that basically illustrates the point I want to make. So there are things you can try to do to fix this a bit. At the beginning, they do a linear warm-up, for example, up to the target learning rate, and then decay the learning rate — that's this Facebook paper, training ImageNet in an hour. So — what is the current record?
B
Seventy-seven seconds or something, for the ImageNet training — yeah, something like that. But okay, this was in the past. They show very nicely that when you, for example, do a warm-up and then a learning-rate decay schedule, you can get basically very nice accuracy. So that's something that works. But they also showed — this is the validation error — that it will increase rapidly when you go really beyond.
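A sketch of such a warm-up-then-decay schedule (the epoch counts, milestones, and worker count below are illustrative placeholders, not the paper's exact values):

```python
# Toy sketch of a linearly scaled learning rate with linear warm-up and
# step decay: ramp from the base rate to base_lr * workers over the first
# few epochs, then divide by 10 at fixed milestone epochs.

def learning_rate(epoch, base_lr=0.1, workers=8, warmup_epochs=5,
                  milestones=(30, 60, 80)):
    target = base_lr * workers                 # linear scaling rule
    if epoch < warmup_epochs:                  # linear ramp from base_lr
        frac = epoch / warmup_epochs
        return base_lr + frac * (target - base_lr)
    decay = sum(1 for m in milestones if epoch >= m)
    return target * (0.1 ** decay)             # step decay afterwards

print([round(learning_rate(e), 4) for e in (0, 2, 5, 30, 60, 80)])
```

The warm-up matters because, as discussed above, the early gradients are too noisy to take the full scaled step right away.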
B
So there's another idea: instead of decaying the learning rate, just increase the batch size. So they started with a batch size of 8,000, for example, and then, while they train, instead of decaying the learning rate — since the two have basically a reciprocal relationship — you can say: okay, I just increase my batch size. The idea is: if I'm closer to the right minimum, I might increase the batch size, making the minimum sharper, to get faster convergence.
B
The thing with this is that it's very hard to implement, in the sense that most frameworks don't support it very well, especially in a distributed setting: when you want to change the batch size, you mostly need to dump your model and reload it from a checkpoint. But this is a shortcoming of the frameworks, not a principal issue. And there's also this adaptive batch size scaling developed at Berkeley, which does this more dynamically.
B
So the idea is to use second-order information — basically the curvature of the loss surface at the point where you are — in order to increase or decrease the batch size. I don't want to talk about this very much, but they showed that when you do this adaptive batch size scaling — I think they do some more — it also protects them from some adversarial examples.
B
So this goes up to a 16K batch size, and the same was used for this Sony paper, where they actually trained ImageNet in 224 seconds. But like I said, this is already outdated — this is from last year, I think, and now the record is something like 77 seconds. They try to beat each other on training time while maintaining accuracy — cranking the training time down by a lot.
B
So this is a paper by OpenAI, I think, and they show what I said before: a relationship between the gradient noise and the critical batch size. It basically tells you how the critical batch size is correlated to the gradient noise scale, and it looks pretty linear along that line. And you see that, for example, the reinforcement-learning ones, like Space Invaders or Dota, are pretty far up here, so you can use huge batches for training these.
B
Yeah, so I'm almost through my time here, because I want to get to this other part as well. So this is the training time in hours versus the compute cost: if you want to cut down on training time, you have to invest more compute, and this is basically the frontier where you maintain your accuracy.
B
You have to pay a little bit more in compute, but you get a huge reduction in training time — by an order of magnitude. But once you hit this point, you are basically at the point of diminishing returns: you have to increase your compute budget by a lot just to get a small reduction in training time.
B
So you have to look at these curves — but first you have to map them out, right? Maybe, though, you have a model which is similar to a model for which these curves already exist, and then you can think about what kind of parallelism is reasonable. Okay, so there are other things I wanted to talk about briefly, like batch normalization. As you know, this is where you take an input batch, subtract the mean, divide out the variance, and scale it —
B
— so you basically do an affine transformation on it. It has been shown to help, for whatever reason: I think the initial paper said that it reduces the internal covariate shift, whatever that is; another paper has since shown that this is actually not the case, and that it instead improves the mathematical conditioning of the loss function, which makes it much smoother and easier to train. Still, the point stands: batch normalization decreases the training time and improves your robustness.
B
For example, you can initialize your model with different initialization schemes and still get a good accuracy at the end, and it also improves generalization somewhat — I think this is undisputed. The issue is that in the distributed setting, you technically have to reduce all these tensors: you have to compute these quantities, in theory, over the whole global batch, and that is quite bad — you need a lot of communication overhead for that, especially in the forward pass, and these tensors can be of the size of X.
B
When you look into the paper I reference here, they have some weird update algorithm for how to update these parameters on the fly, but I think it's not correct and there are some typos. So I would just take the global averages of these quantities, and that should be fine. That is one thing you can try, and it seems to work well if your local batch size is big enough — eight or sixteen or something; of course, a local batch size of one doesn't help. The other technique is weight normalization.
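Taking the global averages amounts to this (a toy sketch of mine: local sums plus one simulated all-reduce reproduce the statistics of the full global batch):

```python
# Toy sketch of synchronized batch norm statistics: each worker computes
# local sums and sums of squares over its chunk of the batch, one
# all-reduce adds them up, and every worker derives the same global mean
# and variance -- equivalent to normalizing over the whole global batch.

chunks = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # global batch split over 2 workers

# local, communication-free partial statistics
partial = [(sum(c), sum(x * x for x in c), len(c)) for c in chunks]

# "all-reduce": element-wise sum of the partial (sum, sum_sq, count) triples
s, sq, n = (sum(t[i] for t in partial) for i in range(3))

mean = s / n
var = sq / n - mean ** 2                       # E[x^2] - E[x]^2

flat = [x for c in chunks for x in c]
assert mean == sum(flat) / len(flat)
print(mean, var)                               # global batch mean and variance
```

One all-reduce of three scalars per normalized tensor is much cheaper than gathering the activations themselves, which is why this is the usual trick.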
B
In that case, you work on the weights directly, in the sense that — I mean, it looks like a weird trick, but it works — you split the weights into a direction and a scale. So this is a multi-dimensional direction vector and a scalar scale; it's just a reparametrization, but the trick is that you update the gradients with respect to the scale and the direction separately.
B
So you compute these gradients, and the idea — when you look at it in a different way; I don't want to do the math here — is that the weight-direction updates are approximately orthogonal to the dominant eigenvectors of the gradient covariance matrix. So basically you don't fall into the trap of stepping along that vector; you go perpendicular to it, so you get a much smoother convergence in that sense.
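A toy sketch of the reparametrization and its two gradients (my illustration of the standard weight-normalization formulas, applied to a simple dot-product layer); the final assertion shows the direction gradient is orthogonal to v itself:

```python
# Toy sketch of the weight-normalization reparametrization: the weight
# vector is written as w = g * v / ||v||, a scalar scale g times a
# direction v, and the two get separate gradients via the chain rule
# (shown for a simple dot-product "layer" y = w . x).

import math

def weightnorm_forward(g, v, x):
    norm = math.sqrt(sum(vi * vi for vi in v))
    w = [g * vi / norm for vi in v]
    return sum(wi * xi for wi, xi in zip(w, x)), norm

g, v, x = 2.0, [3.0, 4.0], [1.0, 1.0]
y, norm = weightnorm_forward(g, v, x)

dy = 1.0                                   # upstream gradient on y
dw = [dy * xi for xi in x]                 # gradient w.r.t. the implicit w
dg = sum(dwi * vi / norm for dwi, vi in zip(dw, v))                  # scale grad
dv = [g / norm * (dwi - dg * vi / norm) for dwi, vi in zip(dw, v)]   # direction grad

# the direction gradient is orthogonal to the direction vector v
assert abs(sum(dvi * vi for dvi, vi in zip(dv, v))) < 1e-9
print(y, dg, dv)
```

That built-in orthogonality of the direction update is the geometric property the speaker refers to.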
B
To summarize: this was a bit more behind-the-scenes of what's going on. I talked about communication complexity, how you can parallelize networks — model parallelism and data parallelism, layer pipelining — what you can do when you do not converge well with large batches and, unfortunately, need to choose hyperparameters for that, and how you can use batch norm or similar accuracy-enhancement techniques even at large scale without impacting your communication a lot. So, are there any questions? Then I have some suggested reading you can find on the slides.