From YouTube: Scaling Deep Learning Training - Mustafa Mustafa
Description
SC20 Deep Learning at Scale Tutorial
https://github.com/NERSC/sc20-dl-tutorial/
Hi, I'm Mustafa. I'm a deep learning engineer at NERSC. In this part of the tutorial, I'll cover some concepts in distributed training of deep learning models. After this, we will go to the demo, where we show you how to scale a particular model for a particular problem.
In this talk, I'll cover some training parallelization strategies, then delve deeper into large batch training, the challenges that come with it, and some ways to avoid those problems.
So first, why do we need to parallelize deep learning training at all? The figure on the left shows responses from our users on how long it takes them to train their deep learning models. As you can see, it sometimes takes hours, days, or weeks to train a model, and that is definitely challenging when you're prototyping a new model to solve a new problem. Imagine that every time you need to compile your code, it takes days or weeks to compile and then test it.
That's definitely not an efficient way to develop any code or model, so we need to reduce this time to a reasonable span. Minutes would be ideal; hours are tolerable, sometimes even days, but the usual case definitely should not be days or weeks. The other challenge is that to train the models we're training right now, these large deep learning models, you typically need larger and larger datasets, especially in science.
The other thing with deep learning is that solving complex problems requires bigger models: the more complex the problem, the bigger the model you need to solve it with deep learning. And we have seen an increase in model sizes, especially in depth, for the models deployed to solve vision tasks, for example; the same is true for NLP tasks, and it's the same for scientific problems.
On the right, you can see the computational needs for training some of the major models now on the market, plotted versus year; this was compiled by OpenAI. The increase in the curve is exponential and very steep, so we need to be able to use computational resources efficiently, in parallel, so that we can reduce the time it takes to train one of these models to solve each problem.
So how do we actually parallelize training of a deep learning model? There are different modes. The first one is data parallelism: imagine that you have a model that trains well on a single GPU, for example, or more generally a single worker, which can be a CPU or a GPU.
Other modes of parallelism come in when, for example, your model is so large that it no longer fits on a single GPU or a single worker. In that case, you need to distribute the model itself amongst multiple GPUs, and there are several ways of doing that. One way is layer-wise parallelism, where you take one layer and split it.
A
Let's
say
like
this
is
the
first
layer
that
was
here
like
a
big
square,
a
big
cube
layer
and
you
split
that
layer
itself
amongst
multiple
workers
yeah.
The
cube
here
is
just
the
activations,
but
yeah,
so
essentially
that
the
layer
is
is
split
amongst
multiple
workers,
and
this
way
you
can
split,
you
know
parallel,
you
distribute
the
entire
model
amongst
multiple
gpus.
You
can
do
this
the
same
for
for
the
other
layers
as
well.
Then there is pipeline parallelism, where every layer, or every few layers, sits on a particular GPU or worker, and this creates a pipeline, which is why we call it pipeline parallelism. The most common mode is data parallelism, but model parallelism in general, whether layer-wise or pipelining, is becoming increasingly important as our models become larger and larger, so you will see more of it in practice.
So now, how do you do data parallelism? As I mentioned, we replicate the model itself and then split the data amongst multiple workers. But what do you do beyond that? In deep learning we train using SGD, that is, gradient descent with backpropagation.
A
You
don't
do
anything
differently
from
what
you
do.
Usually
there's,
no
communication
whatsoever.
You
just
take
your
model
here,
it's
just
schematically
as
a
single
matrix.
You
replicate
it
on
the
multiple
workers,
p01
zero
one
two
and
then
you
take
the
data
itself
and
you
split
you
slice
the
data
and
then
each
worker
takes
a
slice
of
the
data
and
produces
its
own
output,
and
that's
it
that
finishes
the
forward
pass.
Each worker does this locally, on its own GPU or CPU. Then, once the gradients are calculated locally, you need to do an all-reduce over those gradients before you update the local weights. That's actually the only place where you do communication during data-parallel training. This is how it works.
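To make that step concrete, here is a minimal sketch of the gradient all-reduce, assuming PyTorch with torch.distributed already initialized (the helper name is mine, not from the talk):

```python
# Average gradients across all data-parallel workers, called after loss.backward().
import torch.distributed as dist

def average_gradients(model):
    """All-reduce each parameter's gradient, then divide by the world size."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum across workers
            p.grad /= world_size                           # sum -> mean
```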
There are pros and cons to this. One of the pros is that the forward pass is completely local and communication only happens during the backward pass. And since backprop proceeds layer by layer, there are a lot of opportunities to overlap the communication with the computation. For example, you calculate the gradients for the last layer, then you start calculating the gradients for the penultimate layer locally, and while you're doing that, you can already be doing the all-reduce over the gradients of the last layer. So you overlap the communication with the computation. This is very important; it's essentially what enables us to scale data-parallel training to a very large number of GPUs.
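For reference, this overlap is what frameworks give you for free. Here is a minimal sketch with PyTorch's DistributedDataParallel, which buckets gradients and overlaps their all-reduce with the rest of the backward pass (the toy model, random data, and torchrun-style launch are assumptions for the example, not part of the talk):

```python
# Data-parallel training sketch with PyTorch DDP: one process per GPU,
# launched e.g. with torchrun. The linear model and random data are toys.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")               # NCCL backend for GPUs
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(128, 10).cuda())          # replicate weights per worker
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(64, 128).cuda()                       # each rank holds its own slice
y = torch.randint(0, 10, (64,)).cuda()

loss = torch.nn.functional.cross_entropy(model(x), y) # purely local forward pass
loss.backward()      # gradient all-reduce overlaps with backprop here
optimizer.step()     # identical averaged update on every worker
```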
Now, one of the cons of data-parallel training is that, essentially, you need to increase the batch size. You can do strong scaling, where you take one batch and split it amongst multiple workers so the global batch size doesn't change, and you can certainly do that, but it has diminishing returns: you can't increase the number of workers beyond certain limits.
First, there's the hard limit of a local batch size of one, but also, at a certain point the local computation becomes so small that the communication becomes the dominant overhead, and then you're essentially not reaping any benefits from parallelization. So instead you do weak scaling, where you increase the batch size.
A
I'll
show
up
some
schematics
a
little
bit
in
a
little
bit,
but
essentially,
if
you're,
initially,
you
have
a
say,
you're
running
on
a
single
worker
with
a
batch
size
of
64
and
you
decide
to
do
10
workers,
then
your
batch
size
becomes
640
and
this
is
weak
scaling.
And
this
achieves
you
pass
through
the
data
much
faster.
However,
there
are
challenges
with
training
with
large
batch
with
large
batches
and
we'll
talk
about
those
in
a
little
bit.
Another thing to note: I implied that when we do the all-reduce, say amongst eight workers, it is a synchronous all-reduce. That means you need to wait for all the workers to finish their local gradient calculations before you can make an update, which in turn means that when you go to very large clusters of workers, you may have more and more stragglers, and those can block the training.
Okay, so let's talk a little bit about large batch training. Just to recap: we are increasing the number of workers, and if we're doing weak scaling, each worker gets its own fresh batch of the same size. So if we go to N workers, the effective batch size becomes N times B.
This, of course, lets you process the data much faster, as we said, but there are challenges in how you tune the SGD parameters to account for this increase in batch size. So let's first remember how SGD works. This is the plain version of SGD: you have your weights, a set of parameters, and you're trying to minimize the loss over your data, or rather over your batch.
You take one step in the direction opposite to the gradient: the gradient points in the direction that increases the loss, and you want to walk in the direction that decreases it. You average the gradients over the batch and multiply by a step size, or what we call the learning rate.
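Written out (my reconstruction of the update rule described here, in standard notation):

\[
w_{t+1} \;=\; w_t \;-\; \frac{\eta}{|B_t|} \sum_{x \in B_t} \nabla_w L(x, w_t)
\]

where \(w_t\) are the weights at step \(t\), \(B_t\) is the mini-batch, and \(\eta\) is the learning rate.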
Now, if you decide to increase the batch size, what do you do to the learning rate? That's the question we need to answer. One way to think about it: if I take three steps with batch size B, and I instead increase the batch size by a factor of three, then maybe I should also increase the learning rate by a factor of three. That's one way to do it, which is linear scaling, and this is how it looks in equations.
You take the update and increase the batch size by a factor of two. Comparing these two equations, B was multiplied by two, so you linearly scale the learning rate so that the total step has the same scale. That's linear scaling.
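In symbols (a sketch of the argument, following the usual derivation of the linear scaling rule): two consecutive steps on small batches are approximately one step on the combined batch with twice the learning rate,

\[
w_{t+2} \;=\; w_t - \eta \left[ g(B_1; w_t) + g(B_2; w_{t+1}) \right] \;\approx\; w_t - 2\eta \, g(B_1 \cup B_2; w_t),
\]

where \(g(B; w)\) denotes the batch-averaged gradient.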
One assumption here is that the gradient at w_t and the gradient at w_{t+1} are close enough to each other that you can make this comparison and this reduction. That assumption sometimes breaks, as we will see in a little bit, but generally this is the intuition behind linear scaling: you scale the learning rate in such a way that this factor stays constant.
Actually, it's not necessarily important to keep that factor constant; what is important is to keep the noise in the gradient about the same. If you look at the noise of the gradients, that is, the covariance matrix of the gradient estimate, you'd see that the diagonal, for example, is proportional to eta squared divided by B, the learning rate squared over the batch size. If you want to keep this gradient noise scale constant, or approximately the same, then when you go to N times B you need to scale the learning rate by the square root of N: squaring sqrt(N) times eta gives N times eta squared, and that factor of N cancels against the N in the batch size. So that's another way of doing it, square-root scaling.
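In symbols (my reconstruction of that argument): if the noise scale goes as \(\eta^2 / B\), then

\[
B \to NB, \qquad \eta \to \sqrt{N}\,\eta \quad\Longrightarrow\quad \frac{(\sqrt{N}\,\eta)^2}{NB} \;=\; \frac{\eta^2}{B},
\]

so the gradient noise stays the same under square-root scaling.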
Now, in practice, we actually see anywhere from sub-square-root scaling of the learning rate to linear scaling. For example, these are two works where people scale ResNet training: in one of them it's square-root scaling, while in Goyal et al., one of the seminal papers here, they do linear scaling.
We'll talk a little bit more about this in a second. There's also a study by OpenAI, actually no longer recent, where they look at this purely from an optimization perspective, not a generalization perspective: the picture of doing local optimization, asking what the best learning rate is as you increase the batch size, in order to minimize the loss function. They see that the scaling actually depends on the batch size: the optimal learning rate depends on the batch size, and when your batch size is small, linear scaling may make more sense.
But when the batch size is very large, the scaling of the learning rate might be closer to a square-root sort of regime. That's at least some theoretical analysis; it comes with a lot of caveats, but it motivates these different scalings of the learning rate.
So, coming back to the challenges with scaling the learning rate: say we decide to train with multiple workers and we have a much larger batch size. One of the challenges is this: suppose you scale the learning rate linearly, and you were training on a single GPU and all of a sudden
you want to train with 100 GPUs. If you multiply the learning rate by 100, then the assumption that the gradient at w_t is very close to the gradient at w_{t+1} breaks; that's no longer the case, especially at the very beginning of training, when the loss surface is still not very smooth and you're starting from random weights.
The loss surface is not very smooth, and if you scale the learning rate by a factor of 100, you're taking very large steps on a surface that is very unsmooth. This essentially makes the training completely unstable, and everything goes haywire.
In a second I'll talk about how to get around this. But first, another issue with training with a large batch size: this is an example of training with a batch size of 512 compared to training with a batch size of eight thousand, and it seems that the large-batch models don't generalize well. So you have something called the generalization gap.
This is different from the generalization gap we usually talk about, the difference between the training loss and the validation loss; this is a generalization gap between training at different batch sizes. It seems that training with a larger batch doesn't achieve the same generalization that you would get from a smaller batch size, and there are motivations for why that is the case.
One of them is that, essentially, the minima you find when you're training with a large batch size tend to be sharp minima, like the one sketched here. With these very large batch sizes the noise in the gradients is much smaller, and that lets you just drop into the nearest sharp minimum, without enough noise to kick you back out of it.
So this is the intuition behind it. People have done a lot of studies of this; you can see, for example, the paper by Yao et al. showing this effect.
So how do you get around these issues: first the instability at the beginning of training, and then this generalization gap? One of the first works to show that ResNet training can be scaled to a very large batch size, in this case eight thousand, used the idea of learning rate warm-up; I'm not sure whether they introduced it, but they used it.
With learning rate warm-up, instead of immediately starting with your target learning rate, say 10 times your original learning rate if you're scaling by a factor of 10 with linear scaling, you ramp the learning rate up over a few epochs first. If you started directly at the target rate, then, as we said, the loss surface is not yet smooth and you get a lot of instabilities; warming up avoids that, and that's what they do.
So they warm up the learning rate from the original learning rate all the way to their target learning rate over five epochs. The other thing they did was show that linear scaling seems to work for this particular problem. The paper also goes through a few other subtleties that are common in implementations of distributed training.
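As an illustration (a minimal sketch, not the paper's exact recipe; the model, base rate, and scaling factor are placeholder values): a linear warm-up from the base learning rate to the linearly scaled target over the first five epochs, using PyTorch's LambdaLR.

```python
# Linear learning-rate warm-up sketch (PyTorch). Hypothetical values:
# base_lr is the single-GPU rate, n_workers the data-parallel scaling factor.
import torch

base_lr, n_workers, warmup_epochs = 0.1, 10, 5
target_lr = base_lr * n_workers  # linear scaling rule

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=target_lr)

def warmup_factor(epoch: int) -> float:
    """Ramp the LR multiplier from base_lr/target_lr up to 1.0 over warmup_epochs."""
    if epoch >= warmup_epochs:
        return 1.0
    start = base_lr / target_lr
    return start + (1.0 - start) * epoch / warmup_epochs

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

for epoch in range(90):
    # ... run one training epoch here ...
    scheduler.step()  # sets lr = target_lr * warmup_factor(epoch + 1)
```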
After they fixed this, they showed that they essentially close the generalization gap between, for example, a batch size of 256 and a batch size of 8,000. This seems to work for different problems, but it doesn't necessarily work at all scales: as you can see in the original paper, it works up to a batch size of 8,000, but if you want to go to a batch size of, say, 32,000 or larger, it no longer necessarily holds.
Another idea is, instead of increasing the learning rate, to gradually increase the batch size itself.
The basic idea is that at the beginning of training you're still in a region with a lot of sharp minima, and in that region you use a smaller batch size; then, as training progresses, you start increasing the batch size while keeping the learning rate fixed, and that should get you to areas where you get similar performance to training with a single small batch size.
This idea is related to learning rate decay: we usually decay the learning rate gradually as we train, and the proposal here is that instead of decaying the learning rate, you can increase the batch size. In this work, they take that idea and combine it with another one, based on empirical studies showing that the loss surface is
less flat when your batch size is larger. This means the curvature of the loss surface can be a good indicator that now is a good time to increase the batch size during training. So they combine these two ideas: they introduce a measure of the loss surface curvature and use it to adaptively, automatically increase the batch size while training, and they show that this works.
They show that this works really well: instead of predetermining the points where you increase the batch size, like the points marked here, you can let your calculation of the loss surface curvature determine when it's a good point to increase the batch size.
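Here is a minimal sketch of the simpler, predetermined variant of this idea (the milestone epochs, growth factor, and toy dataset are hypothetical choices of mine, not from the talk): grow the DataLoader batch size at fixed epochs while the learning rate stays fixed.

```python
# Batch-size schedule sketch: grow the batch instead of decaying the LR.
# Milestone epochs and growth factor are hypothetical illustration values.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10000, 128), torch.randint(0, 10, (10000,)))
milestones, factor, batch_size = {30, 60, 80}, 2, 64

for epoch in range(90):
    if epoch in milestones:
        batch_size *= factor  # plays the role a LR-decay step normally would
    # Rebuild the loader so this epoch iterates with the new batch size.
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for x, y in loader:
        pass  # forward/backward/update with a fixed learning rate here
```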
There are multiple other innovations for handling these large-batch training problems; here's one paper, and here's another. This one, for example, trains a ResNet-50 in 74 seconds, compared to, I think, the paper where this started, where training with a batch size of 256 took,
if I remember correctly, about 10 days; now we're talking about 74 seconds to train the same network on the same dataset, ImageNet. So I didn't cover everything that could be said about large batch training; what I covered are the basic concepts: what the challenges are, and what you need to do and think about.
In the demo, we're just going to show you how to scale the training from a single GPU to four or eight GPUs, and at this level of scale you don't necessarily need techniques like LARS or LARC; we also skip them for shortness of time. Before I close this talk, I want to mention some works by OpenAI and Google: investigations essentially
trying to understand the relationship between the batch size and performance, or between the batch size and the other parameters in play, like the learning rate and the gradient noise. I'm not going to get into the details now, but I just want to point out that they essentially find a relationship between the gradient noise and a critical batch size.
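For reference (my paraphrase of the OpenAI noise-scale result; the notation comes from their paper, not the talk): they estimate the critical batch size with a simple noise scale, roughly

\[
B_{\text{crit}} \;\approx\; B_{\text{simple}} \;=\; \frac{\operatorname{tr}(\Sigma)}{|G|^2},
\]

the trace of the per-example gradient covariance over the squared norm of the true gradient, so noisier gradients imply a larger useful batch.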
Going beyond that batch size hits a point of diminishing returns, so you shouldn't be training with batches larger than that. The other thing they notice is that the more challenging and complex the problem, the larger this intrinsic property, the gradient noise, that you get from your data;
the more complex the dataset itself, the larger this gradient noise, and therefore the larger the critical batch size that you can actually use. The reason I'm mentioning this is that it's an important idea: we are hoping that deep learning will be able to solve more and more complex problems, especially in science, and from these studies
we understand that for those more complex problems the outlook is actually promising, because we can use larger batch sizes to train those models, which means we can train them faster. So I see this as great news. I encourage you to look at these papers and also at their blog post. So, to wrap up before we move to the demo:
we talked about distributed training and the different strategies for it, we focused on data parallelism, and then we talked about large batch training and how it can be unstable and also fail to generalize well. We talked about scaling to modest regimes, say by a factor of 10, from a single worker or a single GPU to 10 GPUs.
In that setting, the first thing I would try is learning rate warm-up with linear or sub-linear learning rate scaling. That is the regime where you would want to try this; for these modest scales, these techniques seem to work.