From YouTube: Deep Learning at Scale on Perlmutter
Description
Part of Data Day 2022, October 26-27, 2022.
Please see https://www.nersc.gov/users/training/data-day/data-day-2022/ for the training agenda and presentation slides.
Steve Farrell: I'm a machine learning engineer, one of a couple of them in the Data and Analytics Services Group at NERSC. Broadly speaking, I support machine learning workloads on our NERSC supercomputers, and of course there are a lot of things included in that, some of which I'll talk about today.
So the title is Deep Learning at Scale on Perlmutter. I'm going to talk a bit about our offerings at NERSC, the kinds of things we do to support and enable cutting-edge AI or deep learning for science. I'm also going to include some material from a tutorial that we do regularly at Supercomputing and some other places with the exact same name, Deep Learning at Scale. And I have some fairly fresh plots from our machine learning at NERSC survey, which are nice for illustrating what the community is doing and how we think about supporting them. I'm not going to say too much about deep learning or AI methods in an introductory sense.
A
I
I
do
have
some
links
to
other
resources,
Outreach
events
that
we've
done
in
case,
that's
of
interest
to
you
I,
will
really
only
kind
of
touch
on
things
that
are
most
relevant
to
what
I'm
going
to
cover
for
how
you
deploy,
workloads
and
deploy
them
at
scale.
So
I'm
also
happy
to
answer
any
questions
that
come
up
of
course,
but,
as
we
probably
all
are
acutely
aware,
AI
is
is
really
kind
of
taking
over
the
world
in
a
lot
of
ways.
A
It's
it's
certainly
transforming
science
and
and
shows
a
lot
of
capability
to
to
keep
transforming
science,
AI
or
machine
learning
or
deep
learning.
I
may
use
these
interchangeably,
but
the
title
is
deep:
learning
I'm,
mainly
focusing
on
on
deep
learning
methods,
because
that
that's
the
kind
of
methodologies
in
AI
that
that
have
really
been
dominant
these
days,
but
these
are
powerful
capabilities
for
scientific
workflows.
A
Just
a
few
bullets
here
to
give
you
a
flavor
of
the
kinds
of
things
that
people
are
doing,
it's
not
exhaustive,
but
people
are
using
these
methods
to
help
with
analysis
of
large
data
sets.
Maybe
data
sets
that
traditionally
require
more
like
hand.
A
Labeling,
maybe
you
don't
have
an
analytical
way
of
doing
your
analysis,
but
now
you
can
automate
it
with
machine
learning
or
ways
where
you
had
traditional
approaches
to
analyze
that
data,
but
you
know
maybe
they're,
based
on
some
sort
of
assumptions
or
simplifications
and
machine
learning
methods
are
able
to
get
more
out
of
your
data.
A
Another
area,
that's
pretty
relevant
for
the
HPC
space,
is
acceleration
of
expensive
simulations.
Of
course,
the
dominant
types
of
workloads
on
HPC
still
today
are
these
in
a
large,
large-scale
simulation
workflows
and
a
lot
of
these
science
domains
are
really
Limited
in
the
kinds
of
science
they
can
do
by
how
expensive
those
simulations
are.
A
They
cannot
simulate
systems
large
enough
or
enough
systems
in
order
to
have
a
good
estimate
for
things
they're
trying
to
compute,
and
so
there's
a
lot
of
excitement
and
a
lot
of
work
going
on
in
in
trying
to
replace
either
simulations
completely
or
some
of
the
calculations
that
happen
in
simulations
with
faster
AI
methods.
A
So
science,
of
course,
in
the
doe
as
well,
are
very
enthusiastic
about
this.
There's
a
lot
of
research
going
on
a
lot
of
R
D.
The
landscape
is
evolving
rapidly
and
partially.
That's
because
it's
evolving
rapidly
elsewhere
too,
in
industry
and
stuff
like
that,
but
yeah
the
doe
has
been
taking
notice
and
you
know,
as
the
EC
is.
The
exascale
Computing
project
is
winding
down.
A
There's
some
anticipation,
hopefully
for
a
future
similar
scale
project
on
AI
for
science
and
while
the
things
are
still
in
some
sense,
new
AI
for
Science
and
rapidly
evolving.
Still,
we
do
see
that
some
areas
are
starting
to
move
into
maturity,
which
is
pretty
cool
to
see
and
these
workloads
increasingly,
they
need
large
comput
computational
resources,
even
in
the
cases
where
they're
replacing
very
expensive
simulations
still,
these
can
can
need
a
bit
of
compute
so
especially
as
it's
maturing
we're
we're
looking
at
folks
tackling
probably
like
larger
problems.
A
Larger
data
sets
because
these
methods
tend
to
be
more
powerful
with
larger
data
sets
so
they're,
looking
at
more
complex
problems,
they're
using
larger
models
to
get
even
better
results,
so
everything's
kind
of
growing
in
size
and
complexity,
which
means
the
computational
costs
grow,
and
you
know
we're
looking
at
basically
that
HP
centers
may
be
like
the
really
a
key
role.
A
This
is
like
a
very
broad
overview
of
how
we
articulate
our
AI
strategy
at
nurse.
So
how
do
we
support?
You
know
this
new
emerging
way
of
doing
science?
First,
we
try
to
deploy
optimize
hardware
and
software
systems.
We
also
work
with
Scientists
to
apply
AI
on
across
different
domains.
We
we
try
to
keep
up
on
The,
Cutting,
Edge
methods
and
tools.
We
have
ways
of
engaging
basically
different
research
groups
and
try
to
do
some
ourselves,
but
we
also
try
to
really
educate
and
empower
the
community
as
well.
A
So
we
do
a
lot
of
Outreach
seminars,
workshops,
training
events
like
this
one,
even
things
that
we
call
schools.
On the hardware side, you've already heard about Perlmutter, so I'm not going to say everything about it, but we call it a scientific AI supercomputer, though maybe not everybody else was calling it that at the time.
On the software side, we try to find a good balance between providing users with things that are well optimized for our systems, while also letting people have the flexibility they need to bring their own software environments and their own setups.
So we build optimized modules for the most popular frameworks. There are the usual Anaconda Python ones (we heard about Python earlier), but we also build and deploy PyTorch and TensorFlow with recommended libraries and backends for running on our systems. We also heavily support containers, particularly the optimized NGC deep learning containers from NVIDIA. Of course, we run things via Shifter, as we heard about earlier, and eventually Podman. Users can also bring their own images, or customize on top of our images or the NVIDIA images, and so on.
Of course, it's also fully possible for folks to just build their own conda environments and set up their machine learning software that way.
But it's not just about the frameworks; there's also a whole ecosystem that's growing, also rapidly evolving. There are a lot of other things that users doing machine learning like to use, for example hyperparameter optimization. This is where you're trying to train a model but you don't really know all of its settings up front, like the number of layers or the learning rate, so you search over them; a minimal sketch of the idea follows below.
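As an illustration of the general idea (not a NERSC-specific tool), here is a minimal random-search sketch in PyTorch; the toy model, the search ranges, and the train_and_evaluate helper are all hypothetical stand-ins:

```python
import random
import torch
import torch.nn as nn

def build_model(num_layers: int, hidden_dim: int) -> nn.Module:
    # Stack a configurable number of hidden layers (hypothetical toy model).
    layers, in_dim = [], 32
    for _ in range(num_layers):
        layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
        in_dim = hidden_dim
    layers.append(nn.Linear(in_dim, 1))
    return nn.Sequential(*layers)

def train_and_evaluate(model: nn.Module, lr: float) -> float:
    # Placeholder objective: one gradient step on random data, then report loss.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    x, y = torch.randn(64, 32), torch.randn(64, 1)
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    return nn.functional.mse_loss(model(x), y).item()

# Random search over learning rate and number of layers.
best = None
for trial in range(10):
    lr = 10 ** random.uniform(-4, -1)   # sample the learning rate log-uniformly
    num_layers = random.randint(1, 4)
    score = train_and_evaluate(build_model(num_layers, 64), lr)
    if best is None or score < best[0]:
        best = (score, lr, num_layers)
print("best (loss, lr, num_layers):", best)
```

In practice, dedicated tools run many such trials in parallel across nodes and prune unpromising ones early.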
A
Jupiter
is
a
very
popular
service
at
nurse,
something
like
over
2
000
nurse
users
are
somewhat
regularly
using
Jupiter
and
the
Machine
learning
users
also
like
to
develop
their
things
in
Jupiter
a
lot.
So,
of
course,
we
support
that
we
provide
kernels
and
users
can
have
their
own
kernels
for
profiling
and
visualization.
We
recommend
Nvidia
profiling
tools,
but
people
like
to
use
tensorboard.
A
We
have
a
nice
way
of
of
launching
tensorboard
from
Jupiter
Hub
and
we
use
weights
and
biases
a
lot
and
encourage
folks
to
to
try
that
it's
a
great
way
to
log
experiments
and
also
to
do
hyper
parameter
optimization.
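As a hedged illustration (the project name, config values, and metric are made up), a minimal Weights & Biases experiment-logging loop might look like this:

```python
import wandb

# Hypothetical project and config values, for illustration only.
run = wandb.init(project="my-science-project", config={"lr": 1e-3, "epochs": 3})

for epoch in range(run.config.epochs):
    train_loss = 1.0 / (epoch + 1)  # stand-in for a real training loss
    wandb.log({"epoch": epoch, "train_loss": train_loss})  # logged to the dashboard

run.finish()
```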
A
So
we
we
do
see
the
the
AI
workload
is
is
growing
at
nurse.
It's
still
a
small
piece
of
the
pie,
but
we
anticipate
it
just
to
keep
taking
off
as
as
time
goes
on,
we
do
track
the
machine
learning
software
usage
to
some
extent.
This
is
not
all
fully
functional
on
Pro
Mudder.
Yet,
unfortunately,
because
it'd
be
really
nice
to
see
the
kind
of
uptake
we,
we
have
right
now
with
GPU
system,
but
we
we
generally,
we
can
track
things
like
module
loads
and
python
Imports.
A
That
might
have
been
mentioned
earlier
earlier
today
on
how
that
mechanism
Works.
But
we
have
some
data
here
that
goes
back
from
2017
and
we
can
see.
There's
you
know
a
pretty
steady
increase
in
the
the
number
of
users
there
more
than
six
times
grow
from
2018
to
2021.
A
We
also
track
Trends
and
engage
with
the
community
in
in
this
machine
learning
at
nurse
survey,
you
may
have
seen
some
emails
from
us
earlier
this
year.
We're
we're
still.
We
still
have
this
one
for
this
year.
Open
and
I
encourage
everybody
to
help
us
out
by
filling
out
that
survey
and
telling
us
about
your.
You
know
what
you're
doing
with
machine
learning
and
what
you
need,
but
the
survey
targets
the
communities
we
use
the
nurse
user
list
and
some
others.
A
So
it's
it's
it's
folks
that
they
may
not
all
be
using
nurse,
but
at
least
they're
like
potential
users
of
nurse
resources
and
definitely
doing
machine
learning
for
Science.
And
we
ask
things
about
you
know
the
kind
of
problems
they're
doing
the
kind
of
models
they
use,
the
the
kinds
of
compute
resources
they
need,
oh,
and
how
happy
they
are
with
the
things
we
have
in
Earth.
A
So
these
are
some
preliminary
results
from
our
survey
this
year,
I'm
not
going
to
spend
too
much
time
because
I
want
to
be
able
to
get
into
the
interesting
things
later
on
in
the
talk.
But
I
wanted
to
kind
of
show
these,
because
it
there
are
some
nice
and
usually
in
useful
insights
into
what
the
scientific
communities
are
doing
these
days.
A
So
we
asked
you
know
the
kinds
of
ways
that
machine
learning
fits
in
their
in
our
in
the
community's
workflows
and
we
see
most
people
are
are
in
this
first
category,
at
least
in
the
respondents
right
which,
which
could
still
be
a
biased
sampling
of
the
real
communities.
But
most
people
are
are
in
this
mode,
where
they're
doing
machine
learning
for
offline
data
analysis.
A
But
we
do
see
the
second
biggest
one
here
is
and
in
the
third
actually
are
related
to
combining
machine
learning
with
simulation
which
are
cool
to
see,
and
we
see
folks
wanting
to
do
machine
learning
for
more
real-time
or
online
data
analysis
and
a
little
bit
here
of
folks.
Looking
at
controlling
scientific
instruments,
a
lot
of
people
are
still
using
convolutional
neural
networks,
which
is
not
too
surprising
but
I.
Think
in
one
of
our
earlier
surveys
traditional
ml
was
kind
of
a
dominant
one.
A
So
now
we
see
some
turning
point.
I
think
where
also
down
here
on
the
left
more
folks
are
using
pytorch
than
at
least
claim
to
be
using
scikit-learn.
That
was
definitely
flapped
before
or
it
was
psychic
learn
then
tensorflow
and
pytorch.
So
you
can.
We
can
see
Trends
over
time,
which
is
pretty
cool,
yeah,
I,
don't
know
what
else
I
need
to
call
out
here,
but
you
can.
You
can
take
a
look
at
these
offline.
Of
course
those
lines
will
be
shared.
A
We
ask
about
the
scale
of
resources
that
people
need.
So,
while
still
it
looks
like
the
bulk
of
respondents
have
problems
that
are
not
very
large.
They
can
train
models
on
a
single
GPU
in
hours.
Relatively
small
data
sets
tens
of
gigabytes,
maybe
single
device
or
single
node
kind
of
scale.
But
we
do
see
these
Tails
here,
where
we
need
to
try
and
think
about
how
we
support
those
users
that
you
know
it
takes
months
or
even
years,
apparently
to
train
models.
A
They
have
terabytes
of
data,
hundreds
of
terabytes
of
data
they
might
be
able
to
run
on
hundreds
or
thousands
of
gpus
and
use
various
forms
of
parallelism
in
training.
Their
models
also
come
back
to
the
the
forms
of
parallelism
here,
a
little
later.
Okay, I already sort of said this, but we do see folks with large problems and potentially a need for large-scale training. Not too much to say on this one other than some takeaways: about half of people say they like to use Jupyter notebooks to develop their models, so that's something we have to take into account, and a lot of people are still using CPUs, like on Cori Haswell, up here on the upper right.
A
It
looks
like
more
people
are
using
CPUs
for
inference
than
than
gpus,
which
is
a
bit
interesting,
but
again
could
be
about
a
little
bit
by
a
sampling,
because
it's
the
nurse
users,
a
lot
of
them,
know
Corey
a
little
bit
more
on
the
kinds
of
Outreach
that
we
do
so
that
empowerment
aspect
of
our
strategy
we
for
for
a
couple
years
in
a
row.
We
did
this
deep
learning
for
Science
school
in
2019.
It
was
an
in-person
event
week
long.
A
It
was
really
great
a
lot
of
great
speakers.
We
had
a
good
Hands-On
sessions
and
posters.
You
can
find
all
the
videos
and
content
there
on
the
web
in
2020
because
of
the
pandemic.
We
switched
to
a
webinar
Series,
so
every
week
there
would
be
a
speaker,
fewer
Hands-On
things,
but
still
some
some
code
examples
and
we
did
record
all
those
talks.
A
You
can
also
see
those
we
had
a
lot
more
introductory
stuff
in
2019
and
then
in
2020
it
started
to
get
we
featured
more,
not
quite
Advanced,
let's
say
more
advanced
scientific,
relevant
topics.
A
I
mentioned
that
we
do
this
deep
learning
at
scale.
Tutorial
we've
been
doing
that
quite
a
while,
or
at
least
since
2018,
at
pretty
much
every
Super
Computing,
it's
some
ISC
conferences
in
Europe
and
and
some
others
last
year
at
SC.
That
was
the
first
time
we
got
to
use
Pearl
mutter
for
this,
which
was
pretty
fun
while
we're
doing
it
again
this
year.
So
if
you're
going
to
SC
feel
free
to
check
it
out
and
I
link
to
the
full,
the
full
video
there.
A
We
also
posted
videos
because
we
were
pre-recording
videos
back
then
other
things
we've
been
doing
not
too
long
ago.
There
was
this
Nvidia
organized
AI
for
science
boot
camp
and
they
sort
of
did
in
collaboration
with
us
and
we
opened
it
up
to
users.
So
that
also
had
a
good
bit
of
introductory
stuff.
Sorry
for
the
slack
pinks
here-
and
you
can
I
think
view
slides
on
that
web
page,
and
then
we
do
things
like
the
new
user
training
events
regularly.
A
Day-To-Day
events
like
this
here
you
are,
and
probably
others
that
I
may
have
forgotten
about.
Okay,
so
now
I'll
switch
gears
a
little
bit
and
start
to
get
into
the
content
from
the
tutorial.
So
this
that
we
usually
do
like
a
full
day
tutorial,
so
obviously
I
can't
cover
a
lot.
But
this
is
to
give
you
a
little
flavor
and
cover
some
some
aspects
of
that
that
that
hopefully
you'll
find
useful
or
interesting,
and
maybe
you
can
follow
up
and
ask
questions
or
go
check
out
the
full
material
if
you're,
if
you're
interested.
A
But
you
know
the
real
theme
there
is,
how
do
we
optimize
deep
learning
workloads
on
HPC
and
particularly
for
them
to
run
at
Large?
Scale,
really
try
to
optimize
like
time
to
time
to
solution
right
for
scientists,
because
scientists
need
fast
and
efficient
methods.
They
need
this
to
enable
rapid
development
and
testing
of
their
ideas,
but
not
just
that.
They
may
also
really
need
optimized
machine
learning
workloads
to
fit
within
their
production
workloads
to
fit
whatever
computational
constraints.
There
may
be,
maybe
there's
an
experimental
instrument
like
the
Large
Hadron
Collider.
A
That
needs
to
be
able
to
very
quickly
make
decisions
about
what
data
to
ride
out
or
folks
are
maybe
trying
to
replace
part
of
a
simulation
with
a
machine
learning
model.
But
if
it's
not
fast,
then
you
didn't
really
save
anything,
but
also
as
a
center.
We
need
to
think
about
how
we
optimize
these
workloads
for
all
users,
so
that
overall,
the
the
throughput
of
nurse
in
terms
of
science
is
optimized.
A
So
if
you
can
make
effective
use
of
modern
HPC
systems
like
promoter,
this
can
greatly
accelerate
these
workflows
and
and
I
think
the
situation
is
getting
a
bit
easier
with
software
and
methods
and
stuff,
but
it
can
still
be
non-trivial.
A
So
there's
still
kind
of
a
need
for
for
this
sort
of
tutorial
content
falling
bit
behind
so
I'm
going
to
try
to
go
a
bit
fast
here,
but
hopefully
I'll
be
able
to
at
least
get
the
important
point
across
Point
points
across
and
and
folks
can
ask
questions
where
needed
so
yeah.
So
deep
learning
is
very
powerful
and
it's
it's.
A
It's
showing
a
lot
of
promise
on
a
lot
of
different
application
areas
but,
as
already
said,
it's
computationally
intensive,
especially
if
we
look
at
training,
so
training,
big,
deep
neural
network
models
and
and
again
as
we
look
at
more
complex
problems.
Larger
data
sets
larger
models
that
compute
crust
costs.
These
are
actually
growing
with
time.
A
This
is
an
open,
AI
plot,
it's
actually
a
bit
old
now
it
doesn't
show
all
the
latest
developments
with
language
models,
but
you
can
just
see
that
there's
this
exponential
growth
in
the
amount
of
compute
needed
to
train
popular
machine
learning
models
out
there.
So
what
do
we
do?
How
do
we
make
effective
use
of
HPC
for
this
in
the
tutorial?
We
break
it
up
into
these
sorts
of
categories.
So
first
we
look
at
optimizing
the
performance
of
a
training
workload
on
a
single
device,
because
there's
really
no
point
in
scaling.
A
If
you
can
just
get
a
lot
of
you
know,
it
makes
sense
to
First
Look
at
that
before
you
just
try
to
throw
hundreds
of
gpus
at
a
problem
right,
give
you
much
more
efficient
in
the
end
and
then
and
then
we
talk
about
Distributing,
the
training
across
multiple
gpus
and
multiple
nodes
on
our
systems,
and
then
we
talk
a
little
bit
about
optimizing
now
the
distributed
performance
at
scale.
I
won't
talk
really
at
all
about
the
third
one
here,
and
I
really
only
have
a
little
bit
on
the
first
one.
A
So
this
is
this
is
some
content
mostly
developed
by
the
Nvidia
folks
that
we
collaborate
with
in
that
tutorial?
This
slide
comes
from
from
one
of
our
our
tutorial
last
year,
but
yeah.
So
in
the
tutorial,
when
we
look
at
optimizing,
this
training
example
on
a
single
GPU.
A
We
use
Nvidia
Insight
systems
to
do
this,
which
is
really
a
pretty
powerful
tool
using
a
profiler
as
it
says
here,
it's
an
essential
step
in
optimizing
any
code
and
insight
systems
lets
you
view
a
nicely
organized
well,
debatably,
I,
guess,
I
think
you
have
to
get
used
to
it,
but
it
gives
you
a
nice
view
of
the
timeline
where
you
can
kind
of
look
at
what's
going
on,
and
our
tutorial
example
is
really
nice
actually
because
the
kinds
of
things
that
you
might
see
in
the
real
world
you
can
see.
A
In
that
example,
you
can
see
things
like
gaps
that
come
from
data
loading.
You
can
see
things
like
GPU
not
being
utilized
super
well,
because
there
are
a
lot
of
many
small
kernels
being
launched,
and
then
we
were
able
to
talk
about
the
ways
that
you
you
improve
on
that,
so
it
can
yeah.
It
can
basically
shed
light
on.
So the profiler can basically shed light on what's going on in your data pipeline and in the GPU scheduling of kernels, and you can annotate regions of your code with these NVTX ranges. All of that is covered in the tutorial, but a rough sketch of how you run Nsight Systems and annotate your code is shown below. Then there are the kinds of things that are important for optimization; these are really just lifted from the tutorial, and it's a nice example because all of them apply there and we get good speedups.
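For instance, you might launch the profiler with something like `nsys profile -o report python train.py` (flags vary by version), and mark the phases of a training step with NVTX ranges so they show up on the timeline. A minimal PyTorch sketch, where the region names are arbitrary illustrations:

```python
import torch

model = torch.nn.Linear(32, 1).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(10):
    x = torch.randn(64, 32, device="cuda")
    y = torch.randn(64, 1, device="cuda")

    # Each push/pop pair becomes a labeled region on the Nsight timeline.
    torch.cuda.nvtx.range_push("forward")
    loss = torch.nn.functional.mse_loss(model(x), y)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    opt.zero_grad()
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer")
    opt.step()
    torch.cuda.nvtx.range_pop()
```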
Data loading is a frequent cause of performance loss for users, even for experts. Really, it's basically the first thing to check. In the tutorial we talk about ways to parallelize your I/O, and then, to take it further from there, you can for example use NVIDIA's DALI library, which has a lot of nice features for deep learning data pipelines.
DALI has nice features that parallelize and cache data on the fly, and it can also do a lot of your data augmentations and pre-processing on the GPU. The little plot in the upper right shows, for our tutorial, the kinds of speedups we get from the various stages of optimization just in the data pipeline: parallelizing the I/O, caching things in memory, and then going to DALI. We get over 2x performance just from that; at least, I think that's the end-to-end speedup. A generic sketch of a parallelized input pipeline is below.
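The tutorial's own pipeline uses DALI; as a simpler, hedged illustration of the basic idea in PyTorch (the dataset here is a stand-in), multiple DataLoader worker processes let I/O overlap with GPU compute:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class RandomDataset(Dataset):
    # Stand-in dataset; in practice __getitem__ would read and decode files.
    def __len__(self):
        return 1024
    def __getitem__(self, idx):
        return torch.randn(32), torch.randn(1)

# num_workers spawns parallel loader processes; pin_memory speeds up
# host-to-GPU copies. Good settings are workload- and system-dependent.
loader = DataLoader(RandomDataset(), batch_size=64,
                    num_workers=4, pin_memory=True)

for x, y in loader:
    pass  # the training step would go here
```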
Mixed precision is often a very useful way to speed up training. It helps you leverage the tensor cores on modern GPUs, and it can reduce memory usage and so on. The frameworks now provide pretty nice capabilities for this: they make it pretty easy to do automatic mixed precision, where the framework uses FP16 where it can, and they give you features to avoid the numerical underflow issues that can come about, by automatically scaling the gradients in the computations that risk numerical issues. A minimal sketch is below.
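In PyTorch, for example, automatic mixed precision with gradient scaling takes only a few lines (a minimal sketch, not the tutorial's exact code):

```python
import torch

model = torch.nn.Linear(32, 1).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # rescales gradients to avoid FP16 underflow

for step in range(10):
    x = torch.randn(64, 32, device="cuda")
    y = torch.randn(64, 1, device="cuda")
    opt.zero_grad()
    with torch.cuda.amp.autocast():    # run eligible ops in FP16
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()      # scale the loss before backward
    scaler.step(opt)                   # unscale gradients, then step
    scaler.update()                    # adapt the scale factor over time
```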
Then there are ways to reduce the overheads of launching kernels that are fairly small: just-in-time compilation, the NVIDIA Apex library, which has some fused operators, and the more recent NVIDIA CUDA Graphs support. We go through those in the tutorial as well; these are mostly just ways of fusing kernels together and getting better GPU utilization. A small example of the JIT idea follows.
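As a small hedged example of the JIT idea, TorchScript can fuse chains of elementwise operations into fewer kernel launches than eager mode would use (the function here is just an illustration, a tanh-style GELU approximation):

```python
import torch

@torch.jit.script
def fused_gelu_ish(x: torch.Tensor) -> torch.Tensor:
    # A chain of elementwise ops that the JIT fuser can combine into
    # fewer kernels than running each op eagerly, one launch at a time.
    return 0.5 * x * (1.0 + torch.tanh(0.7978845608 * (x + 0.044715 * x * x * x)))

x = torch.randn(1024, 1024, device="cuda")
y = fused_gelu_ish(x)
```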
A
They
can
give
you
good
speedups
and
there
are
other
tricks
as
well
that
I
won't
cover
here,
but
you
have
to
check
out
the
tutorial
tutorial
to
see
them
all
in
the
tutorial.
When
we
put
everything
together
just
on
a
single
device,
we
get
something
like
a
six
times
speed
up,
so
that
can
give
you
a
sense
for
really
sometimes
how
how
useful
it
can
be
to
go
through
this
before
trying
to
distribute
across
many
devices.
But let's say you've done that; now you're ready to do some actual parallel training of models. There are different ways to parallelize the training of neural networks. Data parallelism, on the left, is the most common: you take your data samples and partition or distribute them across GPUs or nodes, you replicate your model so that everybody has the same copy, and you do some synchronizations at the right points in time. A sketch of this in PyTorch follows below.
Data parallelism is the easiest way to speed up training, but nowadays, more and more, we see folks turning to model parallelism. Sometimes it's because you need to; in fact, I think that's the most common case. If you have a model that's just too big to fit in memory on a single device, you essentially have to distribute that model across devices. You can do things like in the middle here, where every layer of the neural network is itself partitioned across devices, or something like on the right, which is called pipeline parallelism.
I should really hurry up now, so I'll mostly skip this, but it talks a little about the most common way of doing this, which is synchronous data-parallel scaling: you're trying to use more and more GPUs to parallelize further, at larger scale. There are different ways to think about it: you can hold your batch size fixed, or you can grow your global batch size as you bring in more and more processors.
But there are different trade-offs here. As you increase the batch size, it can become harder to train models; but if you keep the global batch size fixed, you run out of compute per GPU as you further subdivide it, and you can run into network bottlenecks. That's essentially what's covered there. But more generally, how does this actually speed up training?
If you look at stochastic gradient descent: essentially, you're sampling batches of data from your overall data set, you're computing a gradient, and then you have a step size that says how much you adjust the parameters of the model to get a little bit better. So to converge faster, to get to the answer faster, since we're taking a sequence of steps, we can take fewer, bigger, and/or faster steps.
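Written out, with $\theta_t$ the model parameters at step $t$, $\eta$ the learning rate (step size), $B_t$ the sampled batch, and $L$ the loss, one SGD update is:

$$\theta_{t+1} \;=\; \theta_t \;-\; \eta\,\nabla_\theta\,\frac{1}{|B_t|}\sum_{(x,y)\in B_t} L\big(f_{\theta_t}(x),\,y\big)$$

Data-parallel training effectively enlarges $|B_t|$: each worker computes the gradient on its own shard of the batch and the results are averaged.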
So what we're usually doing in practice with data-parallel training is pushing up to larger batch sizes, which actually let us use larger learning rates, so we're taking larger steps; and larger batch sizes also parallelize better across more processors. That's the way you do it, but there are limitations.
You can't scale to an arbitrary number of GPUs. It's a bit problem-dependent, but it's definitely not a free lunch, and this slide basically says that. There are some rules of thumb for how you can increase learning rates as you increase batch sizes: sometimes you can scale the learning rate linearly with the batch size, or use a square-root rule, which is more motivated by how the gradient noise scales. But really, the situation can be more complex and depends on the problem.
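As a compact statement of those rules of thumb, with a base learning rate $\eta_0$ tuned at batch size $B_0$:

$$\eta(B) = \eta_0\,\frac{B}{B_0} \quad \text{(linear scaling)} \qquad\qquad \eta(B) = \eta_0\,\sqrt{\frac{B}{B_0}} \quad \text{(square-root scaling)}$$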
For a given problem, the situation might look more like what's on the lower right, where the optimal learning rate depends on the batch size according to some empirical relationship like that. I'll skip the other parts here, and I think these slides too, which just dive a little further into the sources of the challenges you face as you go to large batch sizes. Essentially, folks have found that at large batch sizes you tend to be more likely to overfit.
You tend to end up in sharper minima in your loss landscape, and sharp minima are very sensitive to differences between the training and test data. (Hey, excuse me, can you go somewhere else?) I'll skip this one here; there are other tricks to try. One thing to call out is that there are more modern optimizers for training
deep neural networks, things like LAMB. This LAMB optimizer is particularly popular for the most recent state-of-the-art, really large language models, which are the largest models in the world these days.
A
If
we're
talking
about
scale
and
pushing
on
scale
ml,
perf
and
ml
comments,
this
is
one
area
where
a
lot
of
innovation
happens.
So
ml
Commons
is
an
organization
that
publishes
these
ml.
Her
benchmarks
they're
the
basically
the
standard
performance
benchmarks
for
machine
learning
in
Industry
these
days.
A
If
you
look
at
the
latest
results
now,
it's
kind
of
the
point
where
you
can
Train
resnet
50
in
like
12
seconds
and
they're,
pushing
up
to
4
000
accelerators.
We
got
involved
in
ml
Commons
to
help
develop
an
HPC
Benchmark
Suite.
So
here
we
drew
from
scientific
applications.
A
I
list
them
here,
but
I'm
not
I'm,
not
going
to
talk
about
them
in
depth.
But
these
are
interesting
things
you
make
applications
you
may
have
heard
about
before
we've
been
doing
some
releases,
so
ml
per
benchmarks
are
organized
with
these
submission
rounds.
Where
participants
come
from
all
around
the
world
on
their
own
with
their
own
HPC
systems,
they
measure
results
on
their
systems
and
and
things
get
published
during
super
Computing,
I
think
I'll
skip
the
rest.
A
Maybe
one
yeah
one
other
thing
to
say
here
is
that
this
has
been
a
really
valuable
experience
for
us
at
nurse.
At
the
last
submission
round,
which
was
published
at
supercomputing
2021,
we
got
to
use
Pearl
mutter.
We
had
really
nice
competitive
results,
leading
in
some
categories
or
like
close
to
leading
in
in
some
others,
and
it
was
a
really
great
opportunity
for
us
to
understand
the
performance
of
our
systems,
particularly
at
scale
and
ShakeOut
issues,
and
find
problems
that
need
to
be
fixed.
A
Then
I
just
have
a
few
examples
of
other
kind
of
state-of-the-art
large
scale.
Things
which
I'll
go
through
really
quickly
so
Megatron
touring
is,
is
it's
essentially
a
code
base
with
Nvidia
and
Microsoft
a
code
base
that
supports
really
really
large
language
model,
training
and
various
forms
of
parallelism?
There
was
a
bit
of
press
around
this
530
billion
parameter
model,
which
at
least
at
the
time
was
the
largest
I.
A
Don't
know
if
it
still
is,
but
it
was
state
of
the
art
and
in
some
natural
language,
processing
tax
tasks
and-
and
this
is
an
example
of
where
they
combine
all
forms
of
parallelism,
so
eight-way
tensor
parallelism.
That's
each
layer
of
a
model
is
partitioned
across
eight
gpus
on
a
node,
then
there's
that
pipeline
parallelism
across
nodes,
so
different
layers
of
a
model
are
now
across
35
different
nodes
and
then
on
top
of
that
they
also
have
data
parallelism,
replicated
up
to
thousands
of
gpus.
A
So
pretty
impressive
stuff,
and
you
can
read
more
at
those
blogs,
then
some
science
results
from
some
of
of
our
colleagues.
You
may
have
heard
about
these
before,
but
this
one
is
basically
doing
self-supervised
learning
for
Sky
surveys
to
detect
these
gravitational
lensing
events.
Peter
Harrington
is
one
of
the
authors
and
and
some
others
at
the
lab
and
yeah
I.
Think
like
an
important
takeaway
here
was
that
they
could.
They
could
do
pre-training
techniques
that
are
self-supervised
and
then
fine-tune
on
things
that
they
want
and
get
better
results
out.
A
Forecast
net
is
a
work
between
some
folks
here,
as
well
as
Nvidia,
and
maybe
some
others
too.
But
JD
was
our
former
post
operating
a
lot
on
this
and
then
current
post-doc,
Shashank
and
Peter
Harrington
work
a
lot
on
this
as
well.
So
this
is
basically
doing
weather
forecasting
using
some
fancy,
state-of-the-art
Fourier
operator
type
methods
and
basically
giving
really
state-of-the-art
results
in
terms
of
in
terms
of
machine
learning
methods
on
par
with
numerical
methods,
but
much
much
faster.
A
So
then
I
think
I'll
just
conclude,
since
I
I'm
actually
a
little
bit
over
time,
just
say
that
you
know
AI
for
science.
It
requires
super
computer
scale
capabilities,
we're
trying
to
deliver
this,
it's
great
to
see
all
the
growth
and
sophistication
and
maturity
in
science.
We're
excited
to
see
who
comes
next
and
feel
free
to
reach
out
if
you're
looking
for
jobs
or
want
to
collaborate.
That's
all
thanks.