From YouTube: Scientific Deep Learning on Perlmutter
Description
Part of the Using Perlmutter Training, Jan 5-7, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/using-perlmutter-training-jan2022/
Peter Harrington: My name is Peter Harrington. I'm a machine learning engineer in the Data and Analytics Services group at NERSC, and this is going to be the last talk of today's training. In this one, you'll hear all about our scientific deep learning ecosystem on Perlmutter.
So I'm going to start with some background and a brief overview of different deep learning for science applications and how we see deep learning being used at NERSC. Then my colleague Steve will go into details on the actual deep learning software stack we have on Perlmutter: how to use the frameworks we have, how to get them to be performant, and how to do optimization on your models. Then, finally, I will come back at the end to discuss some additional useful tools.
So deep learning, as you may have heard, is very exciting. It's a subset of machine learning and AI that relies on deep neural networks for its computations. The reason these deep neural networks seem to work well is that they partition a problem into a sequence of steps, usually called layers: you feed an input in, and each successive layer processes the input and the features extracted from it to produce an output. So it can break a complex problem down into simpler, sequential computations.
The diagram on top is a basic, vanilla neural network, but of course, with all the research and exciting applications we've seen in deep learning, there are now all sorts of different neural network architectures designed specifically to process things like text data or image data, and these are being adapted even further in scientific domains.
These computational structures have actually been around for a long time — the first machine learning, deep-learning-type network was invented in the 50s — but only recently have we seen them really start to take off, and this is mainly due to two key factors.
So that's been extremely useful for deep learning. The other great advancement has been the advent and increasing availability of accelerators — GPUs in particular, linear algebra accelerators that can really help with the internal computations that happen inside these neural networks. As GPUs have become more and more advanced and widespread, we've seen an even greater ability of these deep learning models to process complex data and achieve complex tasks.
A
Besides
these
two
main
key
factors,
there's
also
been
a
lot
of
hard
work
by
the
community,
doing
algorithmic
advances
developing
different
optimizers,
different
ways
of
regularizing
or
normalizing
deep
learning
training.
So
all
three
of
these
combined
have
have
contributed
to
this
excellent
performance
of
deep
learning.
It can accelerate expensive computational simulations, and so we're seeing adoption on the rise in all sorts of scientific communities. There's rapid growth in machine learning and science conferences, we're seeing recognition of achievements in AI — like the 2018 Turing Award or the Gordon Bell prizes in 2018 and 2020 — and, importantly, we're seeing HPC centers awarding allocations for AI and optimizing next-generation systems like Perlmutter for AI workloads. This is a sign that the DOE is investing heavily in AI for science as a result of deep learning's success.
So, just to flesh out a little more explicitly some examples of what machine learning in science can look like: we have a whole host of really powerful feature extractors from the deep learning literature nowadays. These come from computer vision, trained on natural image datasets, but we can easily adapt them to something like sky surveys and use those feature extractors to help us process these very large datasets — maybe to help us find rare objects, or to build good classification or regression models.
A
Another
exciting
area
is
something
like
generative
modeling,
where
maybe
you
need
to
synthesize
some
some
high
resolution
or
fine
details
from
a
course
input,
and
this
is
something
that's
very
exciting-
for
applications
in
simulation
heavy
domains.
So
something
like
computational
fluid
dynamics
can
benefit
greatly
from
models
like
that.
Another exciting area is graph neural networks. These are models adapted specifically for graph-structured data — something like a social network, modeling the connections between people — and they can also be used to model the connections between, say, atoms in a molecule or lattice structure. So these graph neural networks, if you adapt them in the right way, can be really useful for something like catalyst research in materials design, but obviously the possibilities are sort of endless here.
And we at NERSC see this great diversity, obviously, because we're at Berkeley Lab, where there's a lot of different, exciting research happening across all sorts of science areas. One particular way we track this is with our machine learning surveys, which will be happening again this year. These surveys track the current use cases of our machine learning stack, and we try to identify areas to improve the user experience and performance, and maybe inform our strategy and anticipate future workloads.
From these surveys we see great diversity: cosmology, chemistry, biology, fusion — all sorts of research areas are applying machine learning. The dominant applications tend to be things like classification or regression problems, but we also see some exciting work with generative modeling, segmentation, and reinforcement learning.
So what actually goes into a deep learning workload? Obviously you need to train your model, and this phase is typically more iterative and interactive — sort of R&D.
You need to actually get your dataset, process it, set it up, stage it, and feed it to your network for training, and this can be very compute- and data-intensive, especially if your problem is a large-scale problem. One common use case we see in HPC is the model selection process, or hyperparameter optimization.
This requires a lot of resources because you need to search over the full model space for your best possible model. There are lots of different little knobs to tweak, and deep learning pretty much always requires some tuning, so this typically involves a lot of parallel training applications running concurrently — a great fit for HPC resources.
And then, finally, once you have a good, tuned model that's all trained, you hopefully want to actually use it for something useful, and that would be using it for inference — things like production analytics — so it tends to be more high-throughput.
This can be either offline analytics or even deployed in real-time data processing for real-time experiments. So yeah, there's a lot of different things you can do with deep learning.
A very common thread in modern deep learning is the need for scale — the ever-increasing need for scale. One obvious reason for that is you want to rapidly prototype your model. As I said, you need to tune things; you need to try a lot of different model configurations to get a good model, so you want a good turnaround time on your training, preferably in the minutes-to-hours range. But we see time and time again that a lot of these big models are taking days or weeks to train.
So it's very important to be aware of what scales you might need in a deep learning workload, and that is usually very dependent on the dataset size that you have and the type of data that you're processing. In scientific applications, we typically have pretty complex datasets — high-dimensional or multivariate.
If you look at trends in industry or traditional machine learning research, we're seeing a pretty clear increase in scale. The plot on the left shows the size of state-of-the-art natural language processing models in terms of the number of parameters — the unit is billions of parameters — and if we go to the most recent one, this Megatron-Turing model from NVIDIA already has over half a trillion parameters. So it's pretty gigantic.
We expect models — maybe not of that full scale, but close to it — are going to be used on Perlmutter, so it will also be interesting to follow up with this year's surveys. In general, as we tackle more and more complex tasks with our models, we typically need to grow the model size — the model capacity — and to do that, we need to scale up the training process in some way, because it's impossible to train a gigantic model on just a single compute node.
The most common way of parallelizing your training process is data parallel training. The way this is done is to just partition your dataset — or your batch of samples — across the different processors in your training job. Each of your processors, your GPUs, will have a copy of the model on it, and it'll be able to run the forward pass and training as normal.
Another option is model parallel training. Instead of partitioning your data across the different processors, you partition the model itself. This is useful if you have one of those gigantic models, like those language models with a ton of parameters: you can put some parameters on one processor and others on other ones. You feed your data into your set of processors as normal and then pass results around as needed.
One common subset of model parallelism is layer pipelining, where you split the sequential layers of your network onto different processors, but that one does take some consideration to make it efficient. In general, model parallel training tends to be less common — it's just a little bit more tricky to get set up — but there are some nice methods out there for getting it working.
So yeah, as I said, data parallel is the most common, and that's what we see our users using the most. Typically people just use the built-in, native parallelism that's set up in TensorFlow or PyTorch, but another leading non-native distributed training framework is Horovod, which we recommend. All of these use either MPI or NCCL — NVIDIA's communication library — for communication, and they all perform quite well.
Of course, there's no free lunch here, so it does take some consideration to actually scale up our training effectively. In data parallel training, what we usually want to achieve is weak scaling: converging faster by increasing the global batch size — increasing the number of samples that we're feeding to our model at each training step. In this way, we can take bigger steps, but take fewer of them. The idea is that with more GPUs, as we grow the batch size, we have more samples that we're looking at, so we have a better estimate of the actual gradient with respect to our loss function. What that allows us to do is take one large step, maybe with a larger learning rate than what we would normally be able to use with a smaller batch size. So with reduced noise, we can take larger steps.
Now, to get this to actually work in practice, there are some caveats. You need to be careful to make it converge stably; you have to tune things, especially the learning rate. You usually have to warm up the learning rate and scale it according to some scaling rule. There's been a lot of research in this area, so there are all sorts of tricks — adaptive optimizers, architectural adjustments, and so on — for actually scaling up training, and I definitely recommend visiting our Deep Learning at Scale tutorial that we gave at Supercomputing for lots of tips.
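To make the warmup-plus-scaling recipe concrete, here is a minimal sketch — not from the talk, and with made-up placeholder values — of the commonly used linear scaling rule combined with a linear warmup:

```python
# Minimal sketch of the linear scaling rule with warmup (illustrative values).
base_lr = 0.1          # learning rate tuned for the base batch size
base_batch = 256       # batch size that base_lr was tuned for
global_batch = 2048    # batch size after scaling out to more GPUs
warmup_epochs = 5

scaled_lr = base_lr * global_batch / base_batch  # linear scaling rule

def lr_at_epoch(epoch: int) -> float:
    """Ramp linearly from base_lr to scaled_lr during warmup, then hold."""
    if epoch < warmup_epochs:
        return base_lr + (scaled_lr - base_lr) * epoch / warmup_epochs
    return scaled_lr
```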
Steve Farrell: Okay — just let me know if it's not showing up or if you can't hear me. It's good? Yeah, okay.
B
It's
fine,
so
peter
talked
a
bit
about
the
things
like
how
deep
learning
is
going
to
be,
transforming
our
scientific
workloads
sort
of
why
we
think
it's
going
to
be
transforming.
So
we
certainly
see
this
as
an
important
emerging
workload
at
nurse
and
hpc
in
general.
We're mainly going to be talking about the software and tools here today, but of course, the real overall vision has to include the hardware as well as methods. As far as procuring new systems goes, we're not going to talk about how we inform that today — that's a whole NERSC-wide effort — and methods will also be important too.
B
We're
not
gonna
talk
about
sort
of
maybe
how
we
come
up
with
new
methods
for
deploying
on
our
systems
today,
but
we
we
do
also
do
research
in
those
kinds
of
spaces
and
nurse
computers.
We
have
a
very
highly
diverse
user
base
in
terms
of
the
domains
and
the
applications,
and
we
do
know
that
machine
learning
and
deep
learning
can
potentially
transform
many
different
aspects
of
scientific
computational
workflows.
B
So
what
this
means
is,
you
know
again
we're
thinking
about
this
emerging
workload.
We
really
have
to
think
about
even
a
diverse
set
of
things
within
the
machine
learning
and
deep
learning
space,
the
kinds
of
things
that
we
might
need
to
support.
So
how
do
we
do
this
and
again
narrowing
in
more
on
the
software
and
tools
kind
of
thing?
Well,
we
we
try
to
deploy
optimized
software
installations.
We
work
closely
with
vendors.
So
of
course,
today
we're
talking
about
pro
mutter
we're
working
closely
with
hpe
and
nvidia
as
well.
B
We
do
a
bit
of
testing
and
benchmarking
of
our
system,
so
we
do
have
machine
learning
and
deep
learning
specific
tests
in
our
reframe
regression
testing
framework,
some
benchmarking
efforts
which
I'll
touch
on
in
a
little
bit.
We
do
our
best
to
put
out
good
documentation
and
do
training
events
like
this
one
today
to
to
help
educate
folks
on
how
to
use
our
systems.
So the first layer, of course, is the hardware. You've heard enough about Perlmutter over the past few days, I think, so I'm not going to give you the whole specs, but I'll just call out a couple of important things. Perlmutter is our first system at NERSC with GPUs and, not coincidentally, it's our first system that's really good for these new kinds of deep learning workloads. Most of that comes from the specific chip that we have in there — the NVIDIA Ampere A100 GPU — and the fact that we have over 6,000 of these really makes this a great system for deep learning.
So, of course, we've been very excited about it. I didn't introduce myself — sorry, I forgot. I'm Steve Farrell, the other machine learning engineer in the DAS group, same as Peter, so we both work on these sorts of things, on supporting the system at the software level. So yeah, we're very excited about Perlmutter.
It's been very exciting to see the ways that we're already using it, and I'm excited for the ways that everybody will be able to use it in the coming years. As I said, it's a nice system with over 6,000 A100s — in fact, NVIDIA, in some press releases, called it the world's fastest AI supercomputer. Obviously there's maybe a little bit of propaganda to that, but it holds up if you just consider the aggregate compute performance for deep learning of over six thousand A100 GPUs.
Okay, so now a little bit on our strategy for deploying the deep learning software stack. We try to take care to provide functional, performant installations of the most popular frameworks and libraries. We're not going to do optimized builds of every single tool or framework out there, obviously, but we do use things like our NERSC machine learning user survey to inform us on what our user base cares about — Peter mentioned this, but I'll plug it again.
So we try to support the things that are most popular, but we also want to enable flexibility for users to really do their own customization and deploy their own solutions. I think, particularly for this kind of user base and these kinds of workloads, this is important, because there are always new tools coming out every day. In terms of frameworks, what it comes down to today is that we have a deeper level of support for TensorFlow and PyTorch.
B
But
folks
can,
I
think,
pretty
much
deploy
whatever
they
want
and
in
terms
of
distributed,
training,
libraries,
we
support
things
like
uber's
horovod
and
the
native
pytorch
distributed
library,
and
then
things
that
peter
and
I
don't
work
as
much
directly
with,
but
involve
more
of
the
nurse
staff,
useful
services
and
tools
for
for
deep
learning
things
like
jupiter
and
shifter.
B
Okay.
So
how
do
you
use
the
deep
learning
software
stack
that
we
deploy
much
like
with
anaconda
python
or
compilers,
or
anything
like
this?
We
have
modules
that
you
can
simply
load
and
it's
you
know
almost
the
same
as
it
looked
like
on
corey
now
on
perlmutter,
so
you
can
do
muzzleload,
tensorflow
module
load,
pi
torch.
B
One
thing
that
I'll
mention
here
that
may
not
be
obvious
to
everybody
is
that
these
modules
are
actually
complete
python
installations.
You
don't
have
to
do
module
load,
python
and
then
module
load
tensorflow.
You
can
just
do
module
load
tensorflow
and
in
fact
that
also
means
that
you
can't
actually
compose
these
things,
so
you
can't
take
things
from
anaconda
python
and
from
our
tensorflow
and
from
our
pi
torch
modules.
I'd say use it sparingly, but you can use pip with the machine learning environments: if you just want to install one pip package on top of our modules, you can do that with pip install --user. We also set the PYTHONUSERBASE environment variable in those modules, so that directory will be unique for that module, and you'll still have those packages tomorrow when you load the module again. The environments that we install are actually installed with conda, which means you can clone them as environments.
You do have to get the path to where they're installed, which you can check with something like module show, but then you can do something like conda create --clone — or, of course, you can create your own custom environments from scratch. There's more information on these methods in our docs. These methods are a great way to do deep learning on Perlmutter, and we also support containers via Shifter, which is the current container solution on Perlmutter. It's easy to use, and it's also very performant — I don't know if this was mentioned already, but the first Top500 listing for Perlmutter was done with a Shifter container. You can check which images are currently available on Perlmutter with the shifterimg images command, which lists every image available.
If the container you want is not there but it's on Docker Hub, it's very easy to just call the pull command, just like with Docker, and you can run things interactively or in your batch scripts. If you do run in sbatch scripts, you can use the image argument from our Shifter plugin to specify the container at the sbatch level, and that just does some pre-caching of the container.
Is there a question? Or somebody just unmuted — okay. A little bit on best practices for using Shifter here: NVIDIA is really the go-to place for optimized containers for Perlmutter, because they obviously optimize for their own GPUs. They have these NGC containers for PyTorch and TensorFlow, which always have optimized versions, the latest versions of libraries, and many different versions — in fact, they put out a new container version every month — and we try to provide those already on Perlmutter. But we also have our own images.
I should have put the names here, sorry, but you can see them on our docs — something like nersc/pytorch, which is a little more similar to our modules in that we just install a few packages on top for our users. You can also build your own containers and do similar things with customization. One drawback with Shifter on Perlmutter is that you can't write to images, so to add things you either have to build a container, maybe on your laptop, or you can still use this pip install --user trick.
That's a workaround, so if you want to do that, make sure you set PYTHONUSERBASE appropriately — some path where you want your custom packages to go — and then, if you do pip install --user, you'll be able to write there. Maybe it'll be in your home directory, depending on how you set it.
Okay, so some more general guidelines on using the stack at NERSC before I get into the framework-specific things. I may have shown this already, but it'll come up a few times: that's the link to our documentation page for machine learning, and there are subpages for the different tools and frameworks. We do recommend that you use our provided modules or containers if appropriate. Sometimes they'll have features that may not be available if you just do conda install pytorch, or they may have newer versions of libraries, and they may be more performant — but of course you're still free to customize as you like. Here are some more pages on our docs. So, sometimes things are broken:
Perlmutter is still being deployed and issues come up, so refer to the current known issues page, or our machine learning known issues page, if you're having problems. If what's happening to you is not on there and you need additional help, please feel free to open a ticket at help.nersc.gov and we'll help you out.
Okay, I'll go really quickly through these, but we do have a dedicated page for PyTorch with some recommendations. If you're doing distributed training in PyTorch, we recommend using the DistributedDataParallel utility that's in the native PyTorch distributed library, and we recommend using the NCCL backend for optimized GPU communication.
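As a rough sketch of that recommendation — assuming the rank and address environment variables are provided by your launcher (e.g., Slurm plus torchrun or similar), which is an assumption rather than NERSC-specific setup:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize the process group with the NCCL backend for GPU communication.
# RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are assumed to be set by the launcher.
dist.init_process_group(backend="nccl")

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(64, 4).cuda()        # stand-in for your real model
model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced automatically
```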
Did I lose the TensorFlow slide, or did I skip it? I must have skipped the TensorFlow one. Okay, really quick: there's another page for TensorFlow as well. For distributed training with TensorFlow, we recommend using Uber's Horovod library — it's just really easy to use and launch with Slurm, and there are some examples from Horovod here — but TensorFlow also has some native distribution strategies that are also good. In particular, I think there's the mirrored strategy, which would work well for filling up a single node. Let me make sure I did do these slides — okay, good.
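For reference, the standard Horovod-with-Keras pattern looks roughly like this — a minimal sketch of Horovod's documented usage with a placeholder model, not the talk's example code:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU, launched e.g. via srun

# Pin each process to a single local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(4)])  # placeholder model
# Scale the learning rate by the number of workers and wrap the optimizer.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="mse", optimizer=opt)

# Broadcast initial weights from rank 0 so all workers start in sync.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
```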
So now I'll switch a little bit to talking about performance.
What I tried to cover there was more just the functionality kinds of things — how to get up and running so you can run your workflow, whatever that may be — but good performance for these workloads is essential.
So performance is important for those kinds of workloads, but also, if we think about production workloads — maybe folks are starting to use AI in their actual science production workloads — it's important to meet the computational constraints there; maybe they're doing some real-time computing with data coming from an experiment. And performance can mean a few different things. For you, developing a new model and trying to train it to solve a problem, it's how well you're utilizing the resources — and maybe, while Perlmutter is free for now, that's less important to you, but eventually you will be charged for your computing allocation on Perlmutter, and then you'll need to care about how you're spending your hours. And for NERSC as a whole, of course, we care about the overall system throughput for all of our users who are trying to deploy things at the same time.
So performance is important, and it's also true regardless of your type of workload — whether you're running on a single GPU or thousands of GPUs, whether you're using a Jupyter notebook or running things in batch scripts. Ideally, the deep learning frameworks would give you everything: maximal flexibility, ease of use, and the best performance out of the box. They've come a long way — they're pretty good — but of course it's not always the case. There can definitely be performance limitations or pitfalls.
So I would strongly encourage everybody: I think it's always useful to spend at least a little bit of time evaluating the performance of your workload, because you may find that you're actually not using the system very well and you could potentially have a lot to gain, especially if there's an easy fix you can make. That can really boost your productivity.
So first, a little bit on us: how do we evaluate system performance? We run various kinds of functionality tests and benchmarks. NVIDIA has things like the NCCL tests, which let us test NCCL's all-reduce bandwidth and things like that, and we run some unoptimized benchmarks — for example, models straight out of PyTorch's torchvision library.
B
The
plots
on
the
right
show
resnet50
scaling
on
promoter
with
just
synthetic
data,
so
no
real,
I
o,
but
still
shows
with
something
a
model.
That's
you
know
not
super
optimized
and
not
a
super
optimized
setup
you
you
can
get
pretty
decent
scalability
up
to
whatever
this
is
like
512
gpus.
B
One
thing
that
I
spend
a
lot
of
time
on
is
ml,
perf
hpc,
which
is
this.
I
co-chaired
this
group
and
we're
doing
deep
learning
benchmarking
for
hpc
science
as
part
of
ml
comments
and
what
we
do
there.
Basically
is.
We
have
benchmarks
and
we
measure
time
to
train
models.
We
also
measure
the
system
throughput,
so
training
many
models
concurrently
and
what
is
the
like
models
per
minute
you
can
achieve.
B
We
do
submission
rounds
so
far
annually.
We
did
one
last
summer,
which
was
actually
the
second
submission
round
and
we
use
real
scientific
or
these
scientifically
motivated
applications.
You
may
have
heard
of
some
of
these
before
deep
cam
is
a
climate
segmentation.
Application
cosmo
flow
is
3d.
Convolutional
regression
open
catalyst
is
a
is
a
newer
one.
We
added
this
year,
which
is
a
graph
neural
network
for
atomic
systems.
This last summer, we actually submitted results using Perlmutter Phase 1, and I was really happy to see that it turned out to be highly competitive. We had some leading results, and some second-place or sub-leading results, for the benchmarks — a pretty good comparison to NVIDIA's Selene system, which is a good thing to compare to. Of course, it's not exactly the same — they have more network cards and other differences — but we're pretty happy with how it all turned out.
Okay, so you're deploying your deep learning workload on Perlmutter — what are some of the things that might be hurting your application performance? It can, of course, come in at multiple levels, whatever you're running. At the single-GPU level, a common thing is that maybe you're spending too much time in Python code, which is inherently single-threaded and interpreted, so you want to make sure as much of the workload as possible is on the GPU or parallelized.
You may also have unusual things in your model architecture that use kernels that are not yet well optimized by NVIDIA — that's another problem, a little bit harder to solve if it affects you. Then, in the distributed world, at multiple GPUs or multiple nodes, network communication can become a bottleneck; in some cases you may be able to just tweak some settings and improve things.
We sometimes see in science workloads irregularly sized scientific data samples — for example, atomic systems of different sizes — and when you have these kinds of things and you're trying to scale up on an HPC system, load imbalance can be a real performance limiter. We've seen that. And then there's the file system.
Besides just what your code is doing for data ingestion, the parallel file system itself can be a bottleneck. The kinds of read patterns in deep learning training workloads — many small random reads — just turn out not to be super friendly to these large parallel file systems like Lustre.
So how can you start to diagnose your performance problems? Well, we do recommend that you start simple — for example, checking GPU utilization — and you can do that with standard tools like the ones folks have already been recommending, things like nvidia-smi, or I think Rollin had some Jupyter tools that track utilization.
There are things like Weights & Biases, which does experiment logging and hyperparameter tuning, and that can do some system monitoring for you as well. If you want to run nvidia-smi on a job you've submitted, it's fairly easy to just ssh onto one of your compute nodes and run it interactively, or you can do something like the snippet I have here, where you run nvidia-smi in your batch script in the background and get the results in a CSV file, as a function of time.
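The slide's snippet isn't reproduced in the transcript, but a sketch of the same idea — polling nvidia-smi in the background and logging to a CSV — could look like this in Python (the query fields are standard nvidia-smi options; the file name is a placeholder):

```python
import subprocess

# Launch nvidia-smi in the background, sampling once per second into a CSV file.
# Start this at the top of your training script, then train as usual.
monitor = subprocess.Popen([
    "nvidia-smi",
    "--query-gpu=timestamp,utilization.gpu,memory.used",
    "--format=csv",
    "-l", "1",                       # repeat every 1 second
], stdout=open("gpu_utilization.csv", "w"))

# ... run your training here ...

monitor.terminate()                  # stop sampling when training is done
```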
Then, if utilization is low, you know there's something going wrong, and you can investigate deeper to figure out what it is. How can you diagnose that? Continuing on: the next stage is to use some kind of profiler. Folks have already been talking about Nsight Systems — this is, of course, a very standard NVIDIA tool. It can do a lot: you can collect the execution data, and it gives you nice visualizations of the execution timeline.
You can annotate things in your model and in your training to see, in the visualization: this is where I was loading data, this is where I called the gradient computation, and so on. So it's definitely a very valuable tool, and it works well for deep learning workloads. There's also Nsight Compute, but I'm not really recommending that to you as a way to diagnose performance problems.
B
It's
probably
only
useful
if,
if
you're
really
an
expert,
but
that
can
give
you
lower
level
kernel
level
information
about
what
your
application
is
doing
and
then
there
are
framework
specific
pro
profilers
that
come
with
like
pytorch
and
tensorflow,
and
these
are
getting
better
all
the
time
they
give
you
a
nice
bit
of
information.
You
can
view
things
in
in
tensorboard
and
they
try
to
give
you
actually
high
level
recommendations.
B
I
mean
you
can
do
that,
because
this
is
domain
specific,
so
tensorflow
has
one
pytorch
has
one
nvidia
also
has
this
dl
prof
profiler
that's
kind
of
similar
in
some
ways,
so
we
you
know,
we
suggest
you
try
those
out.
They
may
be
a
good
place
to
start,
but
sometimes
we
notice
that
the
high
level
recommendations
they
may
not
quite
be
accurate.
B
They
can
be
misleading
sometimes
so
you
may
want
to
kind
of
mix
things
we
may
want
to
go
to
inside
systems
if
you
want
to
really
just
be
able
to
see
what's
going
on.
So
here
is
a
little
view
of
what
insight
systems
looks
like
as
you
visualize,
an
actual
deep
learning
training
application.
This comes from our tutorial — we're going to plug our tutorial from SC several times here, because it has a lot of great material on profiling and optimizing deep learning training, so definitely check that out. The example we use in the tutorial is basically the one we're using today for the hands-on, but the tutorial goes into a lot more depth; we're not doing the profiling and optimization stuff today, but for that example in the tutorial, it just works.
I apologize if that's hard to read, but the idea is they give you this view of the time breakdown of your application — how much time is spent in kernel launching, or loading data, and so on — and at the very bottom right of the PyTorch one is an example of a high-level performance recommendation. It says: you're spending all this time in the data loader, maybe you can tweak x, y, and z. So certainly try those out. Now, some other tips for improving performance.
Most of this is just lifted again from our tutorial. For the data loading stuff, there are some obvious things to tweak. There's the num_workers setting in PyTorch — sorry, this is quite specific, I realize — but for the DataLoader in PyTorch, you can choose how parallel the input file reading is, which turns out to be one of the most important settings for optimizing this. There are things like pin_memory, which is a setting that can help with the host-to-device transfers.
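In code, those two knobs sit on the PyTorch DataLoader constructor; a minimal sketch with a placeholder dataset and illustrative values:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 16))  # stand-in for a real dataset

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,     # parallel worker processes reading the input data
    pin_memory=True,   # page-locked host memory speeds host-to-device copies
)
```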
If you can, you can consider staging your datasets onto the nodes. We don't have SSDs on the nodes on Perlmutter, but you can use the per-process memory — the actual RAM of your running processes — or /tmp, which is a RAM disk: it uses RAM, but it looks like a file system, and it's shared for that node, so all the workers on the node can share it. It's up to 126 gigabytes.
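A minimal sketch of that node-local staging idea under Slurm — SLURM_LOCALID is a standard Slurm variable, but the paths and the one-copy-per-node logic here are illustrative assumptions:

```python
import os
import shutil

src = "/pscratch/mydataset.h5"   # hypothetical dataset on the parallel file system
dst = "/tmp/mydataset.h5"        # node-local RAM-disk copy

# Have only one process per node do the copy (local rank 0 under Slurm).
if int(os.environ.get("SLURM_LOCALID", "0")) == 0:
    if not os.path.exists(dst):
        shutil.copy(src, dst)
# In a real job you would barrier here (e.g., dist.barrier()) so the other
# ranks on the node wait until the copy has finished before reading dst.
```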
B
So
if
you
have
a
large
data
set,
you
may
actually
need
to
partition
your
data
set
across
nodes
to
fit.
This
can
have
some
performance,
sorry,
some
convergence
convergence
implications
and
how
that
affects
how
you're
doing
like
global,
shuffling
and
every
epoch
and
things
like
that.
But
in
practice
we
see
that
usually
that's,
not
a
major
issue.
So
partitioning
your
data
set
across
nodes
can
work
in
practice.
There are also ways to optimize the host-to-device transfers, through CUDA streams and some other things. For tuning single-GPU performance, I always want to stress that this is important: it's not a good idea to just jump to "my training code is slow, so I'm just going to run it on hundreds of GPUs and then it's faster."
B
You
can
get
a
lot
out
of
just
looking
at
what's
going
on
at
the
single
gpu
level,
so
try
things
like
using
mixed
precision:
training,
try
it
compiling
your
model,
so
both
frameworks
have
different
ways
of
doing
these,
but
they're
fairly
straightforward
to
put
into
your
code.
Sorry,
I
don't
have
all
the
details
here,
but
we
can
help
you
if
you
need
help.
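As one example of what "fairly straightforward" looks like, here is a minimal sketch of PyTorch's automatic mixed precision (torch.cuda.amp), with a placeholder model and data:

```python
import torch

model = torch.nn.Linear(64, 4).cuda()            # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()             # scales loss to avoid fp16 underflow

x = torch.randn(32, 64, device="cuda")
target = torch.randn(32, 4, device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():                  # run the forward pass in mixed precision
    loss = torch.nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()                    # backward on the scaled loss
scaler.step(optimizer)
scaler.update()
```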
B
Nvidia
has
some
stuff,
like
this
apex
library
for
pytorch
things
like
fused
optimizers
that
are
are
good
to
use,
fusing
optimizer,
fusing,
kernels
and
optimizers.
Basically
just
helps
you
so
you're,
not
launching
so
many
small
kernels
and
things
can
run
faster
at
the
distributed
kind
of
scenario
to
tune
that
performance.
There's
things
you
have
to
consider
like
this.
This
trade-off
between
efficiency
and
runtime.
It's
always
an
important
thing
to
think
about.
B
As
you
take
a
workload
and
scale
it
up
to
more
gpus,
generally
you're
going
to
be
trading
off
some
notion
of
efficiency
for
runtime,
so
things
may
run
faster,
but
you
may
be
burning
through
your
hours
faster.
So
you
kind
of
have
to
tune
this
for
your
needs
and
then
there
may
be
settings
in
the
communication
backend
actually
in
the
libraries
that
you
can
tweak.
Peter Harrington: Great — well, thanks, Steve. So by now, you should have a pretty good overview of what's available on Perlmutter for deep learning, how to get it working well, and the steps you can take to get it optimized and performant.
So, obviously this was covered very well by the previous talk, but Jupyter is an excellent service for deep learning, especially for interactive work. These notebooks are a very popular service, with hundreds of our users — it's a favorite way for people to develop machine learning code in particular.
I personally like Jupyter for interactive things — debugging, analysis, quick testing if you're developing some custom operation for your architecture, or something like that. It can be very useful for visualizing intermediate results and so on. And our Jupyter environment, thanks to the hard work of NERSC staff, is very flexible and, I think, easy to use, so you can run your workloads on it quite easily — you can use dedicated Perlmutter GPU nodes.
A
We
have
some
pre-installed
deep
learning
software
kernels
available,
which
are
just
based
on
our
module
installations
for
pi
torture
tensorflow,
but
you
can
also
use
your
own
custom
kernel
and
that's
quite
easy
to
get
set
up.
We
have
documentation
on
how
to
do
that.
A
Another
one
that
is
excellent
is
tensorboard.
This
is
probably
the
most
popular
tool
for
kind
of
tracking.
What's
going
on
in
your
deep
learning
training,
so
visualizing
and
monitoring
your
experiments
different
like
metrics
or
results
from
your
model
as
it
trains
both
tensorflow
and
pytorch
communities
are
very
enthusiastic
for
tensorboard.
They
both
support
it
very
nicely.
A
You
can
do
cool
things
like
not
just
visualize
things
like
you
know
your
loss
throughout
training,
but
you
can
also
plot
like
the
distribution
or
histograms
of
your
actual
weight
parameters.
You
can
see
if
maybe
your
weights
in
some
layer
are
collapsing
to
zeros
or
off
to
infinity
or
something
so
it's
easy
to
catch
little
debugging
tips,
and
then
you
can
also
visualize
custom
plots
that
you
make.
If
you
have
a
specific
like
science
metric
that
you're
trying
to
satisfy
you
can
plot
that
throughout
training,
so
overall,
very
helpful.
This is just a little Python module that we've written, available in the custom Jupyter kernels that we've installed: you simply import it, then load it and start TensorBoard, and that will give you a link that you click on in the notebook, which brings you to your TensorBoard server.
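The NERSC helper module's name isn't given here, so as a generic alternative, the standard TensorBoard notebook magics — available in any Jupyter setup with TensorBoard installed — look like this:

```python
# In a Jupyter notebook cell:
%load_ext tensorboard
%tensorboard --logdir ./logs   # point at the directory your training writes logs to
```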
Beyond those two, which are more useful for day-to-day interactive work: if you're doing HPO — hyperparameter optimization — there are a lot of different options out there. Again, this is a pretty critical step in deep learning development, because you pretty much always have to tune your model.
There have been a lot of different libraries and methods developed over the years to do this. It's usually, as we said, computationally expensive — you need to train a lot of models — so it's a good fit for something like Perlmutter, but with these libraries you can also save some time: they have some nice, more advanced optimization algorithms that will help you reduce the number of trials you have to run.
A
For
example,
if
you're
just
running
random
search,
you
might
have
to
run
100,
but
these
tools
contain
you
know,
alternative
methods
that
maybe
can
be
more
efficient
in
selecting
the
next
trials.
So
generally
we
we
can
just
you
know
you
can
use
whatever
you
want
to
and
if
you
run
into
trouble,
you
can
just
come
to
ask
for
any
help.
All
these
example
frameworks
I
have
listed
here
ray
tune
weights
and
biases
sig
opt
and
optuna,
there's
many
more
but
yeah.
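To give a flavor of these libraries, here is a minimal Optuna sketch — the objective below is a toy stand-in for a real training run, not part of the talk's material:

```python
import optuna

def objective(trial):
    # In a real study these suggestions would configure and train your model,
    # returning a validation metric to minimize.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    depth = trial.suggest_int("depth", 2, 8)
    return (lr - 1e-3) ** 2 + depth * 0.01   # toy objective

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)        # smarter than pure random search
print(study.best_params)
```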
Beyond those additional tools, I just want to briefly mention some other outreach and resources for you to learn more about deep learning in general. This will be more targeted at people who are maybe new to deep learning, or new to deep learning applied to science. A great resource is our Deep Learning for Science School, which happened in 2019 and 2020.
A
The
2019
was
a
great
event.
It
was
actually
in
person,
so
there's
there's
videos
of
the
presentations
and
slides
and
code
exercises
all
online
in
2020.
We
ended
up
doing
a
webinar
series,
so
lots
of
different
topics
were
covered.
We
actually
got
into
a
fair
amount
of
depth
and
again
sort
of
targeting,
like
actual
scientific
applications
and
aspects
of
deep
learning,
workflows
that
are
relevant
for
science.
A
So
so
this
webinar
series
is
also
a
great
resource
and
all
those
talks
are
available
online,
as
we've
plugged
multiple
times
and
will
plug
again
right
now.
The
deep
learning
at
scale
tutorial
at
sc
is,
I
think,
a
great
resource
if
you're
maybe
familiar
with
ish
with
deep
learning,
but
you
you
aren't
really
familiar
with
how
to
get
it
to
scale
up
on
our
systems.
We've done this tutorial with NVIDIA, and we've done it with Cray in previous years. We've presented it at SC for a lot of the past few years, as well as at some other venues, and it has really good, detailed lectures and hands-on material — examples of doing distributed training, profiling, and optimization. Our SC21 material was all done on Perlmutter, so it's a good basis; we're actually going to use that for today's hands-on exercises. But, as Steve said, please go visit the full SC21 tutorial if you want to learn in much more depth about things like profiling, optimization, or just distributed training. And then, finally, I'll just plug our data seminar series. This is an interesting set of seminars that we host from time to time; it isn't specifically limited to deep learning, but covers a variety of data-centric topics.
As always, our documentation is a great resource. And it looks like we have a little bit of time left, so if you have questions, you should go ahead and put them into the Google Doc now. If you're interested in the hands-on exercises, which will be done in this afternoon's session — I think starting around 12:30 —
A
Then
you
can
stick
around
for
this
last
portion
of
the
talk.
So
in
this
last
portion,
I'm
gonna
just
quickly
go
over
some
background
for
these
hands-on
exercises.
The setup in these simulations is that we have dark matter, which is abundant and essential to structure formation, but we can't see dark matter at all, so we need to model the observables from actual visible matter. That comes from luminous gas, or galaxies, that are also forming in this large-scale structure. This so-called cosmic web forms mostly from dark matter coalescing, but then on smaller scales we have gas dynamics that are affected by hydrodynamic interactions — pressure, temperature, and so on actually affect the flux, or light, that we observe in the universe.

To get this observable field modeled correctly, we need to model both the large-scale structure and the fine detail of the gas dynamics. Unfortunately, the gas dynamics are very expensive to compute, so modeling the full combined system of N-body and hydro fields is very computationally demanding: it requires a complex multiphysics fluid solver that runs on HPC resources, and it can take many, many compute hours to resolve one of these simulations if you're running at high resolution for a long time. A simpler option is to just model the N-body — the dark matter — simulation. With this one you can ignore all the complex hydrodynamics, still capture roughly the large-scale structure, and get a decent estimate. So it's been a long-standing goal to reconstruct the hydro fields from the N-body input.
So that's the science background. What we're going to do here is try to do that reconstruction process with a deep neural network. We're going to use a U-Net architecture. U-Nets are nice models because you have a series of sequential convolutions that extract features and downsample the spatial size of the input sequentially.
A
So
it's
sort
of
extracting
more
and
more
global
features
by
the
time
you
get
to
the
bottom
of
the
unit
and
then
at
this
stage
you
want
to
start
up
sampling
back
to
your
original
spatial
resolution,
so
you
can
actually
make
a
prediction
right
or
or
make
a
reconstruction
of
the
hydro
field
that
we
are
interested
in
and
to
do
that,
we
just
have
another
series
of
convolutions
with
upsampling
in
them.
The
key
thing
in
a
unit.
A
That's
helpful
for
getting
these
high
resolution
features
resolved
is
the
skip
connections
which
basically
just
copy
the
extracted
features
at
each
spatial
scale
across
the
network
to
the
other
side,
so
that
the
information,
the
sort
of
high
frequency
or
high
resolution
information
isn't
lost.
When
you
go
down
to
this,
the
bottom
of
the
unit.
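To make the skip-connection idea concrete, here is a toy PyTorch sketch of a single U-Net level — 2D and tiny for brevity, whereas the actual hands-on model is larger and 3D:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One down/up level with a skip connection, just to show the pattern."""
    def __init__(self):
        super().__init__()
        self.down = nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1)  # downsample
        self.up = nn.ConvTranspose2d(8, 8, kernel_size=2, stride=2)      # upsample
        self.out = nn.Conv2d(8 + 1, 1, kernel_size=3, padding=1)         # after concat

    def forward(self, x):
        features = torch.relu(self.down(x))
        upsampled = self.up(features)
        skip = torch.cat([upsampled, x], dim=1)  # skip connection: reuse input-scale features
        return self.out(skip)

y = TinyUNet()(torch.randn(1, 1, 32, 32))  # -> shape (1, 1, 32, 32)
```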
With these big 3D simulations, the data volume is kind of a challenge. There are four input fields and four output fields for this example, and the spatial grid is very large, at least by deep learning standards — 1024³ or 2048³ is pretty big — and it's hard to fit that, plus a model, plus the optimizer utilities and so on for training, all on a single GPU.
A
So
what
we
do
is
train
with
smaller
crops
or
sub
volumes,
and
the
way
that
looks
like
is
is
pretty
simple:
we
just
select
a
crop
out
of
the
simulation
for
our
input
and
feed
it
to
our
network
and
compare
it
and
get
our
prediction,
and
then
we
compare
that
to
the
corresponding
crop
from
our
target
simulation
and
a
special
sort
of
addition.
We
have
here
for
this.
Workflow
is
using
some
extra
data
augmentations
by
just
in
addition
to
randomly
cropping
a
sample.
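A minimal sketch of that random-crop-plus-augmentation idea — the shapes, crop size, and flip axis are illustrative assumptions, not the values from the hands-on code:

```python
import numpy as np

def random_crop_pair(inp, tgt, size=64):
    """Take the same random sub-volume from the input and target fields.

    inp, tgt: arrays of shape (channels, X, Y, Z).
    """
    _, X, Y, Z = inp.shape
    x0, y0, z0 = (np.random.randint(0, d - size + 1) for d in (X, Y, Z))
    sl = (slice(None), slice(x0, x0 + size),
          slice(y0, y0 + size), slice(z0, z0 + size))
    inp_c, tgt_c = inp[sl], tgt[sl]
    if np.random.rand() < 0.5:                 # random flip augmentation
        inp_c = np.flip(inp_c, axis=1).copy()  # flip input and target identically
        tgt_c = np.flip(tgt_c, axis=1).copy()
    return inp_c, tgt_c
```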
The code today is all PyTorch-based. PyTorch is Pythonic — it's pretty easy to integrate with other Python code — and, as we said, it's generally performant out of the box with these optimized libraries from NVIDIA, and it has good support for distributed training. Today we're going to be using GPUs, so we're going to opt for the NCCL backend for communication during distributed training.
This link right here is the link to all of our example code. I'm not going to go through the README, because it has everything you need to know in it, and throughout the README we've also pointed links back to our original SC21 tutorial material if you want more details. For accessing Perlmutter for this hands-on, we're going to recommend that you use the JupyterHub, and again, there are instructions for that in the README.