SC20 Deep Learning at Scale Tutorial
https://github.com/NERSC/sc20-dl-tutorial/

So first, what are hyperparameters, and why do we care about them? When you think about machine learning models, there are a couple of types of parameters to consider. First, you have your model parameters. These are the internal values of the model, like the weights in a neural network, that you learn by training your model against a dataset. The other type of parameter that we're interested in and concerned with here is the external values, the hyperparameters of the model, which determine the capacity of the model to learn and how the model learns.

So why do we care so much about hyperparameters? Well, it turns out that finding a good set of hyperparameters for your deep learning model can have a very big impact on everything from your final accuracy to the time to converge, and it can also help prevent two significant problems with machine learning models: overfitting and underfitting.

A
A
So,
given
this,
how
are
hyperparameters
typically
optimized,
so
one
method
is
the
manual
or
by
hand
method.
This
is
often
guided
by
intuition
and
various
rules
of
thumbs
where
hyperparameters,
essentially
selected
and
tuned
manually,
based
on
the
the
knowledge
and
the
intuition
of
the
of
the
machine,
learning
engineer
or
data
scientist
working
on
training
the
model.
Another technique that's starting to become more popular, with more tools appearing around it, is automatic hyperparameter optimization. Clearly, a brute-force search of your entire search space is intractable: in a typical model there are too many hyperparameters and too many possible values of those hyperparameters. So instead, these automatic hyperparameter optimization techniques focus on evaluating a subspace of possible hyperparameters.

But first, why are we talking about hyperparameter optimization at Supercomputing? Well, it turns out that, much as in the previous portions of this tutorial where we talked about why HPC systems are great for training deep neural networks, they're also great for hyperparameter optimization, because you're dealing here with a very large search space of many different types of hyperparameters.

There are many different varieties of hyperparameters as well, from integer to categorical to continuous, and evaluating these hyperparameters can be expensive. To evaluate a specific point within our search space of hyperparameters means training a neural network with those hyperparameters and evaluating the accuracy, or the time to convergence, or whatever your fitness metric of choice is for the optimization problem.

In addition, if you're going to be training at large scale on an HPC system, it's important to understand that the ideal hyperparameters also vary with scale. You don't necessarily want to do hyperparameter tuning at small scale and then train at large scale, because some hyperparameters, in particular the learning rate, can vary with your effective batch size, which also typically varies with the scale you're training at. In addition, your cost function is not necessarily smooth or continuous in many regions. This is a very complex search problem; it requires multiple…

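To make the batch-size dependence concrete, one widely used heuristic, the linear scaling rule, grows the learning rate in proportion to the effective batch size. This rule is an illustrative example rather than something from the tutorial itself, and the base values below are made up:

```python
# Minimal sketch of the linear learning-rate scaling heuristic: learning rate
# grows in proportion to the effective (global) batch size. Illustrative only.

def scaled_lr(base_lr, base_batch_size, per_gpu_batch_size, num_gpus):
    """Scale the learning rate linearly with the effective batch size."""
    effective_batch_size = per_gpu_batch_size * num_gpus
    return base_lr * effective_batch_size / base_batch_size

# Tuned at small scale: lr=0.1 with a global batch of 256.
# Moving to 64 GPUs at 32 samples each (effective batch 2048) suggests lr=0.8.
print(scaled_lr(base_lr=0.1, base_batch_size=256, per_gpu_batch_size=32, num_gpus=64))
```
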
So what are some of the techniques used for automated hyperparameter optimization? Probably the simplest, baseline technique for automated hyperparameter optimization is grid search. Here, essentially, you can think of each of the hyperparameters you're interested in tuning as an axis of some n-dimensional grid; you pick some step size along each axis, and you evaluate all the points at those individual grid points.

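A minimal sketch of that idea, with a stand-in scoring function in place of actual model training:

```python
# Grid search: each hyperparameter is one axis of an n-dimensional grid, and
# every grid point is an independent (parallelizable) evaluation.
# train_and_score is a placeholder for training a model and returning a fitness.
import itertools

grid = {
    "lr": [1e-4, 1e-3, 1e-2, 1e-1],           # step size along each axis
    "dropout": [0.0, 0.25, 0.5],
    "optimizer": ["sgd", "adam"],              # categorical axes work too
}

def train_and_score(lr, dropout, optimizer):
    return -(lr - 1e-2) ** 2 - (dropout - 0.25) ** 2  # placeholder fitness

best = max(
    (dict(zip(grid, values)) for values in itertools.product(*grid.values())),
    key=lambda point: train_and_score(**point),
)
print(best)  # 4 * 3 * 2 = 24 evaluations; the count multiplies with each new axis
```
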
This is simple to understand and clearly easy to parallelize, because it's a bunch of independent evaluations and you know exactly which evaluations you're going to do. The disadvantage, though, of course, is the curse of dimensionality: as you start adding lots of hyperparameters, this quickly explodes in complexity, so it can be very computationally expensive. Nevertheless, this is a good baseline automated HPO technique that we can use to compare other automated HPO techniques against.

Another strategy that's often discussed and sometimes used is random search. With random search, you're just picking points at random. Once again, this is easily parallelizable: you can have multiple nodes going off and running different evaluations, each just randomly picking values of the hyperparameters.

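And a matching sketch of random search over the same kind of space, again with a stand-in scoring function:

```python
# Random search: each trial draws every hyperparameter at random, so trials
# are independent and trivially parallel across nodes.
import random

def sample_point(rng):
    return {
        "lr": 10 ** rng.uniform(-4, -1),           # continuous, log-uniform
        "dropout": rng.uniform(0.0, 0.5),          # continuous
        "optimizer": rng.choice(["sgd", "adam"]),  # categorical
    }

def train_and_score(point):
    return -(point["lr"] - 1e-2) ** 2 - (point["dropout"] - 0.25) ** 2  # placeholder

rng = random.Random(0)
trials = [sample_point(rng) for _ in range(24)]    # same budget as the grid above
best = max(trials, key=train_and_score)
print(best)
```
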
It turns out it's more efficient than grid search, and I'll get into a little more detail about why that is on the next slide. However, it still tends to be more computationally expensive than some of the more advanced techniques, because it's not really using any intelligence to narrow down the search space.

A
It's
just
randomly
picking
points
I
mentioned
on
the
previous
slide
that
random
search
was
more
efficient
than
grid
search
as
far
as
sampling
of
hyperparameters.
So
why
is
that?
Well,
if
you
look
at
the
simple
example
in
this
diagram
here,
where
we
have
one
hyperparameter
say
that's
more
important
than
another
hyperparameter.
There's also an interesting probabilistic result about this, which basically says that if you have any distribution with a finite maximum, and you take 60 random observations from it, there's a 95 percent chance that at least one of those will be within a five percent window, a five percent range, around the actual point that has the maximum value. So if your fitness metric is somewhat smooth, typically, this means there's a 95 percent chance that you'll find at least one fairly good set of hyperparameters.

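The arithmetic behind that claim: if a single random draw lands in the top five percent of the space with probability 0.05, then all n independent draws miss it with probability 0.95^n, and for n = 60 the hit probability already exceeds 95 percent:

```python
# Checking the "60 random samples" claim: the chance that at least one of n
# independent draws lands in a region of probability mass 0.05 is 1 - 0.95**n.
p_hit = 1 - 0.95 ** 60
print(f"{p_hit:.3f}")  # 0.954, i.e. just over 95 percent for n = 60
```
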
So the idea here, taking genetic algorithms and applying them to hyperparameter optimization, is that we treat each hyperparameter as essentially a gene, and a distinct set of hyperparameters that you may evaluate to train a model is considered an individual organism within a population.

The nice thing about genetic algorithms is that they're a great way to combine a couple of priorities that you have when you're doing an optimization problem: exploitation, which is narrowing in on promising areas of your search space, and exploration, which addresses the possibility that you may be stuck in a local maximum.

So you also want to go off and explore other regions of the search space as well, and genetic algorithms are nice in that they have dials you can tune to play with that trade-off between exploration and exploitation. For example, there's a mutation rate: when you combine two parents, there's a probability that each of the genes will mutate, which, of course, encourages further exploration of the search space.

In addition, you often have separate populations: you may have different founders for different populations, and those different populations will go off and search perhaps different areas of your search space. Then, periodically, you'll take good individuals from each of these populations and combine them, getting even further exploration by combining individuals which may have optimized very different areas of the search space. So here is the basic cycle within the genetic search.

For each population, you start with a founder and apply a number of mutations to get your initial population; then you repeat the following process for however many generations you want to search. You start by evaluating each of the members of the population and selecting the ones that are the most fit; from some set of those most fit members you apply reproduction to create new children; those children become your next generation; and you repeat the process.

A
Next,
we
evaluate
the
fitness
of
the
individuals
in
that
population
and
we
select
some
of
the
most
fit
individuals
in
this
case.
The
green
points
we
then
choose
pairs
of
those
particularly
fit
individuals
apply
reproduction,
which
is
crossover
so
combining
of
hyper
parameters
from
the
different
individual
from
the
different
individuals,
as
well
as
potentially
mutation,
depending
on
the
mutation
rate.
We select another set of parents and create another child, then another set of parents and another child, and repeat until we have produced our next generation. We then repeat the process, evaluating the fitness of that generation, and keep going, reproducing, evaluating fitness, selecting, and so on, until we hopefully end up with a population of individuals that all have a high fitness level.

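Putting the pieces of that cycle together, here is a minimal sketch of genetic hyperparameter search, assuming a toy two-hyperparameter space and a stand-in fitness function in place of real model training:

```python
# Genetic search cycle: evaluate fitness, select the most fit, then reproduce
# via crossover plus occasional mutation. All functions are illustrative stand-ins.
import random

rng = random.Random(0)
GENES = {"lr": (1e-4, 1e-1), "dropout": (0.0, 0.5)}  # hyperparameter ranges
MUTATION_RATE = 0.1

def random_individual():
    return {g: rng.uniform(lo, hi) for g, (lo, hi) in GENES.items()}

def fitness(ind):  # stand-in for training a model with these hyperparameters
    return -(ind["lr"] - 1e-2) ** 2 - (ind["dropout"] - 0.25) ** 2

def reproduce(a, b):
    child = {g: rng.choice([a[g], b[g]]) for g in GENES}  # crossover
    for g, (lo, hi) in GENES.items():
        if rng.random() < MUTATION_RATE:                  # mutation
            child[g] = rng.uniform(lo, hi)
    return child

population = [random_individual() for _ in range(16)]     # founder mutants
for generation in range(10):
    fit = sorted(population, key=fitness, reverse=True)[:4]            # selection
    population = fit + [reproduce(*rng.sample(fit, 2)) for _ in range(12)]

print(max(population, key=fitness))
```
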
The basic idea behind this is that you're typically applying genetic search within an epoch, taking checkpoints, then selecting the best values determined for each epoch, restoring the checkpoint from that, and reapplying genetic search to calculate the best hyperparameters for the following epoch. We can see an example of this here, where we had original values for the learning rate and for weight decay, and then, using population based training, we learned a new schedule over time.

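As a rough illustration of that per-epoch cycle, here is a compact population-based-training sketch; the training function, weights, and perturbation factors are all made-up stand-ins for real training and checkpointing code:

```python
# Per-epoch population based training: train each member for one epoch,
# checkpoint, copy the best member's weights with perturbed hyperparameters
# into the rest of the population, and repeat.
import copy
import random

rng = random.Random(0)

def train_one_epoch(weights, hp):
    """Stand-in for real training: returns (new_weights, fitness_score)."""
    new_weights = {k: v + hp["lr"] for k, v in weights.items()}
    return new_weights, -abs(hp["lr"] - 0.01)   # fake fitness, peaks at lr=0.01

population = [({"w": 0.0}, {"lr": 10 ** rng.uniform(-4, -1)}) for _ in range(8)]

for epoch in range(5):
    results = []
    for weights, hp in population:
        new_weights, score = train_one_epoch(weights, hp)  # checkpoint = new_weights
        results.append((score, new_weights, hp))
    results.sort(key=lambda t: t[0], reverse=True)
    _, best_weights, best_hp = results[0]
    population = [
        (copy.deepcopy(best_weights),                      # restore best checkpoint
         {"lr": best_hp["lr"] * rng.uniform(0.8, 1.25)})   # perturb: new schedule
        for _ in range(8)
    ]

print(results[0][0], results[0][2])
```
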
So, as I mentioned, there are a number of tools out there for doing automatic hyperparameter optimization. I've just highlighted three here that I'm most familiar with. There is, of course, the HPE Cray AI hyperparameter optimization library, developed by HPE Cray. This is currently available on the Cori system at NERSC and on many other HPE Cray systems.

A
It
has
a
python
front-end
python,
apis
to
interface
with
the
library
and
it
runs
on
the
back
end
using
chapel.
This
is
all
compiled
down,
though
you
have
no
need
to
download
chapel,
runtimes
or
anything
to
run
with
us.
We
just
utilize
chapel
for
its
high
performance,
distributed,
computation
and
and
ease
of
programming.
A
This
is
designed
really
for
doing
hyper
parameter,
optimization
on
high
performance
computing
systems,
as
you
would
expect
from
something
from
cray
there's.
Also
the
dpiper
library
developed
by
argon
and
available
on
their
theta
system,
also
built
for
hpe
systems
and
the
ray
tune.
Hbo
library
built
on
top
of
the
ray
framework
and
I've
got
links
here
for
for
both
of
those.
This library supports both distributed optimization, as in running many different trials of different hyperparameters in parallel on many different nodes of your system, and distributed optimization with distributed training, so you can not only run many evaluations but also have each of those evaluations run at scale. It's fairly simple to use, and the steps are common to most of the automated HPO libraries out there. You create a wrapper script, typically in Python, that imports your hyperparameter optimization library; you define the optimizer you want to use and any parameters to it; and you then define the hyperparameters that you're interested in tuning. Here we see the learning rate and the dropout rate, with default values and ranges to explore for each. You also create a training script, and that's what's referred to here by the evaluator. This is typically just the deep learning training script you would have used otherwise, with a couple of minor modifications, which I'll show on a later slide.

If you look here at the model training script itself, these are the ways you need to change it in order to interface with the wrapper script and the HPO library. You need to make sure that the hyperparameters you're interested in tuning are exposed to the hyperparameter optimization library in the wrapper script, which is done by adding them as command line arguments. Of course, make sure you actually use those hyperparameter command line arguments in your training. Then you just need to add something that prints out a figure of merit. This is the function you're trying to optimize, so it could be time to convergence, it could be accuracy, it could be whatever you want.

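A minimal sketch of those two modifications, with a stub in place of your real training loop; the "FoM:" output format is just an illustration of printing a figure of merit in a form the wrapper can parse:

```python
# Two changes to the training script: (1) expose the tunable hyperparameters
# as command line arguments, (2) print a figure of merit for the HPO library.
import argparse

def run_training(lr, dropout):
    """Stand-in for your existing training loop; returns validation loss."""
    return (lr - 0.01) ** 2 + dropout * 0.1

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--dropout", type=float, default=0.5)
args = parser.parse_args()

val_loss = run_training(lr=args.lr, dropout=args.dropout)
print(f"FoM: {val_loss}")  # the optimizer reads this value from stdout
```
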
A
And
then,
as
I
mentioned
before,
you
also
have
to
create
a
wrapper
script
and
we
see
in
here
we
have
our
parameters.
We
have
our
genetic
optimizer.
We
have
the
various
parameters
to
that
genetic
optimizer.
These
are
sometimes
referred
to
as
your
meta
parameters
to
distinguish
between
them
from
the
model
parameters
and
the
hyper
parameters,
and
then
once
you've
created
your
evaluator,
your
parameters
and
your
optimizer,
you
just
run
optimizer.optimize.
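For illustration, a sketch of what such a wrapper script might look like. The flow (parameters, evaluator, genetic optimizer, then optimizer.optimize) follows the slide, but the exact module names and constructor arguments below are assumptions rather than verified crayai signatures; consult the library documentation:

```python
# Sketch of the wrapper-script flow described above. Module path and argument
# names are assumed, not verified against the crayai API.
from crayai import hpo  # assumed import path

# Hyperparameters: name, default value, and the range to explore.
params = hpo.Params([["--lr",      0.01, (1e-4, 1.0)],
                     ["--dropout", 0.5,  (0.0, 0.9)]])

# The evaluator wraps your (lightly modified) training script.
evaluator = hpo.Evaluator("python train.py")

# Meta-parameters: settings of the genetic optimizer itself, distinct from
# the model parameters and the hyperparameters being tuned.
optimizer = hpo.GeneticOptimizer(evaluator,
                                 pop_size=16,
                                 generations=10,
                                 mutation_rate=0.05)

optimizer.optimize(params)
```
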
There are a couple of new features that we've recently been working on and have added to the HPE Cray AI hyperparameter optimization library. One of them is the ability to do extrapolation and early termination. This is useful to prevent the hyperparameter search from wasting time on evaluations that are clearly not going to do well. So we have the ability to stop an evaluation early if it doesn't meet a specific threshold; you can also set multiple durations and multiple thresholds that it may need to meet at different times.

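A small sketch of that threshold idea, with made-up intervals and thresholds:

```python
# Threshold-based early termination: check the running loss against one or
# more (interval, threshold) pairs and stop any evaluation that misses one.
def should_terminate(losses, schedule):
    """losses[i] is the loss after interval i+1; schedule maps interval -> max loss."""
    for interval, threshold in schedule.items():
        if len(losses) >= interval and losses[interval - 1] > threshold:
            return True
    return False

# Loss after 3 intervals is 1.2, above the 1.0 cutoff, so terminate early.
print(should_terminate([2.0, 1.5, 1.2], {3: 1.0, 6: 0.5}))  # True
```
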
There's also the ability to do extrapolation. In this example I show linear extrapolation; you can also do degree-2, degree-3, etc. extrapolation. The idea is that in this particular call we're saying that after three intervals, we want to extrapolate what our fitness metric will be after eight intervals, using a degree-one (linear) extrapolation, and we want to see whether the extrapolated loss after eight intervals is below the given threshold; if not, we terminate early.

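A sketch of that extrapolation logic using a simple polynomial fit; the observed losses and the cutoff are made-up values:

```python
# Extrapolation-based early termination: fit a degree-1 polynomial to the loss
# over the first three intervals, predict the loss at interval eight, and
# terminate if the prediction is still above a threshold.
import numpy as np

observed = [0.9, 0.8, 0.7]                       # loss after intervals 1..3
coeffs = np.polyfit([1, 2, 3], observed, deg=1)  # deg=2, 3, ... also possible
predicted = np.polyval(coeffs, 8)                # extrapolated loss at interval 8
print(predicted)                                 # 0.2 for this linear trend

threshold = 0.3                                  # illustrative cutoff
if predicted > threshold:
    print("terminate this evaluation early")
```
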
A
Another
thing
we've
been
working
on
recently
is
the
addition
to
to
the
hpe
creai
of
a
analytics
module
which
allows
you
to
better
study
and
examine
the
results
of
your
hyper
parameter
optimization.
So
you
can
study
the
relationship
here
between
either
scale
and
a
specific
hyper
parameter.
So
we
see
here
we're
examining
the
relationship
between
the
number
of
nodes
we
trained
on
and
the
learning
rate
with
the
color
there
indicating
that
the
fitness
metric.
This
could
also
be
changed
to
examine
the
relationship
between
two
different
type
of
parameters.