Description
More about this lecture: https://dl4sci-school.lbl.gov/richard-liaw
Deep Learning for Science School: https://dl4sci-school.lbl.gov/agenda
Hi, my name is Richard. I'm a software engineer at Anyscale. Previously, I was a PhD student at UC Berkeley, working on machine learning systems and cloud computing. During my PhD, I primarily worked on Ray, which is a distributed computing library in Python targeting AI applications. So today, I'll be talking to you about hyperparameter tuning.
Let's begin with some context about machine learning. Today, machine learning, and specifically deep learning, is experiencing rapid growth and adoption. This is happening not only in academia but also in industry, where more and more applications, such as speech recognition and autonomous vehicles, are leveraging deep learning.
Despite this rapid growth and adoption, all deep learning practitioners know about one dirty secret, and that is the reliance on hyperparameter tuning. Let me give an example. I'll talk quickly about convolutional neural networks. These neural networks are very powerful and are credited with many of the recent advances in computer vision.
Here on the screen, we have one of the original convolutional neural network designs, by Yann LeCun. On the bottom, we have a more modern design called AlexNet, also a convolutional neural network, developed after 20 years of research. What's interesting is that the basic ideas have remained largely unchanged over those 20 years. We've actually simply modified the shape of the neural network and the size of each layer, and as a result we have this new wave of deep learning research that is so active today.
A
These
hyper
parameters
can
clearly
make
a
huge
difference
in
the
performance
of
these
models.
Now,
what's
making
matters.
Worse
is
actually
two
common
trends
that
we're
seeing
the
first
one
is
that
models
are
getting
larger
and
larger.
With
the
most
recent
open,
ai
gpt3
model
containing
nearly
200
building
parameters,
these
state-of-the-art
models
are
not
only
larger,
but
they're
also
more
complex.
So,
as
you
see
on
the
screen,
we
have
another
famous
recent
language
model
containing
over
a
dozen
hybrid
parameters
that
you
have
to
tune.
Some of these techniques are applicable to many traditional machine learning methods, and some of them have much more significance and importance in the regime of deep learning. For the sake of clarity, I'll be using the word "trial" quite often in the rest of this talk. In the hyperparameter tuning literature, a trial typically means one configuration evaluation: essentially, one sample of the hyperparameters that we plan to evaluate.
So, specifically, on the screen we have some pseudocode: we're essentially taking a cross product across all of the listed values for each of the different hyperparameter dimensions.
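To make that concrete, here's a minimal sketch of grid search in plain Python. The search space values and the `evaluate` function are hypothetical stand-ins for your own training code.

```python
import itertools

# Hypothetical search space: a few listed values per hyperparameter dimension.
search_space = {
    "lr": [0.01, 0.05, 0.1],
    "batch_size": [32, 64],
    "num_layers": [2, 3, 4],
}

def evaluate(config):
    # Hypothetical stand-in: train a model with `config`, return val accuracy.
    return -abs(config["lr"] - 0.05) + 0.01 * config["num_layers"]

best_score, best_config = float("-inf"), None
# Grid search: evaluate the full cross product of all listed values.
for values in itertools.product(*search_space.values()):
    config = dict(zip(search_space.keys(), values))
    score = evaluate(config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)
```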
However, on the right-hand side we have another technique: random search. Random search is able to provide good coverage over the hyperparameter space, allowing us to actually reach the optimal point of an important parameter, in exchange for the ability to do a structured analysis of the hyperparameter tuning grid.
So random search is just what it sounds like. You have a distribution for each hyperparameter, and you sample parameters from these distributions over and over again, eventually finding the best model. Again, there are a couple of benefits to doing random search. One is that it's easily parallelizable, because each evaluation is independent of the others. And second, it turns out that in high dimensions, random search is actually very hard to beat.
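As a minimal sketch, reusing the same hypothetical `evaluate` function from the grid search example, random search looks like this:

```python
import random

def sample_config():
    # One independent sample from each hyperparameter's distribution.
    return {
        "lr": 10 ** random.uniform(-2, -1),  # log-uniform over [0.01, 0.1]
        "num_layers": random.randint(2, 5),
        "batch_size": random.choice([32, 64]),
    }

# Because every trial is independent, this loop is trivially parallelizable.
results = [(evaluate(cfg), cfg) for cfg in (sample_config() for _ in range(50))]
best_score, best_config = max(results, key=lambda r: r[0])
```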
A
One
problem
with
random
search
is
that
you
lose
the
ability
to
have
explainable
hyperparameter
tuning
space
that
you
evaluated
and
second
there's
a
couple
things
that
we
can
do
better
since
random
search
is
actually
quite
ineffective
and
expensive
or
inefficient
and
expensive
after
all,
you're
trying
things
at
random.
So what if we used prior information from evaluated training runs to guide our tuning process? Well, this is what Bayesian optimization and other model-based optimization processes do. I'll spare you the details and the mathematics of Bayesian optimization; I'll just simply provide a very high-level overview of how this sort of model-based optimization works.
So we essentially construct an optimizer that is aware of the search space. In this particular example, we have a range for the learning rate, from 0.01 to 0.1, and we have a range for the number of neural network layers that we want to evaluate, from two to five.
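As one concrete, hedged example, scikit-optimize's `gp_minimize` can play the role of such a search-space-aware optimizer; the objective below is a hypothetical stand-in for a real training run.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

# The search space from the slide: lr in [0.01, 0.1], layers in [2, 5].
space = [
    Real(0.01, 0.1, prior="log-uniform", name="lr"),
    Integer(2, 5, name="num_layers"),
]

def objective(params):
    lr, num_layers = params
    # Hypothetical: train a model and return the validation loss to minimize.
    return (lr - 0.03) ** 2 + 0.01 * num_layers

# The Gaussian-process surrogate uses past evaluations to pick each next sample.
result = gp_minimize(objective, space, n_calls=20, random_state=0)
print(result.x, result.fun)
```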
However, because it's inherently sequential, the benefit of parallelization decreases significantly as you add more workers. So now that we understand Bayesian optimization, there's actually still room to do better.
Well, there's a hyperparameter tuning technique that addresses this precisely, most famously known as HyperBand. HyperBand and its variants, including ASHA, successive halving, and so on, form a family of algorithms that are essentially early stopping algorithms.
What does that mean? It means that these algorithms aim to allocate resources to better-performing trials and to reduce the resources, or the time, spent evaluating bad trials. So let's quickly walk through some pseudocode. Similar to random search, we'll sample from the hyperparameter search space, and we'll evaluate the resulting trial, or model, for a maximum number of epochs (or steps, or iterations). At every step, all the trials are compared against each other, and if a particular trial is in the top fraction of trials at, say, five epochs, then we will continue training that trial. Otherwise, we're going to pause it and release the resources allocated to it, so that another, perhaps more promising, trial can make use of those resources.
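Here's a minimal sketch of that idea with a single halving round at five epochs; the `sample_config` and `train` helpers are hypothetical stand-ins for your own code.

```python
def successive_halving(num_trials=8, check_at=5, max_epochs=20, keep_frac=0.5):
    # Sample one configuration per trial, as in random search.
    trials = [{"config": sample_config()} for _ in range(num_trials)]
    # Train every trial for the first few epochs only.
    for t in trials:
        t["score"] = train(t["config"], epochs=check_at)  # hypothetical
    # Keep the top fraction; the rest are paused and their resources freed.
    trials.sort(key=lambda t: t["score"], reverse=True)
    survivors = trials[: max(1, int(num_trials * keep_frac))]
    # Promising trials continue training to the full budget.
    for t in survivors:
        t["score"] = train(t["config"], epochs=max_epochs)  # resumed in practice
    return max(survivors, key=lambda t: t["score"])
```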
So essentially, what's happening is: if a trial isn't performing very well, we're not going to evaluate it anymore, and if it's performing very well, if it's a very promising hyperparameter configuration, then we're going to keep evaluating it until the end. There have also been recent advances that have made HyperBand capable of being combined with Bayesian optimization. HyperBand is also nice because it's easily parallelizable, which actually improves its efficiency. But there's actually still some more room for improvement: it turns out that in deep learning, hyperparameter schedules matter a lot.
This is the idea behind Population Based Training (PBT). We will terminate low performers, similar to these early stopping methods, but the best trials we will continue training, and we will use them as templates to replace the terminated low performers. When these templates are used, they are essentially cloned, and their hyperparameters are mutated, so they are perturbed in some way. This effectively allows us to search over hyperparameter schedules, and it's also efficient, in that it terminates bad performers.
We'll start off with four different trials, say with four different values of the learning rate, from 0.1 to 0.4. We'll train them for, say, one epoch, and at one epoch we will run an evaluation across all of the trials that are running. So let's say it turns out that 0.4 is the worst-performing trial of the four.
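Ray Tune, which I'll introduce later in this talk, ships a PBT scheduler. Here's a hedged sketch of this four-trial learning-rate example, where `train_fn` is a hypothetical Tune trainable that reports `val_acc` each epoch:

```python
import random

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="val_acc",
    mode="max",
    perturbation_interval=1,  # run the exploit/explore step after every epoch
    # Low performers are replaced by clones of top performers, with lr mutated.
    hyperparam_mutations={"lr": lambda: random.uniform(0.1, 0.4)},
)

analysis = tune.run(
    train_fn,  # hypothetical trainable that reports val_acc via tune.report
    config={"lr": tune.choice([0.1, 0.2, 0.3, 0.4])},
    scheduler=pbt,
    num_samples=4,  # a population of four trials, as in the example
)
```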
Obviously this isn't perfect, but it actually performs quite well in practice. When DeepMind published this PBT work, they compared the technique against multiple previously published algorithms, and they found that, across the board, PBT was able to provide a non-trivial performance increase over the state of the art.
Why? Well, it turns out that in modern deep learning models, as presented at the beginning of the talk, there are dozens of hyperparameters that you can tune. And so we have here, again, this very famous language model, RoBERTa, and it has more than a dozen hyperparameters.
So this means that you want to effectively choose which hyperparameters you're searching over; again, choosing the hyperparameter space itself is an important decision. So you might be asking yourself: okay, well, I know that one, or two, or three of these things are incredibly important, but I have a list of 20. How do I choose my hyperparameter space? How do I choose the right hyperparameters to evaluate in the first place?
Well, my second tip for you is that you should make use of the available tools for visualizing and understanding your hyperparameter tuning landscape. A common tool that researchers use, especially at well-resourced places such as Google, is this parallel coordinates tool. It helps you visualize multiple dimensions at once, which is hard to do in, say, a 2D or 3D graph.
So here is a graphical representation of how that might work. Typically, these parallel coordinates tools allow you to filter out particular runs and identify different relationships between multiple hyperparameters at once. This, in turn, allows you to better inform how you structure your search in this sort of iterative process.
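As a hedged illustration, pandas can draw a basic parallel coordinates plot from a table of trial results; the data below is randomly generated purely for demonstration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

# Fake trial results, purely for illustration: one row per trial.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "lr": rng.uniform(0.01, 0.1, 30),
    "num_layers": rng.integers(2, 6, 30),
    "val_acc": rng.uniform(0.6, 0.95, 30),
})
# Color each line by how well the trial did.
df["result"] = pd.cut(df["val_acc"], bins=3, labels=["low", "mid", "high"])
parallel_coordinates(df.drop(columns="val_acc"), class_column="result",
                     colormap="viridis")
plt.show()
```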
There are many tools that provide this sort of tuning visualization, such as TensorBoard, Weights & Biases, Comet, and Neptune.
So there are multiple things that you can do, and you will have to either engineer them yourself or look for a framework that does this for you. But typical things that you'd want to do to reduce overfitting and denoise your optimization inputs include: making sure you do cross-validation; making sure you evaluate the same hyperparameters across multiple seeds; and also considering different metrics in addition to accuracy, such as the gap between validation and training, model entropy, or even training versus validation loss.
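A hedged sketch of that kind of denoising wrapper, where `train_and_eval` is a hypothetical function returning per-run metrics:

```python
import statistics

def denoised_evaluate(config, seeds=(0, 1, 2)):
    """Score one hyperparameter config across several seeds to reduce noise."""
    runs = [train_and_eval(config, seed=s) for s in seeds]  # hypothetical
    val_acc = statistics.mean(r["val_acc"] for r in runs)
    # Also track the train/validation gap as an overfitting signal.
    gap = statistics.mean(r["train_acc"] - r["val_acc"] for r in runs)
    return {"val_acc": val_acc, "overfit_gap": gap}
```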
A
I'll
talk
about
ray
tune,
which
is
a
scalable
hyperparameter
tuning
developed
now,
primarily
at
any
scale,
but
previously
at
uc,
berkeley
and
ray
tune,
is
a
scalable
hyperparameter
tuning
library
that
works
with
any
machine
learning
framework.
So Ray Tune, specifically, is the library that handles the execution of hyperparameter search. It provides hooks to plug in different hyperparameter search algorithms, and it automatically handles the parallelism and scaling for you. Why is Tune special? Well, Tune is built with deep learning as a priority. Now, what does that mean? Specifically, Tune is built so that you can utilize and spread your training and tuning across multiple GPUs and across a cluster.
A
It
also
allows
users
to
tune
models
with
any
machine
learning
framework
and,
most
importantly,
tune
allows
you
to
run
high
performance
tuning
at
any
scale.
So
you
can
go
from
running
on
a
single
process
to
running
across
a
bunch
of
gpus
to
run
across
multiple
nodes.
All
without
changing
your
code,
as
mentioned
today,
hyperparameter
tuning
algorithms
are
very
important
to
leverage
so
tune
offers.
Ray Tune offers many algorithms to optimize your hyperparameter search, including all the algorithms mentioned today. Tune also integrates with a lot of open source hyperparameter optimization libraries, such as HyperOpt and the recent Ax library from Facebook, in addition to services such as SigOpt and others.
This increases the number of samples that you're going to take from the hyperparameter search space; specifically, we're setting that to one hundred. The parallelism that Tune will operate at is determined by the size of your cluster, so it automatically leverages all the cores available in your particular cluster.
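As a minimal sketch of what that looks like with Tune's function API (the training loop body is a hypothetical placeholder):

```python
from ray import tune

def train_fn(config):
    # Hypothetical training loop: report a metric back to Tune each epoch.
    acc = 0.0
    for epoch in range(10):
        acc += config["lr"] / config["num_layers"]  # placeholder "training"
        tune.report(mean_accuracy=acc)

analysis = tune.run(
    train_fn,
    config={
        "lr": tune.loguniform(0.01, 0.1),
        "num_layers": tune.randint(2, 6),
    },
    num_samples=100,  # take 100 samples from the search space
)
```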
A
Oftentimes
you'll
want
to
leverage
a
gpu
and
in
prior
torch
and
other
distributed
training
frameworks
or
hype
model
tuning
frameworks,
you'll
be
forced
to
handle
ugly
environment
variables
and
manual
device
placement
and
such
however
tune
is,
you
know,
built
for
deep
learning,
so
it
will
automatically
set
your
environment
variables,
isolate
your
training,
jobs
across
multiple
gpus,
allowing
you
to
paralyze
your
search
even
across
you
know,
multiple
machines
without
ever
setting
these
environment
variables
by
hand
due
to
narrow,
very
narrow
api,
essentially,
two
two
different
code,
two
different
api
calls
tune,
exposes
a
variety
of
features,
including
automatic
checkpointing,
and
specifying
different
tuning
algorithms.
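For example, as a sketch reusing the `train_fn` above, requesting a GPU per trial and plugging in an early-stopping scheduler is just two more arguments:

```python
from ray.tune.schedulers import ASHAScheduler

analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(0.01, 0.1), "num_layers": tune.randint(2, 6)},
    num_samples=100,
    resources_per_trial={"cpu": 2, "gpu": 1},  # Tune isolates a GPU per trial
    scheduler=ASHAScheduler(metric="mean_accuracy", mode="max"),  # early stopping
)
```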
A
So
and,
as
we
mentioned
above,
it's
incredibly
important
to
analyze
your
hyperparent
tuning
run
afterwards.
So,
if
you
wanted,
you
can
provide,
you
can
capture
the
the
results
in
a
data
frame
which
is
provided
to
you
automatically
so
that
you
can
analyze
different
training
results
across
all
the
different
models
that
you've
trained
and
all
the
different
hyperparameters
that
you
evaluated.
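Concretely, as a sketch (column names follow Tune's `config/...` convention):

```python
# After tune.run finishes, every trial's results are available as a DataFrame.
df = analysis.dataframe()
top = df.sort_values("mean_accuracy", ascending=False)
print(top[["config/lr", "config/num_layers", "mean_accuracy"]].head())
```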
A
In
addition,
we
talked
about
the
importance
of
visualization,
so
tune
automatically
generates
tensorboard
files
so
that
you
can
visualize
and
understand
your
training
with
different
scalar
graphs
and
parallel
coordinate
plots
so
at
a
very
high
level.
So, at a very high level, to recap: in this talk, we motivated the importance of, and highlighted the complexity of, hyperparameter tuning. We gave an overview of some of the state-of-the-art techniques for tuning hyperparameters. And finally, we talked about Ray Tune, which is a library built on top of Ray to simplify and scale hyperparameter tuning.
A
As
a
final
call
out,
we
are
hosting
a
race
summit
which
is
going
to
be
a
free
online
conference
on
september
30th
to
october
1st,
covering
workshops
and
different
tutorials
and
keynotes
on
all
sorts
of
different
scalable
machine
learning
and
skillable
python
topics.
So,
if
you're
interested
please
check
out
racesummit.org
thanks
for
listening,
if
you
have
any
questions,
your
feel
free
to
reach
out
to
me
on
twitter
or
or
at
my
email
and
happy
to
take
any
questions
now,
thanks.