From YouTube: 17 - Hyperparameter Optimization - Ben Albrecht
Description
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
All right, so we're going to dive right into hyperparameter optimization. I'm going to start with some background just to get all of us on the same page. This is relatively simple, maybe elementary compared to the material we've been covering already, but it makes sure we're using the same semantics. So what is a model parameter? Model parameters are values that live within a model and are determined from the data itself.
These are the model parameters that you're training. In a linear regression, they would be our slope and intercept, m and b. In a decision tree, they would be the splits in the tree that you're making in order to optimize your model to predict some data. And in neural networks,
they are the weights and biases that we've been learning about this week. That transitions us over to model hyperparameters. Model hyperparameters are values that are external to the model but that influence the model capacity; I'll elaborate a little more on what model capacity means in the next slide. But first: in linear regression we don't really have any hyperparameters, there are no knobs to turn in that procedure. In decision trees, we can think of the tree depth as a hyperparameter.
Okay, so what do I mean by model capacity? It's really a general term capturing how much we are overfitting or underfitting, both of which we want to minimize. It also covers our time to accuracy, how quickly the model trains to its desired accuracy, and, related to that, efficiency: the total CPU time required to achieve that accuracy.
We also have feature selection and model selection, both of which can be thought of as hyperparameters. So there's a spectrum of classes of hyperparameter optimization strategies: down at one end is what I'd consider traditional hyperparameter optimization, doing spatial HPO is more of a neural architecture search, and when you're doing all of these things together we consider it automated machine learning.
Okay, so for hyperparameters in deep learning: there are a lot of hyperparameters to deal with, a lot of knobs to turn. In fact there was a really good quote from earlier this week from one of the speakers: deep learning is a hyperparameter soup. That was from Josh in his talk, so I dedicate this slide to Josh.
And it's true, there are a lot of hyperparameters to deal with in deep learning. On the training side we have our optimizer, learning rate, momentum, and so on; I'm not going to read all of these off, but you've seen a lot of them throughout the week and have probably tried to tune some of them in your own models. We also have a lot of spatial hyperparameters to modify as well, so hyperparameter optimization in deep learning is a very high-dimensional problem.
In addition to the high dimensionality, there are a number of other challenges in deep learning hyperparameter optimization. We have the fact that hyperparameters can be continuous, categorical, or integer values, which poses some mathematical challenges. Computing gradients with respect to hyperparameters is pretty challenging to do; it's still an open research question.
The cost function being minimized only represents a sample of the performance, since we are typically optimizing our hyperparameters on only a subset of the data, as we should. Evaluations are expensive; that's just the nature of deep learning. And evaluation times can vary greatly depending on the choice of hyperparameters; one obvious example is the number of epochs we're using in our training.
So in many cases hyperparameters are selected and tuned manually, and that's okay a lot of the time; I'm sure many of you have been doing this in some of the hands-on sessions. It's typically guided by intuition or rules of thumb. That's fine in limited cases, but when you really want to find the optimal hyperparameters for your model, you want to move into the regime of automated HPO.
So let's go over some of the algorithms that decide how to explore the space of hyperparameters. I'll try to present these as a number of different categories of hyperparameter optimization. There is a class of HPO algorithms called exhaustive search algorithms; this includes grid, random, and genetic search, and I'm going to go through all of these in more depth.
There are also surrogate models, which really try to minimize the number of evaluations needed to reach the minimum, and early stopping algorithms, which exploit a property of hyperparameter optimization: you can approximate the value of a set of hyperparameters before the training is completed. And there are also gradient-based algorithms. I'm not really going to say much about these, but I just want to acknowledge that they exist and that interesting research is being done there.
Okay, the next strategy to discuss is random search. Random search, as the name implies, randomly samples hyperparameters from the hyperparameter space. It's embarrassingly parallel as well, and it exploits the fact that some hyperparameters matter more than others. This is an observation made in a famous paper by Bergstra and Bengio back in 2012, and pretty much every place you read on the internet that suggests using random search cites
this paper as the main motivation. There's a famous figure from that paper showing how grid search fails to explore the hyperparameter space when one hyperparameter matters much more than the other, whereas random search is much more successful with a smaller number of evaluations. In the HPO literature you'll see a lot of researchers refer to random search as frustratingly successful, because it's about the dumbest strategy you could possibly think of, but it's ridiculously good for how dumb it is.
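To make that concrete, here's a minimal sketch of random search over two hyperparameters. The search space and the train_and_evaluate stub are illustrative assumptions, not part of any particular framework:

```python
import random

def sample_config():
    # Hypothetical search space: log-uniform learning rate, uniform dropout.
    return {
        "lr": 10 ** random.uniform(-5, -1),   # sample the exponent, so lr is log-uniform
        "dropout": random.uniform(0.0, 0.5),
    }

def train_and_evaluate(config):
    # Stand-in for your real training run; returns a validation loss to minimize.
    raise NotImplementedError

def random_search(n_trials):
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):                 # each trial is independent -> embarrassingly parallel
        config = sample_config()
        loss = train_and_evaluate(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss
```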
Okay, the next strategy I want to talk about is genetic HPO. I'd like you to think of genetic algorithms applied to HPO as an automatic, iterative, stochastic grid search with pruning. What that really gets you is the best of both worlds between random search and grid search, while benefiting from the knowledge learned in previous iterations. Genetic algorithms in general excel at optimizing many parameters of varying importance, a property we also had with random search, and genetic HPO is inspired by biological systems found in nature.
Genetic HPO is also embarrassingly parallel per generation, though there is a sequential dependency across generations. Again, the biggest advantage is that each generation benefits from the data of the previous generation, so the work you're doing continues to pay off in the next generation. Here's a nice figure from a paper applying large-scale genetic HPO to image classifiers, and what's important to note here is this:
these are populations over time, and you can see that their accuracy rapidly jumps and then slowly converges to a high accuracy. If we were doing a random or grid search, you might imagine a lot more of this empty space being filled, but because the genetic HPO is learning from previous generations, we are doing a much smarter search here.
Population-based training is also a genetic approach, and it trains its hyperparameters during the model optimization itself, so it's one of the early stopping algorithms I mentioned earlier. The process goes as follows: we select random sets of hyperparameters and train multiple models in parallel, and then every n epochs we take the best models and hyperparameters and copy those over the worst models.
You can think of the worst models as the low-performing individuals in a population that die off in that generation. Then, if a model was copied over, we randomly perturb its hyperparameters; that's the mutation we saw in genetic hyperparameter optimization. This is a figure from Google DeepMind, from their original blog post on this, which was late 2017 I believe. You can see here we have two models with two sets of hyperparameters being trained.
You can see in the performance curves that one does better than the other. So we exploit that fact and copy its hyperparameters over to the other one, and then we do an exploration with the copy, perturbing its hyperparameters, while we leave the original alone and let it continue. At a later time you can see that the exploration ended up doing better than the original parent.
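As a rough sketch of that exploit/explore step (this is not DeepMind's implementation; the population layout, the perturbation factors, and the surrounding train/evaluate loop are illustrative assumptions):

```python
import copy
import random

def exploit_and_explore(population, perturb=(0.8, 1.2)):
    """One PBT step: copy the best member over the worst, then perturb the copy.

    population: list of dicts like {"weights": ..., "hparams": {...}, "score": float}
    """
    ranked = sorted(population, key=lambda m: m["score"], reverse=True)
    best, worst = ranked[0], ranked[-1]

    # Exploit: the worst member inherits the best member's weights and hyperparameters.
    worst["weights"] = copy.deepcopy(best["weights"])
    worst["hparams"] = dict(best["hparams"])

    # Explore: randomly perturb the copied hyperparameters (the "mutation").
    for name, value in worst["hparams"].items():
        worst["hparams"][name] = value * random.choice(perturb)
    return population
```

In a full loop you would train every member for n epochs, re-score them, and then repeat this step.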
And you can then take what comes out of that and use it at a later time to reduce your time to accuracy. This is a nice figure from their paper, or blog post, where they visualize population-based training. The x and y axes are actually kind of meaningless on this figure; what's important is the color of the points.
Okay, jumping over to Bayesian HPO — yes? Let's see, if I recall correctly, I think these were just two separate models they were looking at. The easiest way to think about this is the learning rate: it's pretty common for us to train with a learning rate schedule, and in fact a lot of optimizers, like the Adam optimizer, try to find that learning rate schedule on their own. That's exactly what's happening here.
Yes, so you could use this to replace that; you wouldn't use a learning rate schedule with one of those optimizers. I'm not sure I know that number off the top of my head, but it was a pretty big experiment they did here. You can check out this source, or just Google population-based training or DeepMind PBT;
it will be your first hit. Okay, jumping over to Bayesian HPO. I always like describing Bayesian optimization with pictures and figures rather than math; it's easy for your eyes to glaze over when you see all the Bayesian math. The intuitive way to think about Bayesian optimization, for me, is to look at the following figure and ask: what number would you choose?
Say you have this data of random forest results with different numbers of trees, and you are tasked with choosing the next point to evaluate in order to find the minimal error. What area would you choose on the plot? You would probably choose somewhere down here, because we're already doing pretty well in this area, and that's exactly what Bayesian optimizers are doing. So, a more formal definition of a Bayesian optimizer:
it's a sequential model-based optimization that builds a surrogate model for the objective and quantifies the uncertainty in that surrogate model using Gaussian process regression. Lots of caveats here, there are tons of variations on this, but I'm just describing the most popular approach. Here's a nice way to visualize it: we've collected a few data points along this x-axis.
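Here's a minimal sketch of that loop with a Gaussian-process surrogate and an expected-improvement acquisition function. The objective, bounds, and candidate grid are placeholders, and this is only one common formulation of the idea:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(candidates, gp, best_y):
    # Acquisition function: how much do we expect to improve on the best loss so far?
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (best_y - mu) / sigma                      # minimizing, so improvement = best_y - mu
    return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def bayesian_optimize(objective, bounds=(0.0, 1.0), n_init=3, n_iter=20):
    X = np.random.uniform(*bounds, size=(n_init, 1))           # a few random points to start
    y = np.array([objective(x[0]) for x in X])
    candidates = np.linspace(*bounds, 1000).reshape(-1, 1)
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)   # surrogate + uncertainty
        x_next = candidates[np.argmax(expected_improvement(candidates, gp, y.min()))]
        X = np.vstack([X, [x_next]])                            # evaluate the chosen point next
        y = np.append(y, objective(x_next[0]))
    return X[np.argmin(y)], y.min()
```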
Some properties of Bayesian HPO: it is ideal for optimizing objective functions with very expensive evaluations, which is true for a lot of deep learning models. However, it is best suited to a small number of hyperparameters; it has been shown to be relatively ineffective with more than about 20 hyperparameters, which is good to be aware of.
Also, the cost grows cubically with the number of evaluations you've done, so this can impact you if you do a large number of evaluations with a Bayesian HPO, though some efficient approximations exist to work around this. Again, just good to be aware of if you're employing one of these. Okay, the next strategy I'd like to talk about is Hyperband; now we're getting into some of the more recent developments in hyperparameter optimization.
So Hyperband is a successive halving algorithm combined with random search. The process goes as follows: you sample k sets of hyperparameters, you evaluate them for n epochs, and then you discard the lowest-performing half. Then you continue evaluating the remaining hyperparameters for n more epochs, discard the lower-performing half again, and run the good ones for even more epochs. This is visualized here.
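A rough sketch of that successive-halving core is below: random sampling plus repeated halving. The sample_config and train_more_epochs stubs are assumptions, and the full Hyperband algorithm additionally sweeps over different trade-offs between the number of configs and the budget per config:

```python
def successive_halving(sample_config, train_more_epochs, k=16, n=5, rounds=4):
    """Sample k configs, then repeatedly train each survivor for n more epochs
    and discard the worst-performing half."""
    configs = [sample_config() for _ in range(k)]
    for _ in range(rounds):
        # train_more_epochs is assumed to resume each config's model and return its loss.
        scored = sorted(((train_more_epochs(c, n), c) for c in configs), key=lambda p: p[0])
        configs = [c for _, c in scored[: max(1, len(configs) // 2)]]  # keep the better half
        if len(configs) == 1:
            break
    return configs[0]
```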
So this again is an early stopping algorithm that finds a nice balance in the explore/exploit problem. We start with a bunch of sets of hyperparameters, sort them by their performance after they're evaluated to some level, continue training only a certain number of them, and keep chopping the rest off until only one remains.
This works by assigning workers to evaluate hyperparameters at the bottom rung, and then, when a worker finishes its evaluation, it requests more work. If a set of hyperparameters qualifies for promotion to the next rung, it is chosen; otherwise the worker starts with a new set of hyperparameters at the bottom rung again. So this gives workers something to do.
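A simplified sketch of that promotion rule is below. It follows the general idea of asynchronous successive halving rather than any particular implementation; the rung bookkeeping and the reduction factor eta are illustrative:

```python
def next_job(rungs, sample_config, eta=2):
    """Decide what an idle worker should evaluate next.

    rungs: list of dicts, one per rung, each {"results": {config_id: loss}, "promoted": set()}
    Returns (rung_index, config) for a promotion, or (0, new_config) otherwise.
    """
    # Scan from the second-highest rung down, looking for a config worth promoting.
    for r in range(len(rungs) - 2, -1, -1):
        results = rungs[r]["results"]
        if not results:
            continue
        k = len(results) // eta                       # top 1/eta of this rung is promotable
        top = sorted(results, key=results.get)[:k]    # lowest loss first
        for config_id in top:
            if config_id not in rungs[r]["promoted"]:
                rungs[r]["promoted"].add(config_id)
                return r + 1, config_id               # promote: evaluate it at the next rung
    return 0, sample_config()                         # nothing promotable: start a fresh config
```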
Again, it gives workers something to do if their set of hyperparameters didn't work out, and you can see the resource efficiency is much nicer. Actually, on the next slide you can compare these side by side: up here we have synchronous successive halving, and I don't recall the year that algorithm was developed, but this is the more recent asynchronous successive halving from, I think, 2018, where we now have a parallel strategy for doing Hyperband.
Okay, the next strategy is Bayesian optimization and Hyperband; getting really creative with the names here, we're just combining two things, and that's BOHB. BOHB is essentially Hyperband, except instead of using random search to sample the hyperparameters, it uses a Bayesian optimizer to sample them. This is a pretty big improvement. It also supports a parallel formulation, as you can imagine, and this figure is from their paper, which they wrote a blog post about, where they show some pretty nice speedups.
Okay, some other strategies that I'm not going to go into in depth but just want to expose you to: there are Tree-structured Parzen Estimators, TPE. These are Bayesian approaches that handle categorical hyperparameters and tree structure, such as the connectivity between layers depending on the number of layers.
Okay, so now I want to give an overview of some of the different HPO software out in the wild. I am a developer on the Cray AI HPO framework, but I didn't think it would be fair to only present Cray HPO, since that would be a biased opinion. So I'm going to give a gentle overview of some of the HPO software out there, and then we're going to dive into looking at Cray HPO.
Unfortunately, as Steve was pointing out when we were discussing this earlier, development on this project has kind of fallen off, but it does seem there are still some people supporting it. There's HPOlib, part of the AutoML suite. This provides a common interface to a couple of different standalone packages that implement algorithms like SMAC and Spearmint, as well as Hyperband and BOHB. Unfortunately it looks like HPOlib is not receiving a lot of attention either, but
it is pretty useful as-is, I would say. Then there's Advisor, which contains a ton of HPO algorithms, which can be nice for just trying out different things. And lastly on this list is Cray AI HPO, the framework I'm working on at Cray. This is a distributed hyperparameter optimization framework intended for HPC users, although you can run it on your local machine as well. Currently we have grid, random, genetic, and PBT, and we are in the process of developing a Bayesian optimizer.
There's also HpBandSter, which implements Hyperband and BOHB; I believe that's the implementation from the publication. And then hypergrad is one of the gradient-based HPOs, which has a memory-usage trade-off from storing stochastic gradient descent intermediate results. It's kind of a research toy at this point, but it will be cool to see it mature.
SageMaker has an HPO suite, Azure ML has an HPO suite, and Google Cloud does as well. And then, if you recall the spectrum slide, on the far left we have some frameworks that deal with optimizing not only your traditional hyperparameters but also your topology, features, and choice of models. Here are a few examples of those frameworks. The first is a pretty big automated machine learning framework that tries to unify a common interface to a ton of underlying algorithms.
TPOT is another AutoML workflow, one that utilizes genetic programming. There's H2O, by H2O.ai, which notably supports population-based training; I think they're the only other main HPO package that supports PBT right now, and they also support distributed training.
This last one is receiving the most active development right now; currently it supports random search and Hyperband, and we'll see what they have in store, it looks like an exciting project. Okay, so next I want to transition over to some general practical tips for hyperparameter optimization, now that you have an overview of the available algorithms and the available software out there. Like I mentioned before, deep learning in general has long evaluations.
So the HPO process is going to take a long time; expect HPO runs to take anywhere from hours to weeks, depending on how long your training takes. Choosing the wrong search space for your algorithm can therefore have large consequences, so it's worth taking the time to plan your experiment: how you're going to search your hyperparameter space and which hyperparameters to use.
If you have distributed resources available to you, you should definitely utilize some kind of distributed HPO software package; there's no reason not to, with so many of these HPO algorithms being embarrassingly parallel. As mentioned in, I believe, Brenda's talk, you should use a development data partition carved out of your validation set to tune your hyperparameters. This is just good practice to make sure you're not overfitting to your validation set.
It's also important to remember that we're not trying to find the global minimum. Without some kind of cross-validation baked into your measure of hyperparameter performance, you're definitely going to overfit if you optimize too much, so you either need to bake in some kind of cross-validation or just be careful about optimizing too much.
Okay, on choosing hyperparameters: you want to utilize your domain knowledge about the model to focus on the important hyperparameters. It's important to start with the initial learning rate. The next good candidate is the learning rate decay schedule, such as the decay constant, and lastly, regularization strength, such as the L2 penalty or dropout strength, is a third candidate to consider. As was also mentioned earlier this week, be careful about pairing incompatible loss functions and activation functions.
Also limit your search space: starting from a coarse-grained search is a reasonable approach, and you can do a hierarchical search from there. You want to use a log scale for multiplicative hyperparameters such as learning rate, momentum, or regularization strength; for something like dropout rate, you would just want an absolute scale.
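For example, a small sketch of what sampling on those two kinds of scales might look like (the ranges here are purely illustrative):

```python
import random

def sample_hyperparameters():
    return {
        # Multiplicative hyperparameters: sample the exponent so values are log-uniform.
        "learning_rate": 10 ** random.uniform(-5, -1),   # roughly 1e-5 .. 1e-1
        "l2_penalty":    10 ** random.uniform(-6, -2),
        # Additive hyperparameters: a plain (absolute) uniform scale is fine.
        "dropout_rate":  random.uniform(0.0, 0.5),
    }
```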
All right, some tips on choosing an HPO strategy: grid search is bad, don't use it.
I also want to mention that it's not uncommon to mix and match HPO strategies. As I said earlier, you can do a hierarchical search, and in doing something like that it's perfectly reasonable to start with, say, a random or genetic search for a broad search that includes your topology, and then switch over to a Bayesian search
once you have locked in some initial hyperparameters and only want to tune a smaller number of them, because Bayesian optimizers do better with a smaller number of hyperparameters. Things like PBT and Hyperband, because they have that early stopping mechanism, cannot be used for topology search; but once you have your topology locked in, it can make sense to switch over to using PBT to acquire a reusable learning schedule. That was kind of throwing a lot at you — yes, question?
So if you're storing some kind of intermediate results, you should be able to look over your data, and plotting it is a really good way to visualize how different hyperparameters impacted the accuracy. That's a really good point, and I didn't want to just leave it at that. You know, best practice for hyperparameter
optimization is really still kind of an open research question, and there are lots of people working on it. So I just wanted to point to a couple of resources, some of which contributed a lot to the tips here. If you want to look more into what some of the more recent practices are, these are a couple of good resources to check out.
Okay, so with that I'm going to transition over to talking about Cray AI HPO. This is Cray's hyperparameter optimization framework. I call it an emerging hyperparameter optimization framework because it's still under active development and not 1.0 yet; we consider ourselves an alpha release right now, and we are reserving the right to make breaking changes to the interface, which is actually happening right now. It's portable, so it can run on your desktop or on a supercomputer.
It has a lightweight black-box interface: it defines an interface to just an executable on your file system, and that can be anything you want it to be. So it can be a Python script using any of these machine learning toolkits, or it can be,
you know, a Fortran program or something, if you're using Fortran for deep learning — I don't know, there are DoD folks here. Okay, so as I mentioned, it's distributed in HPC environments; it supports distribution out of the box. The mechanism it uses is to interface directly with the workload manager on the machine, and it supports two different types of distribution. We can do distributed HPO, where we're evaluating multiple sets of hyperparameters simultaneously, and we also support distributed model training,
where, say, you have an allocation of 64 nodes and each evaluation uses the distributed TensorFlow package, so each evaluation could be running on 16 nodes within those 64 nodes. So there are two different types of distribution that can be used simultaneously. An important feature is that we've tried to design the low-level interface, which hasn't really been exposed to the public yet but we plan to, to be fairly simple and generic so it can support anyone.
Just a quick blurb on Chapel: Chapel is a modern, productive, parallel programming language. It's open source, scalable from laptops to clusters to supercomputers, and it strives to be as performant as Fortran, as portable as C, and as elegant as Python, while doing all of this in a distributed parallel setting.
Cray AI projects utilize Chapel for a number of reasons, mostly for its modern language features: built-in shared-memory parallelism and built-in atomics in the language, great interoperability with Python and Fortran, and a lot of other great modern programming language features like generics, type inference, and memory management strategies.
Okay, so I'm going to walk you through the components of a Cray AI HPO workflow; if you're starting from scratch, this is what you have to do. There are two parts: the training kernel and the HPO driver. The training kernel is the model training program to be optimized. This is what you're starting with: say you have a Jupyter notebook from one of these handouts, and you have
some code in there that trains a neural network and then prints out the accuracy. That would be your model training program, your training kernel. The interface we define here is that you expose those hyperparameters through command-line arguments. So in the Jupyter notebook case, once you have your code relatively stable, you would
put it into a standalone Python script that you call, and you would maybe import argparse and expose the hyperparameters. Then we also need to expose the figure of merit, which you can think of as our cost function; this is the value to be minimized in our hyperparameter optimization, and it is simply printed to standard out with a unique identifier.
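Here's a minimal sketch of what such a training kernel might look like; the flag names, the "FoM:" tag, and the dummy loss are illustrative assumptions, not the framework's required spelling:

```python
# train_model.py -- a hypothetical training kernel for a black-box HPO framework
import argparse

def train(lr, dropout):
    # Stand-in for real training; return the validation loss to be minimized.
    return (lr - 0.01) ** 2 + (dropout - 0.2) ** 2

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--lr", type=float, default=0.01)       # hyperparameters exposed
    parser.add_argument("--dropout", type=float, default=0.2)   # as command-line flags
    args = parser.parse_args()

    loss = train(args.lr, args.dropout)
    # Print the figure of merit to stdout with a unique identifier the optimizer can parse.
    print(f"FoM: {loss}")
```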
As I mentioned, your model training program can be written in anything: for example, Python plus any of your favorite frameworks, or Julia, whatever you want. Okay, the second part is your HPO driver. This is the program that actually imports the Cray AI module; it's the program being used to optimize the hyperparameters of the training kernel, and this one actually must be written in Python.
All right, so say you start with a Python script that you want to do HPO on — I kind of already went through this. You would modify that Python script to expose the hyperparameters and print the figure of merit, and then you would write your HPO driver, and I'm going to walk you through that now. Okay, hopefully that's visible, it's kind of dark in here. The first step is to expose the hyperparameters.
In this example we're exposing our learning rate and our dropout rate, and then we're using those flags within the script itself rather than plugging in hard-coded values. Then we print out, in this case, the loss value we're trying to minimize: we print our figure of merit identifier, which is "FoM" by default, though you can set it to whatever you want, so that the optimizer can pick it up. Now, jumping over to the driver code:
the training kernel is called train_model.py, and we're going to provide our hyperparameter flags. We exposed the learning rate and dropout rate, so we plug those in here into this hyperparameter list of lists, and the structure goes as follows: you have the flag itself as a string, the default value, and then your bounds for the search space. Then you set up your hyperparameter optimizer, here we're using the genetic optimizer, and you just call optimize on your parameters, and you can get your best figure of merit
that was printed out, and your best set of hyperparameters, from Python memory. We also log a bunch of data along the way. Okay, a quick note about the Params class: it accepts a list of lists, where each of those lists has the hyperparameter flag, the default value, and the search space, as I said. It's worth mentioning that the values can be integer, float, or string, and the search space can be a tuple of bounds or a list of values.
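Putting that driver description together, it might look roughly like the sketch below. To be clear, the class and argument names here are my paraphrase of what's described in the talk (the hpo submodule, an evaluator around the black-box script, the list-of-lists params, and a genetic optimizer), not a verified copy of the Cray AI HPO API:

```python
# hpo_driver.py -- rough sketch of an HPO driver, following the talk's description
from crayai import hpo   # the talk imports an `hpo` submodule; the exact API may differ

# How to evaluate one set of hyperparameters: run the black-box training kernel.
evaluator = hpo.Evaluator("python train_model.py")

# List of lists: [flag, default value, search-space bounds] for each hyperparameter.
params = hpo.Params([
    ["--lr",      0.01, (1e-5, 1.0)],
    ["--dropout", 0.2,  (0.0, 0.5)],
])

# Genetic optimizer; the grid and random optimizers follow the same pattern.
optimizer = hpo.GeneticOptimizer(evaluator, generations=10, pop_size=10)
optimizer.optimize(params)

print(optimizer.best_fom)     # best figure of merit seen
print(optimizer.best_params)  # the hyperparameters that produced it
```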
There's also the Evaluator class, which, as already mentioned, is used to describe how to evaluate a set of hyperparameters, and it has a lot of bells and whistles. Most importantly, there are different types of launchers you can use. As I mentioned, we interface with the launcher directly, so you could be using Slurm, PBS, nothing at all if you're just running on your desktop, or the Urika launcher.
The main thing I want to point out here is that when you are running distributed, you just set your nodes equal to some amount, and Cray AI is typically able to infer the workload manager you're using. So if you were on, say, NERSC and you set your nodes equal to 4, it would try to salloc 4 nodes for you and run its evaluations on there. Alternatively, you could do an interactive salloc, and once you're on the allocation you could run it there.
Okay, so a feature overview of the HPO framework: we have grid, random, genetic, and Cray PBT — I'll talk a little bit about why it's called Cray PBT in a little bit — and we have Bayesian on the way. Looking at the grid interface: I already walked you through most of how this interface works, but real quick, we're importing our hpo submodule and setting up our evaluator that's going to run some train.py that's
going to train, say, our neural network and emit a figure of merit. Then we have our list of hyperparameters with their default values and search spaces; here we just have a, b, and c, each searched from negative ten to ten, starting at zero. You can use the grid optimizer, which you really shouldn't use since we have better algorithms available, but it's just kind of a benchmark.
You can set your grid size, which is how you'd like to split up each hyperparameter's range; this would split negative 10 to 10 four ways. Your chunk size is how many evaluations to do before reporting back results, basically how frequently you'll get feedback. Then we call optimize on the params. Jumping over to random:
I'm using the same example here; the only difference is we use the random optimizer, and our arguments to the random optimizer are specific to it. Here we specify a number of iterations, so we're just doing a thousand iterations randomly sampling these hyperparameters. Then our genetic optimizer interface has a lot of different values you can set; the ones shown here:
you can set your number of generations — here we're doing ten generations — a population size of ten, and then four demes. A deme is a subpopulation that helps you avoid getting your whole population stuck in a local minimum: you can start your demes out in different locations and let them evolve separately, occasionally doing migration between them.
Okay, and then jumping over to a distributed genetic HPO example. We've switched up our hyperparameters here to something more realistic, and we have a few more arguments exposed: you can specify your mutation rate, crossover rate, and where you'd like to log your global results. But the key thing to note here is that we've just specified our number of nodes, and that's really all we need to do to enable distributed HPO, if you have an allocation.
If you recall, we support two different types of distribution. Say train.py is actually a distributed evaluation, distributed training. We've specified n equals four, just pretending that that is how you run this script with four nodes, and then we have to tell the evaluator that we're going to run it with four nodes, so that it knows to tell the underlying workload manager to run it over four nodes. And so with sixteen nodes and four nodes per evaluation, we're going to be running four evaluations at a given time.
Oh, sorry, actually just a quick correction: the driver script you're looking at here has to be a Python script, because it's calling the Cray AI library, but the training kernel can be whatever you want; it's just a black box. And yes, I think I understand your question — that part is kind of all put on the user here.
So, just showing some data collected with the Cray HPO framework: this is LeNet on MNIST. LeNet is a neural network primarily trained, or optimized, for image recognition, and MNIST is the classic hello world of machine learning, where you have handwritten digits that you're trying to identify. This is what that looks like in Cray HPO, and this is actually a particularly big run.
We're doing 250 generations with a population size of one hundred and only one deme, so we're not using that feature. This is going to do quite a few evaluations, but we're searching a pretty big space here. If I go back, we're actually searching the topology of these layers, specified through these arguments here, and we're also searching the momentum and dropout.
So this is just showing an example of how you can do a topology search in Cray HPO by exposing these hyperparameters via command-line flags and searching over this integer space. Now, it's worth mentioning that this looks clean on this side, but on the user side they do need to handle that inside their mnist.py.
Fortunately, I think this one's relatively simple, but there are cases where you have dependencies between hyperparameters; say we also wanted another hyperparameter that depended on, say, this hyperparameter here. We can't do that today in Cray AI; we don't support dependent hyperparameters.
There are a handful of frameworks out there that do support that, though. Okay, just showing some results: this is LeNet on MNIST with the genetic algorithm applied. This is with the original hyperparameters chosen from the paper and the accuracy they reached, and here's our genetic search reaching that accuracy in a much shorter training time after finding the optimal hyperparameters.
Okay, next I'm going to jump over to Cray population-based training. Cray's population-based training implementation has a few extensions to DeepMind's original PBT; if you want more information, it's in the paper linked here. The main extension is that we use reproduction with a probabilistic multi-point crossover between three parents instead of the usual two.
There's also a redesign of the interface underway. The way you enable PBT today is that you still use a genetic optimizer — the underlying algorithm is still a genetic optimizer — but we do this early stopping throughout the genetic optimization. The key way to turn on PBT today is to enable a checkpoint file or checkpoint directory, by passing this checkpoint argument to the evaluator, and then you also need to include a checkpoint variable, specified by this @ symbol, in your command-line
flags to your training kernel. Your training kernel needs to take these flags, take these values, and say: I have a checkpoint directory with a model in it, I know I need to load from that one; and then take this other one and say: I have a checkpoint directory path to another model, I know I need to save to that one. The optimizer framework is going to be handing these paths to the evaluator, but the training kernel needs to know what to do with them.
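So the training kernel might handle that roughly like the sketch below; the flag names and the load/save helpers are illustrative assumptions about what gets handed to you, not the framework's exact contract:

```python
# train_model.py (PBT-style) -- hypothetical handling of checkpoint paths from the optimizer
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--load_ckpt", default="")   # checkpoint directory to resume from, if any
parser.add_argument("--save_ckpt", default="")   # checkpoint directory to save into
args = parser.parse_args()

model = build_model()                            # assumed to exist in your script
if args.load_ckpt and os.path.exists(args.load_ckpt):
    model.load_weights(args.load_ckpt)           # resume from the model the optimizer chose

loss = train_for_n_epochs(model, lr=args.lr)     # train the next segment (assumed helper)

if args.save_ckpt:
    model.save_weights(args.save_ckpt)           # save so the optimizer can copy/promote it

print(f"FoM: {loss}")
```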
And that's all you need to do to turn on PBT today. We are going to be moving to a different interface where we have a standalone PBT optimizer that you use instead. Just to show some data from our PBT implementation: here's ResNet-20 on the CIFAR-10 dataset, which is — I forget the exact number — a very large dataset of images with 10 different classes.
We're going to take the original ResNet-20 hyperparameters and try to optimize those. What you can see here is that with PBT we're discovering an improved training schedule over the original learning rate and weight decay. This black line here is the weight decay from the original paper, and this jagged,
let's see here, sorry, this jagged red line is the learning rate from the original paper. You can see that PBT, optimizing the hyperparameters as the model is being trained, finds a different, more optimal learning rate schedule, as well as a separate weight decay schedule, for training our model. I guess this slide doesn't show that it's actually better, but on the next slide you can see that.
Here is the error of the original ResNet-20 approaching convergence down here, and here it is with the PBT learning schedule that we found; you can see that we drop much quicker. There is a point where the curves cross, but this is optimized to find the best accuracy overall, and we do end up reducing the error over the original ResNet-20 by 11%. And this just highlights that you can do this in a distributed environment.
Okay, and then I'm just going to talk about some ongoing work with Cray AI, some of which I've already mentioned. We want to continue improving the features and stability, and we want to support more launchers than we do today; today we have Slurm, PBS, and Urika systems. We want to support Jupyter integration: today you can run Cray AI in a Jupyter notebook, but unfortunately, like any Python program that calls out to a non-Python program,
you have this problem in Jupyter notebooks where anything written to standard out or standard error from the non-Python program does not get piped forward to the output in Jupyter. There are some solutions to that which we're looking into. The result is that if you do go and pick this up and run it in a Jupyter notebook, it's going to look like it's totally unresponsive, because it will be running your training for a very long time, and then it'll finish at some point and
you'll get some output. We want to continue to implement new strategies: we have Bayesian on the way, and we'd like to implement some of the more modern approaches from recent developments. And of course I would like to open source it; we would like to open source it as a team. Just to give you a bigger-picture idea of what we're doing at Cray:
this is just one AI workflow component of many that we're planning to develop. Cray also has plans to develop an AI workflow framework where hyperparameter optimization is just one stage. So this is Cray AI HPO; our next target is feature selection, which may not be as important in deep learning, but we hope it will be something useful to machine learning workflows in general.
I guess I could do a quick demo. Let me — well, okay, let me do my acknowledgments, and before you clap I'll do a quick demo. Quick acknowledgments: I'd like to acknowledge some people from the AI team at Cray: Alex Heye; Aaron Vose, who was the original author of the Cray PBT; Alessandro, who contributed the Bayesian optimization; Benjamin Robbins, my manager; and Zach and the Chapel team, who made all this possible. And then Steven at NERSC for providing a lot of user feedback.
Okay, and I'm going to jump over to a quick live demo of this, and then we'll take questions. All right, so here's just the quick random-search example I showed earlier; here we're specifying our seed and our optimizer, and we're going to optimize this set of parameters. This example is kind of our hello world example.
We show it not because it's anything interesting, but because it shows results quickly; HPO of machine learning and deep learning models in general takes a lot of time, so it's nice to demo something that evaluates quickly. Here we're just creating a sixth-order polynomial and trying to fit it to a sine wave in the range of 0 to 100.
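A toy training kernel along those lines could look like this sketch; the coefficient flags, the sample range, and the "FoM:" tag are illustrative rather than the exact demo code:

```python
# poly_fit.py -- toy "training kernel": how well does a 6th-order polynomial match sin(x)?
import argparse
import numpy as np

parser = argparse.ArgumentParser()
for i in range(7):                                  # coefficients c0..c6 as hyperparameters
    parser.add_argument(f"--c{i}", type=float, default=0.0)
args = parser.parse_args()

coeffs = [getattr(args, f"c{i}") for i in range(6, -1, -1)]   # highest order first for polyval
x = np.linspace(0, 100, 1000)
error = np.mean((np.polyval(coeffs, x) - np.sin(x)) ** 2)     # mean squared error vs. the sine

print(f"FoM: {error}")    # figure of merit for the optimizer to minimize
```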
So we're just printing out our baseline hyperparameters and running random search with a hundred iterations. I just ran a very short example, but we print out the best hyperparameters we found; this was the figure of merit it evaluated to, which was 1.2 times better than the original set of hyperparameters — so not a huge improvement, but at least it did improve. And then we print out the figure of merit and the full set of hyperparameters down here. I'll show one more quick example, and then we'll
look at our results in the CSV file. So I'm going to run the genetic example, and while that's running I'll just point out that for each generation it prints out the global best: the identifier of the best individual and its figure of merit, as well as how much better it has done relative to the initial set of hyperparameters, so this one is 1.7 times better. It also shows the global average over all the hyperparameters that were evaluated.
We get this set of hyperparameters listed, and then we get some information about the breakdown of the demes, so you can track how your populations are progressing and the best set of hyperparameters per deme. Then we get some timing outputs: if you're writing some large checkpoint files, it can be important to track how much time it's taking you to write and read those and see if they become a bottleneck at some point. And then this should be done now. Oh, it's not.
Okay, and then we get our best set of hyperparameters: this one found a 4.9x, almost a 5x, improvement over the original set of hyperparameters. And then, lastly, it prints out these files, so you can go and print one of them; it's just a big CSV file with a bunch of data: all of your hyperparameter values, the fitness, the figure of merit, and so on. There's also a global file with global information on all of the evaluations.