Description
More about this lecture: https://dl4sci-school.lbl.gov/richard-liaw
Deep Learning for Science School: https://dl4sci-school.lbl.gov/agenda
A
Thank you, Richard. There are a few questions in the Q&A, if you would like to take some of those live.
B
How about I just read you one right now: can Ray Tune also spit out the posterior of the hyperparameters, and the posterior predictive of the neural net using those hyperparameters?
B
Yeah, so this is a great question. I think the main focus of Ray Tune is to provide an execution framework, and again, as mentioned in the talk, Ray Tune integrates with all sorts of different optimization libraries, specifically these model-based optimization libraries.
B
These model-based optimization libraries typically build a predictive model over the hyperparameter space, and those models can typically be queried for posterior values. So, to answer this question: it depends on the particular library you're using for optimization.
B
Yeah, I can actually go through these myself. So, let's see, one question was: how can Slurm be used with Ray, and what is involved in specifying available resources to Ray? This is a great question. Essentially, you request Slurm to provide multiple nodes, and what you need to do is just start the Ray service on top of each one of them.
B
We have documentation about how to run Ray with Slurm, and I understand Mustafa has this as well. I personally don't use Slurm, I'm more based on the cloud, but there are many users that have gotten this working, and if you run into questions or issues running the examples in the documentation, feel free to reach out on the Slack.
A
Yeah, so we also have, as you mentioned, the NERSC Slurm repo, with examples of how to actually build a Ray cluster with Slurm. So I can refer you to that on Slack, if you send me a message on the lecture channel.
B
So
another
question
about
slurm
was
there's
a
time
limit
on
jobs.
How
does
ray
handle
that
property
for
restarting
of
slurring
jobs
right
so
right
now
we
don't
have
automatic.
Restarting,
however,
tune
has
automatic
checkpointing,
so
it
can
essentially
allow
you
to
in
certain
configurations
you
can
restart
the
job
from
exactly
where
you
left
off.
B
So I guess a more general question for Ray: does it handle node failures, like what happens when a node goes down? Typically Ray is fault tolerant. That means that if one of the worker nodes goes down, the Ray job continues to run.
B
However, if the entire cluster goes down, it becomes a little bit more difficult.
B
All right, let's see. Oh: as a rule of thumb, which hyperparameters often give the biggest bang for the buck to tune? So this is a great question, and I was actually pretty surprised when I saw one of the importance plots from a typical hyperparameter tuning run of mine. I think, by default, the most important hyperparameter is learning rate.
B
I
think
the
important
like
whatever
metric
was
roughly
like
70
of
the
performance
across
all
the
all.
The
different
parameters
that
I
was
sweeping
over
was
attributed
to
the
ring
rate,
and
it's
been
over
and
over
again
that
learning
rate
tends
to
be
what
really
decides,
how
how
well
your
your
mo
deep
learning
model
is
performing
so
yeah
and
I
guess,
for
random
forest.
You
probably
end
up
with
like
the
size
of
the
estimators
and
the
number
of
estimators
and
the
the
depth
of
the
tree.
B
Those typically are the most important hyperparameters.
B
But since there are so many machine learning models, I can only speak about probably the two most general ones. Let's see.
B
Yes, so one of the attendees asked: you mentioned Ray Tune was made with deep learning in mind, so does it work well with other machine learning models? And the answer is yes, you can.
B
You
can
essentially
provide
any
deep
learning
model
or
sorry
any
machine
learning
model
anything
that
actually
just
returns
a
a
sort
of
objective
function
and
tune
essentially
allows
you
to
orchestrate
and
execute
a
essentially
an
optimization
process
over
over
this
object.
This
objective
function
providing
a
python
object.
B
So,
oh
one
question
was,
or
I
guess
two
questions
were:
how
do
you
specify
the
algorithm
for
tuning
array
tune,
and
richard
mentioned
that
there
was
built
in
support
for
scikit
optimized?
I
was
wondering
how
this
is
specified
so
right.
So
this
is
a
great
question.
I
think
where
you
want
to
go,
is
you
want
to
go
to
the
documentation
page?
So
actually,
let
me
just
quickly
do
a
walk
through
the
documentation
page
that
maybe
that
would
be
helpful.
B
Right, so here we have docs.ray.io.
B
Right, so here's the Ray documentation, or maybe I should make it a lot larger. If you go down to tutorials and guides, there's a quick walkthrough of all the concepts that you might want to know, and what the attendee is asking about specifically, how to choose the optimizer, is what we call a search algorithm. So, for example, here we're using a wrapper around a popular library called hyperopt, and if we want to actually use this, you would do it similar to what's presented on the screen.
B
It's just a one-line extension of the Tune execution function.
B
Now, if you wanted to look at all the different algorithms that are provided for you, here's a list of all the different integrations that you can choose with Ray Tune, and each of them has its own documentation and different features that you can interact with.
B
Right, so another question was: how does Ray communicate across nodes? Ray does not use files; it opens sockets between the different nodes and communicates mainly through TCP. Ray does not use MPI underneath the hood, but Ray instances are able to communicate with each other.
B
Yeah, another question, I guess more specifically, was about the distributed aspect of Ray Tune. So specifically, how Ray Tune works is: we set up the Ray cluster underneath, on Slurm, and then on top of the Ray cluster...
B
You can execute your Ray Tune tuning run. Ray provides a really simple abstraction for creating actors, which you can think of as distributed Python objects, and you can interact with these distributed Python objects through the Ray API. Specifically for Tune: Tune essentially constructs a bunch of these different objects, which get placed across the Ray cluster.
B
These are the different actors, and you can communicate with the actors to retrieve the most recent training result or, say, change the hyperparameters on that particular object, and this allows us to easily implement population-based training and Bayesian optimization. So, again, to answer the question of how communication is handled between nodes in Ray Tune: the short answer is, using the actor framework provided by Ray.
B
All right, so now there are a couple of questions on the notebook, and perhaps I'll just do a walkthrough of a notebook on Colab. I guess we'll do the TensorFlow Colab notebook, and let me also just quickly post the...
B
So yeah, here it is: how do I post...
B
You don't change any of the code; you add one line of code, and there are tutorials telling you which line of code that is. All you need to do is provide an underlying Ray cluster, and the same code that you used for tuning on a single node can then be scaled across multiple nodes. So where I will go first is to the Tune tutorials, and specifically to this section down below, which is the Colab exercises.
B
It's a very simplistic example, but it overviews all the core features that you might be interested in using. So the first thing I'll do is uncomment the first section, which installs dependencies on Colab.
B
So, as a quick walkthrough: this tutorial will cover the process of visualizing the data, so you understand what we're working with; creating a neural network, similar to something you saw last week by this time, using TensorFlow; tuning this provided model by using Ray Tune; and analyzing the model by using some of Ray Tune's analysis objects.
B
And here we have a couple of different flower characteristics, and you can see that a couple of these characteristics are more representative, or allow you to separate the flowers better. So there are three different flowers, they all have different characteristics, and some of the characteristics are more telling.
B
It's a function that creates a neural network. Keep in mind that we're not actually instantiating this neural network; we're just defining it inside the function. This is important because, in order to communicate across nodes, Ray depends on serialization, and oftentimes machine learning models have trouble being serialized. To serialize essentially means that you are able to capture the model in a byte representation, transfer it across the network, and reconstruct that byte representation into a neural network.
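The serialization round trip described here can be sketched in plain Python. Ray uses its own serializer under the hood; `pickle` just illustrates the byte-representation idea, and the model dict is a stand-in for a real network.

```python
import pickle

def model_creator():
    # Defining the model inside a function, as the notebook does, keeps
    # things easy to serialize: the function ships code, not live state.
    return {"weights": [0.1, 0.2]}

payload = pickle.dumps(model_creator())   # capture as bytes
restored = pickle.loads(payload)          # reconstruct on the far side
```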
B
So here's another function that we defined. First, it essentially trains the model, and there is a nice feature in Keras, or TensorFlow, called a callback, which is essentially a hook that gets invoked every iteration, every time you do an update. Here we have a callback that helps us checkpoint the model, so that we can preserve it and save it to use after the training process.
B
So let's just quickly check that this works; we should see an accuracy of about 0.368. So that was mildly interesting. Let's now go to how we might use Ray Tune, with a callback, with the Keras model. So here we define a simple callback.
B
It
literally
take
it's
essentially
one
tune
call
which
allows
us
to
report
the
training
function
or
the
the
training
output.
You
can
call
this
method
anywhere
within
the
training
function
that
this
this
callback
happens
to
be
part
of
the
model
which
happens
to
be
invoked
within
the
training
function.
B
So here are a couple of exercises, but I'm just going to quickly add this in. Essentially what I'm doing is porting the same code that we saw above to use Tune.
B
So what is going to happen now is that this function is going to be invoked many times, in parallel, across all the available cores on your computer. And again, if you're on a cluster, then this function is going to be invoked, you know, 100 times, if you had a hundred cores on your cluster. So we'll define that for now, and then the second step, after we've converted the training function to use Tune, is to define a hyperparameter space.
B
So what we're going to do specifically is define the learning rate to have a uniform distribution over the log space from 0.001 to 0.1, then set some model architecture parameters, and then we'll also specify the number of trials that we're going to evaluate.
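A log-uniform draw like the one described can be written out in plain Python. The notebook itself uses Tune's built-in sampler for this; the helper below just shows the underlying math of "uniform over the log space".

```python
import math
import random

def loguniform(low, high):
    # Sample uniformly in log space, then map back through exp: this is
    # what a log-uniform distribution over [low, high] means.
    return math.exp(random.uniform(math.log(low), math.log(high)))

# Draws concentrate evenly across orders of magnitude between 1e-3 and 1e-1.
samples = [loguniform(1e-3, 1e-1) for _ in range(1000)]
```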
B
So hopefully this works out of the box. You might see a couple of warning messages, but most of them are harmless and disappear after a while.
B
So what you see on the screen is a self-refreshing tabular format that tells you the current progress of the hyperparameter tuning. It also presents all the different configurations that you're using, so all the different hyperparameters that you're trying, in addition to the corresponding accuracy of each one. Since we have two CPUs, we're actually just evaluating two trials at once, and each one takes about three seconds to evaluate.
B
So as this is going, it's actually outputting a couple of files to this results directory, which you can configure, and you can then use this results directory to visualize outputs. There are also a couple of log files, such as some CSV formats, that you can parse yourself.
B
So now we're done with the hyperparameter tuning run, and we want to identify the best tuned model.
B
So specifically, what we're going to do is again create this data locally, and then we're going to plot it and see that this is our test data.
B
So what we're going to do now is take this analysis object that was returned from tune.run, and we're going to leverage a couple of its calls: its dataframe, and also its ability to give you the best log directory of a trial.
B
So I guess a bit of context here: we saw above that there was a directory called root/ray_results/... let's see, there's only one directory here, this one with "iris" in the name.
B
So this is the experiment log directory, but if you look inside, there are actually 20 different folders, one for each of the different trials, with the different hyperparameters, that we ran. Within any single one of these...
B
We
actually
can
see
that
there's
a
couple
of
different
files
that
we
can
get.
In
fact,
one
of
them
is
the
model
that
we
saved.
So
what
we're
going
to
do
now
is
we're
going
to
just
use.
The
analysis
object
that
we
got
from
tune.run
and
we're
going
to
obtain
the
best
log
directory
corresponding
to
this
particular
metric
minimizes,
so
the
best
meaning
the
minimum
one
and
this
validation
loss
again
was
provided
through
the
the
tune.
Callback.
B
And then, in comparison to the ground truth, again, we saw that this is perfect.
B
So hopefully this one works: what you can actually do is use TensorBoard within this Jupyter notebook to visualize your results too. So what we're going to do is point TensorBoard to the experiment directory, which allows us to visualize all the different trials at once, and hopefully this works. All right.
B
You
try
it
for
yourself,
there's
no
black
magic
here
and
and
another
nice
thing
is
about
the
visualizations.
You
can
also
click
the.
I
think
this
should
work,
but
I'm
not
totally
sure.
B
Yes, there we go. Right, so Tune automatically takes care of the hyperparameter visualization, which allows you to essentially track which metrics matter and how the metrics correspond to each other. So if we just filter out a couple of these extra metrics that Tune provides, what we see is what the mean accuracy corresponds to.
B
You'll,
see
that
there's
a
lot
of
variance
across
these
different,
dense
layers,
and
my
reading
on
this
is
that
there
might
be
an
inter
there
might
be
some
relationship
between
dentist,
one
and
dance
two,
but
most
importantly,
the
learning
rate
is
what
decides
the
the
performance
of
the
model
so
yeah.
So
that
was
a
just
a
quick
overview
of
how
you
might
use
tune
for
a
typical,
a
very,
very
easy
hyper
primary
tuning
configuration
for
a
hyperion
tuning
run.
B
So, Mustafa, what do you think? What should we do?
A
I think we have a lot of questions left, actually. If you would like to answer some of them, like one or two questions, that's good, and we can also post the questions on Slack, and then you can answer them later in your own time.
B
All right, how about I do this: there are 16 questions, I'll answer eight, and the rest of them we can do on Slack.
B
All right, okay. How many nodes are okay before distributed Bayesian optimization won't be effective? Oh, this is an interesting question, and I would say the correct response is to count it in terms of the number of parallel trials before distributed Bayesian optimization won't be effective. The number of parallel trials sort of corresponds to how many you're willing to do at once. So let's say, for example, you have a hundred different trials that you want to run...
B
You
know,
evaluate
100
different
trials
if
you
did
say
like
100,
parallel
trials
at
once,
and
you
had
or
you
had
you
know,
100
parallel
gpus
that
you
could
access
then
running
100
trials
at
once
will
not
allow
you
to
leverage
a
prior
information
to
guide
your
search.
B
And if you do something in between, like, let's say you have 20 GPUs and you want to evaluate 100 different trials, there's going to be delayed feedback. So if you run 20 at once, in parallel, only on the 21st trial will you actually be able to leverage some prior information.
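The 20-of-100 example above works out as follows, as a toy illustration with the trial counts taken from the answer:

```python
# With `parallelism` trials launched blind at the start, the surrogate
# model first has completed results to condition on at trial
# parallelism + 1.
parallelism, total_trials = 20, 100

blind_trials = list(range(1, parallelism + 1))
informed_trials = list(range(parallelism + 1, total_trials + 1))

first_informed = informed_trials[0]
```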
B
So in some sense, you don't lose that much, depending on what your configuration is, but I think the only thing to keep in mind is that there is a delay in feedback, and the first couple of runs are not going to be able to leverage the model that you're building up.
B
Let's see. So: does Bayesian optimization, and other advanced methods, work well for non-convex problems? My understanding is that deep learning, for the most part, unless you're deliberately trying to make it a convex problem, is non-convex, and we've seen Bayesian optimization work well for many deep learning models.
B
In general, how do you decide how many trials to conduct with a given hyperparameter optimization algorithm, to ensure that you haven't missed the most optimal regions? So I guess there's always this illusion of optimality that we get in hyperparameter tuning.
B
Essentially, say I had a dozen hyperparameters, and for each of them I want to evaluate three different values. Then essentially I have this massive grid of hyperparameters, and it's 3 to the power of 12. That means that if I really wanted to find the absolute optimum, it would take over 500,000 evaluations, because there's no absolute guarantee that any particular parameter value that you choose is going to be optimal.
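The arithmetic behind that estimate:

```python
# 12 hyperparameters with 3 candidate values each: the exhaustive grid
# has 3**12 combinations, i.e. over half a million evaluations.
n_values, n_params = 3, 12
grid_size = n_values ** n_params
```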
B
What
that
means
is
you'll
start
with
the
very
core
screen
search
and
then
you'll,
slowly
narrow
down
the
search
until
you've
kind
of
identified
and
how
to
have
a
good
understanding
of
how
each
of
the
what
are
the
most
important
hype
parameters
and
what
would
what
would
be
able
to
be
done
to
in
order
to
optimize
performance
and
typically
you'll,
see
that
the
hyper
parameter
tuning
methods
are
only
going
to
provide
you
a
small
boost
over
like
some
defaults,
and
it
might
be
smarter
to
step
back
and
reevaluate
how
you're
designing
your
model
instead
of
trying
to
spend
so
much
money
or
a
lot
of
time
like
finding
the
optimal
hybrid
parameters.
B
I would say, in terms of research, the most beneficial thing the hyperparameter tuning frameworks can provide is an understanding of the relationships that you've designed your model to have, so an understanding of the relationships between the hyperparameters. And that's why the parallel coordinates plot is incredibly important, and that's why people are still doing grid search.
B
Does
ray
support
conditional
interactions
between
hyper
parameters?
Yes,
it
depends
on
also
depends
on
the
hyperparameter
tuning
library
that
you're
using
you
typically
specify
a
search
space
within
the
hyperparameter
tuning,
or
you
specify
your
conditional
operators
within
the
hyperimagining
space.
For
example,
you
might
say
hey.
I
want
four
layers,
but
I
want
one
to
four
layers,
but
if
I
had
a
fourth
layer,
then
I
want
to
have
the
fourth
therapy
from
50
to
100
width.
B
But
then,
if
I
had
three
there's,
then
this,
like
fourth
value,
doesn't
really
matter
so
a
lot
of
hypergram
tuning,
optimization
libraries
allow
you
to
specify
a
search
space
that
that
can
express
this
and
tune
it
sort
of
agnostic
to.
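The conditional space in that example can be sketched in plain Python. A real specification would use the chosen search library's own conditional primitives (for example hyperopt's nested choices); the sampler below is purely illustrative.

```python
import random

def sample_architecture():
    # One to four layers; the width hyperparameter for a layer only
    # exists when that layer is actually present, so a "fourth width"
    # is never sampled for a three-layer network.
    n_layers = random.randint(1, 4)
    widths = [random.randint(50, 100) for _ in range(n_layers)]
    return {"n_layers": n_layers, "widths": widths}

configs = [sample_architecture() for _ in range(100)]
```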
B
What, if anything, does the population-based tuning approach do when changing categorical values? So: how does population-based training perturb hyperparameters during training? What's proposed in the paper is that you have two types of values; if you have a categorical hyperparameter, then you can specify a list of different categories that you can choose from, and you can resample from that list every time you do a perturbation.
B
So
so
I
think
the
caveat
here
is
that
these
particular
parameters
that
you
can
perturb
are
typically
not
model
architecture
parameters.
The
reason
is
because
you
can't
easily
retrain
or
you
can't
leverage
it's
hard
to
change
the
model
architecture
during
training.
So
a
lot
of
a
lot
of
people,
just
just
don't
do.
B
Let's see: have you tried this notebook on GPUs? When I do something similar on GPUs with TensorFlow 2, I usually have memory accumulation problems, as the GPU memory doesn't clear after each parameter point evaluation. Yeah, so this is a great question. This notebook does work on GPUs, as far as I know, though I guess I'm typically using PyTorch.
B
I
would
say
the
reason
why
I
have
a
reasonable
prior
as
to
why
why
tensorflow
2
and
this
particular
notebook
would
work
in
practice
with
gpus
is
because
each
reactor
is
is
terminated
after
the
trial
is
done.
So
the
reactor
is
again
this
distributed
object.
It
runs
on
a
separate
python
process
and
the
memory
allocation
for
a
gpu
is
assigned
to
a
particular
python
process.
B
When
that
python
process
dies,
it
frees
up
the
memory
used
by
used
on
the
gpu
and
therefore,
typically,
we
don't
see
memory
leakage
across
off
across
different
tuned
trials
and
across
different
hybrid
point
evaluations.
B
Yeah, I think I'd be happy to answer more questions on Slack. Let me just get to some of the earlier questions, just in case someone feels that theirs was missed.
B
Yeah, so: does Ray Tune implement the semi-parallelized version of Bayesian optimization? The answer is yes; you can specify the maximum concurrency, and you can also connect it to a cluster and it'll automatically scale up the Bayesian optimization for you. And I guess I sort of answered this other question, which was: can you still obtain optimal convergence if you tune hyperparameters individually? So, yeah, again:
B
Typically,
your
hyperparameters
have
a
biased
weighting
of
importance
and
you'll
want
to
sort
of
tune
like
the
main,
most
important
parameters
that
you
can,
that
you
can
find
and
and
sort
of
the
interdependent
relationships
between
hybrid
parameters
matter,
but
probably
to
a
lesser
extent
than
the
most
important
default
hyper
parameters
or
like
the
most
important
hypercameras,
such
as
learning
rate
or
momentum.
B
So,
typically,
what
I
would
do
is
I
would
try
to
identify
interdependence
by
by
using
the
sort
of
parallel
coordinate
plots
and
if
I
still
can't
provide
or
yeah
I
would
yeah.
That's
probably
what
I
would
do
and
then,
if
I
identify
something,
that's
particularly
interesting,
a
a
you
know:
interaction
between
hyper
parameters,
I'll
probably
run
another
grid,
search
over
over,
like
a
selected
evaluation
of
the
hybrid
hyperparameter
space.
Just
to
test
some
hypotheses
about
the
interactions
with
the
hyperparameters.
B
How
do
we
know
who
is
the
best
performer
in
pbt?
This
is
mainly
just
you.
You
can
identify
the
the
lowest
performing
model
or
the
best
performing
model,
and
that
particular
model
is
corresponds
to
a
sequence
of
perturbations
through
through
the
training.
So
it's
not
a
single
trial,
but
rather
it's
not
a
single
high
priority
evaluation
but
sequence
of
high
primary
evaluations
and,
typically,
what
you
can
do
is
you
can
track
them
attract
this
over
time.
B
So
just
so,
I
guess
in
practice,
convergence
guarantees
are
a
good
prior
for
whether
or
not
the
optimization
method
is
going
to
work
in
the
first
place,
or
it's
going
to
be
useful
in
the
first
place.
Yes,
many
of
these
hyper-camera
tuning
model,
the
models
that
you're
treating
for
hyperparameters,
are
non-convex
and
and
so
convergence
guarantees
with
hype.
Software
for
lays,
like
optimization
methods
that
have
convergence
guarantees
or
rate
guarantees,
they're,
typically
not
they're,
they're
good
prior
but
they're,
I
guess
they're
not
definitive,
and
that
you
won't
necessarily
converge.
B
But
yeah,
what
do
you
think.
A
Yes, that sounds good. So save the rest of the questions, and you can answer them when you have time, on Slack. Okay.
A
This was very pedagogical, actually, at so many levels. I also enjoyed the demo that you ran. I think we also had so many questions and a lot of engagement from the attendees.
A
So
thank
you
again
richard
and
thank
you
everyone
for
joining
the
second
week's
lecture.
I
just
want
to
remind
you
again
that
we
have
a
lecture
every
week.
We
might
have
a
break
in
the
middle
on
some
days,
and
so
please
join
us
next
week
for
the
deep
generative
models
talk
by
aditya
grover
from
stanford
university,
and
so
just
so
that
you
know
we,
you
have
a
slack.
A
If
you
don't
know
about
the
slack,
we
have
slack
that
you
can
join
through
here
and
you
can
continue
the
discussion
on
particular
lectures
on
the
specific
channel
for
their
lecture
and
also
we
do
link
to
these
slides
and
the
video
later.
So
you
find
a
link
to
the
video
here
and
read
your
slides,
for
example,
and
there's
also
all
the
recordings
will
be
available
on
youtube,
hopefully
in
one
to
two
days
max
after
the
lecture
thanks.