Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
Okay, so yesterday we talked about how to build neural networks. We saw examples of that, and we also saw the ecosystem of frameworks like TensorFlow with the Keras API. We talked generally about neural networks: these are complex functions that we try to use to approximate relationships that we have in the data. Then we talked about how we find the weights of those functions, how we optimize them, and we also mentioned how to monitor the learning process. We mentioned the difference between optimization and learning.
Our aim is really to generalize performance beyond the training data set that we're looking at. Today we will focus mostly on how we actually do that in practice: how do we improve the generalizability of our network? So we'll talk about a few hairy issues that come up in practice.
I think, if you remember, Brenda said yesterday that one thinks we spend most of our time trying to find the best architecture for the problem that we're working on. But that is really not what we do.
We spend most of our time actually trying to make these networks work. Even after you find an architecture that makes sense, you need to get it to work, and this is not simple. So before I get into the details, I want to remind you again that no event like this is sufficient to actually get a grasp on all the nuts and bolts of doing deep learning in practice.
I strongly recommend looking at an undergraduate-level course such as CS231n, which has great lectures; I'll be using a lot of those lectures. There are also books on the market that are very helpful. There are academic ones, like the Deep Learning book by Goodfellow and colleagues, which I'm sure you have seen before; it's more academic in nature. And then there are practical ones...
...that Josh also talked about yesterday. Okay, so why does it matter that we actually get a grasp on how to train neural networks? Because we do think that the future of software engineering will feature a lot of deep learning and neural networks. We have already started seeing it in terms like Software 2.0, which is essentially building software that is powered by neural networks. We talked about deep-learning-based systems, where your entire software pipeline, your entire stack, is composed of multiple parts...
...and many of those parts are neural networks. So we do expect that in the future we want to have a disciplined development cycle for this type of software. So what characterizes Software 2.0? Let's step back and think of what Software 1.0 is. We always talk about it as rule-based software, right?
You all learned the if/else, the for loops, recursion, and all the algorithms that we use to build traditional software. The way we achieve tasks with traditional software is: we think about the actual problem that we're trying to solve, and then we come up with an algorithm. We can even write a flowchart for that algorithm, and that flowchart is rule based, right?
The development process of software like this is that you just come up with an algorithm at one point. With Software 2.0, you start at some point and then use gradient descent, an optimization process, to get to the software that you want, and we can actually achieve much higher complexity.
You've already seen a lot of tasks yesterday that are extremely difficult to write algorithms for, to perform with rule-based software, and as Andrej Karpathy hypothesizes, gradient descent can really write code better than you do. Okay. So how do we do this in practice? You've seen the xkcd that goes like this: "This is your machine learning system?" "Yup! You pour the data into this big pile of linear algebra, then collect the answers on the other side."
"What if the answers are wrong?" "Just stir the pile until they start making sense." Of course this is somewhat facetious, but in practice we do need to do a lot of stirring, and this stirring is not easy. You will see a lot of these issues today. Here is what we're going to cover: I'll talk about data normalization briefly...
...an important topic that is sometimes necessary. We'll talk about learning rate decay, and then we'll spend some time talking about regularization. After that we'll move on to talking about depth, and then I'll try to finish with a couple of things that we use in practice, things like transfer learning, and some practical tips. I hope I can get through all the slides. So, normalization: I think this was mentioned a couple of times yesterday, and you've also seen it in the practical sessions, that we usually normalize.
Most importantly, if you're looking at data like this, one way of doing normalization is to shift the mean of the distribution of your data to zero, and also maybe divide by the standard deviation, to standardize or normalize the different dimensions. It's important to remember that you don't really need to do this all the time. You only need to do it if you have reasons to believe that the scales of the different dimensions are not meaningful to your algorithm, if these two dimensions, for example, are equally important features. Otherwise you don't need to do this; you have to think about your problem. A different way of doing normalization is whitening: you do a transformation that diagonalizes the covariance matrix, projecting your data onto the eigenvectors, and then normalize, and you get whitened data. There are a lot of normalization methods that you can look at, but you really need to think about the data set that you're dealing with.
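As a minimal sketch of both variants in NumPy (the array shapes and scales here are made-up example values, not from the slides):

```python
import numpy as np

# Fake data: 1000 samples, 3 features on very different scales.
X = np.random.randn(1000, 3) * [1.0, 50.0, 0.1] + [0.0, 10.0, -5.0]

# Standardization: shift each dimension's mean to zero, divide by its
# standard deviation. Compute the statistics on the training set only
# and reuse them at test time.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mean) / (std + 1e-8)

# Whitening: rotate onto the eigenvectors of the covariance matrix
# (which diagonalizes it), then rescale each direction to unit variance.
cov = np.cov(X_norm, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
X_white = (X_norm @ eigvecs) / np.sqrt(eigvals + 1e-8)
```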
Okay, I'm going to move on to a slightly different topic: weight initialization. I didn't mention this yesterday; I kind of took it for granted that when we start our neural network, we start with random weights.
The weights of our neural network are completely random, but how do we choose that initialization? First of all, if we don't initialize them at all, if all the weights are zero, we know that there is no learning. You can think about it a little bit and look at the gradient descent update to realize that there will be no updates whatsoever. And if I initialize all of them with a constant value, there is no symmetry breaking:
all of them will be receiving exactly the same gradient update, and you are not going to get anywhere, right? So we need to break that symmetry with some distribution. The first thought is to use a normal distribution: just initialize with a normal distribution with some constant standard deviation. This is okay; it works for narrow or shallow neural networks.
However, it turns out that if you use that in deeper neural networks, the activations tend to go toward zero. I think it's also easy...
...to see this with pencil and paper. If you initialize, for example, with a normal distribution with a standard deviation of 0.01, the activations of your first layer are going to look like this; the activations of the second layer...
...are going to be a little bit narrower, and as you go deeper, by the time you get to the sixth layer, you're already at a 0.05 standard deviation of the activations. As we talked about yesterday, if the activations go to zero, the gradients also go to zero, because we are applying the chain rule in backpropagation. This is not a good idea; it's going to kill the learning very early on. Okay.
What if I start with a larger standard deviation? Maybe that could be a solution, but it turns out that it's not, because then the activations saturate to -1 and 1. This is a neural network with sigmoid activations, and it's just to illustrate the idea. What gets around this is Xavier initialization, which uses a normal distribution with standard deviation 1 over the square root of the number of input dimensions, and as you can see, it gets rid of the problem of these vanishing or narrowing activations. So we get around this and we can learn; things are fine. This works really well with the sigmoid activation, but it turns out it has a problem with ReLU networks, networks that have ReLU activations. If you get into the math and look a little bit at networks with ReLU activations...
...it's actually very easy to follow the math in this paper. I forget the citation for the paper, but it's essentially by Kaiming He and others. They show that for ReLU-activated networks, Xavier initialization also tends to have a problem with learning, so you'll get this sort of error curve: it just saturates and you don't actually get anywhere. So you need to use the Kaiming He initialization; in most frameworks this is called He initialization. So you need to look at this.
Here n is the number of input connections to the neuron, and what that is depends on the layer: if you're doing a fully connected layer, it's just the number of input neurons; if you're using a convolutional kernel, it would be K by K, the kernel size, times the number of channels. I didn't want to get into all the math here, I just wanted to give the big picture, but you can see that as soon as you switch to the He initialization, the network actually starts learning.
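As a rough sketch of the effect being described (this is not the exact experiment on the slides; the width, depth, and data here are arbitrary assumptions), you can watch the activation spread collapse under a small constant-scale init and stay stable under He init:

```python
import numpy as np

n_units, depth = 512, 6
x = np.random.randn(1000, n_units)

for name, scale in [("0.01 * N(0,1)", lambda fan_in: 0.01),
                    ("He: sqrt(2/fan_in)", lambda fan_in: np.sqrt(2.0 / fan_in))]:
    h, stds = x, []
    for _ in range(depth):
        W = np.random.randn(n_units, n_units) * scale(n_units)
        h = np.maximum(0.0, h @ W)      # ReLU layer
        stds.append(h.std())
    print(name, ["%.4f" % s for s in stds])
# The first scheme drives the activation std toward zero layer by layer
# (and the gradients with it); the He scheme keeps it roughly constant.
```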
There was a blog post, I think I saw it last week, where somebody pointed out that the default initializer in Keras is actually Xavier initialization, so as soon as you get to more than a few layers in a ReLU network, the network can stop learning. So you need to pay attention: there are default values for all of these things in your framework, and Keras especially kind of assumes stuff, so you need to actually pay attention to what is going on, because this can be frustrating.
You can spend a lot of time not realizing that it's assuming some initializer. [Question from the audience] Yes, please. This is, I think, the validation error, but it could be the training error; I'm not sure. The idea is that you stop getting updates to the network weights at all, so it stops learning. [Another question] I don't think that this is ResNet or anything like that, but I could be wrong.
Okay, actually, I think this came before ResNet; this paper is by the same people who came up with ResNet. Okay, so one thing about initialization is that we tend to think: okay, that problem has been solved, it has been solved since 2010 or something, and we're moving on.
However, it turns out that initialization is way more tricky than just that. There is a series of recent papers that explore essentially the effect of initialization on the performance of the network, and the relationship between initialization and generalization. This is a recent paper from last year, and there's a lot of follow-up work this year; I'm going to try to give a very high-level overview of what is going on here.
We want a network that is much smaller, because a smaller network is much faster to use in inference mode. We also know that if we look at all the neurons in the network after the training finishes, a lot of those neurons are not necessary: they're dead already and they're not really doing anything. So we apply something called pruning.
Essentially, we take the original network after full training, and then we prune it, and we get a network like this that performs equally well compared to the initial network. So the question that these authors explored is: why can't we just start with that pruned network, with exactly the same initialization that we drew the first time, and see where we get?
There is one subnetwork that actually ends up being the network that does the work when you're doing inference using the full network, and this is an important idea, because it points to the possibility that we are building these extremely large networks because we have very bad initialization. If we figure out how to initialize our networks better, we might be able to build much smaller networks that achieve the same performance on the tasks that we're looking at. There is a recent paper, actually, I think, yeah...
...from June, just last month, where essentially they explored the same idea. They took a network, initialized it, trained and pruned it, and then used the same initialization to initialize many other networks, to see if that set of initial parameters generalizes and helps us train other networks to get the same performance. I encourage you to look at this paper; it's very interesting. But more importantly, this is an active area of research. It's possible...
...that within the upcoming months to a couple of years, we figure out a way to initialize our networks better, and then we can build smaller networks to perform the same tasks. [In answer to a question] Yes; I'm not an expert on that, so I haven't done a lot of it, but I think most of what they do is look at the effect of the participation of certain neurons in the final decision on the accuracy of the network.
If killing a connection does not affect the final performance, then that connection is not necessary, so you essentially remove it. Yes, this is exactly what this paper is doing. Quoting: "we found that, within the natural images domain, winning ticket initializations generalize across a variety of datasets," and moreover, "winning tickets generated using larger datasets consistently transferred better than those generated using smaller datasets."
Actually, one of the authors is Michaela; she's one of the organizers, and I think she will be here today, so you can also talk to her about this. At least in that paper, the case study was on image classification.
Okay, so initialization, for sure, is not simple. You need to think about the default parameters that you have in your network, and soon, it's possible, we might find a way to have better initializations for our networks.
If we start with good parameters, we might be able to converge much faster to a good-performing model. Okay, now I'll move on to talk about a different topic, which is learning rate decay.
When you think of gradient descent, the picture of gradient descent is always that we're using the gradient information to try to get to a point that we call a minimizer, a point at which the loss function is low.
What's a minimizer? At least in the classic picture of convex optimization, the minimizer is a place where the loss surface is essentially flat, meaning the slope tends to go to zero. But remember that we're using stochastic gradient descent. So even as we are getting closer to a minimizer and the full gradient is going to zero, decaying by itself, the stochastic gradient is not: the mini-batch noise does not go away on its own.
There is a classical result that shows that if you want to use SGD, it's sufficient if these conditions are satisfied. Epsilon here is the learning rate, and the conditions are: first, the sum of the learning rates along all the steps of my optimization equals infinity (sum over k of epsilon_k = infinity); and second, the sum of the squared magnitudes of those steps is finite (sum over k of epsilon_k squared < infinity).
Intuitively, the first condition says that if I start from a completely random initialization point, no matter where, with the number of steps that I am taking I'm guaranteed that I can reach the minimizer wherever it is; I have infinite range to get to it, right? The second condition intuitively says that if I get close, I will be able to converge to that point. I'm not just going to be...
...swinging around that point; I'll actually be able to converge to it. So how do I achieve this in practice with SGD? We do learning rate decay. Essentially, we decay the learning rate as we go along, so in practice you'll hear about something called a learning rate schedule, and that's what you will be using and thinking about.
There are a lot of different types of learning rate schedules. People use linear decay, or exponential decay, or cosine, or inverse square root, usually as a function of the step, or more often as a function of the epoch, the number of passes that you have through your data. For example, linear decay would be: you have an initial learning rate multiplied by (1 - t/T), where t is the epoch number and T is the total number of epochs, and you control this by thinking about...
...what final learning rate you'd want to have. You can see that this is a decaying function.
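As a minimal sketch, the linear schedule just described is a one-line function (the initial rate and epoch count are values you would tune per problem):

```python
def linear_decay(lr0, epoch, total_epochs):
    """Linear learning rate decay: lr0 * (1 - t / T)."""
    return lr0 * (1.0 - epoch / float(total_epochs))

for epoch in range(0, 100, 20):
    print(epoch, linear_decay(0.1, epoch, 100))  # decays from 0.1 toward 0
```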
The basic idea, which Josh also touched on yesterday, is that with the networks we are looking at, we're not trying to get to a global minimum. There is a large number of good minimizers of the entire loss function on that data set, an abundance of minima that are equally good, and there are results showing that those minima all actually give good performance.
Now, a more practical way of doing this, which touches on your point, is, instead of trying to decide a priori how you want to decay your learning rate, to monitor your loss function, or monitor the performance on some validation data set, and only reduce the learning rate when you've stopped learning with the current learning rate. So if the loss here tends to get to almost zero slope, you reduce the learning rate; if you get to another plateau, you reduce the learning rate again.
This is actually the plot of training ResNet, and the decay here is by a factor of 10: they divide the learning rate by 10 at each point. In the frameworks that you use, there is something called ReduceLROnPlateau. This is a callback that you can add, and then you can decide the patience, which here would be the number of epochs that you would wait before you decide to decay the learning rate. So here, for example, it doesn't decay immediately.
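A minimal sketch of using that callback in Keras, assuming a compiled model and validation data already exist (the factor and patience values are example choices):

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(
    monitor="val_loss",  # watch the validation loss for a plateau
    factor=0.1,          # divide the learning rate by 10, as in the ResNet plot
    patience=5,          # epochs to wait without improvement before decaying
    min_lr=1e-6)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[reduce_lr])
```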
[In answer to a question] The basic gist of that paper is that the more complex the data set that we're applying our models to, and the more complex the models themselves, the larger the batch size that you can actually use, and having a larger batch size helps you parallelize the optimization of your model. Which is a good moment to mention that there is another way to do things: rather than decaying the learning rate, you can do something else.
One way to do this is, instead of decaying the learning rate itself, to decide: okay, let's just increase the batch size. There's a bunch of papers that explored this idea; one of them has the very interesting title "Don't Decay the Learning Rate, Increase the Batch Size", and they show exactly this.
They show that by either decaying the learning rate or increasing the batch size, you can achieve almost exactly the same loss curve, the same training curve. They even have a hybrid approach, where they decay the learning rate while increasing the batch size, and they can achieve the same thing. If you do this delicately, you can achieve the same result, and the paper that you mentioned, by OpenAI, explores this idea further and actually derives a lot of relationships between batch size, learning rate, and the optimization progress.
That's a good question. The basic idea is that we're trying to reduce, to anneal, the noise, so we can decay the learning rate; but if I increase the batch size instead, I can parallelize my process and train faster. If my batch size goes from, whatever, 10 to 100, then instead of using one GPU I can use 10 GPUs at a time, and then I finish the training faster.
So it's about the wall-clock time, finishing the training faster. Thanks for the question. Other questions? Okay, so now I'll move on to talk about regularization. Remember, yesterday we said that if we monitor the training error and we monitor the validation error, there are multiple regimes, and we said that we want to get out of underfitting as soon as possible; this is usually easy to get out of, and there are multiple ways of doing it.
Then we spend most of our time in the regime where we're essentially trying to push this point along as much as possible, where we're trying to reduce the error on an unseen data set, the validation dataset. We talked about this as generalization. So how do we do this in practice? We use regularization.
You could have a label y here, if you are dealing with a supervised learning setup, and this is on a mini-batch. What we do is add a regularizer to the loss function. The regularizer is usually in this form: you have some penalty on the norms of the weights, and the job of this term is essentially to say: don't fit too well to the training data; we don't want to overfit to the training data itself.
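Written out in the standard form (N is the mini-batch size, lambda the regularization strength, and R the penalty on the weights w):

```latex
L(w) \;=\; \frac{1}{N}\sum_{i=1}^{N} \ell\big(f(x_i; w),\, y_i\big) \;+\; \lambda\, R(w)
```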
There are many ways of thinking about this, and it's not special to neural networks or deep learning; this is in the context of statistical learning in general. You can think of it as a way of choosing a simpler hypothesis over an extremely complicated hypothesis that overfits the training data very well. A few terms: this lambda coefficient is the strength of the regularizer, which you want to tune as well; it's a hyperparameter.
The penalty that you apply is applied to the weights, not the biases. The rationale is that we have a small number of biases anyway, so we don't really need to regularize those, and we need the biases to be free, because if there are any shifts in the activations or the data, we want to be able to capture them. In practice, if you try to regularize the biases, you tend to underfit your training data anyway. So, types of regularizers.
You've probably seen an L1 or L2 regularizer. With L1, you essentially add to the loss function the sum of the absolute values of the weights, and this tends to create sparse representations. So if you have reasons to believe that your representations should be sparse, or if you tried it and it turned out to work well, then this is the thing to use. It's easy to see why this creates sparse representations:
the penalty here is on the size of the weights themselves; whether the weight is 1 or 10, it's penalized, and if it's not 0, it's penalized. So it is actually trying to force the network to learn a sparse representation: only have a nonzero weight if it's really helping the optimization. Another type of regularization, which you see more often, sometimes called weight decay, is where you add to the loss function essentially the squared norm of your weights, and this is a different type of regularizer.
It only says: don't have too-large weights. If W is small, the penalty is not that strong, but if W is very large, the penalty is strong. There are also connections to Bayesian methods: you can think of this as a Gaussian prior on the weights. If you have reason to believe that your weights are normally distributed, taking the log of a Gaussian gives you this W-squared term.
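A minimal sketch of attaching these penalties to layers in Keras (the lambda values are placeholders to tune; note that only the kernel, the weights, is regularized, not the bias, matching the rationale above):

```python
from tensorflow.keras import layers, regularizers

# L2 / weight decay: penalizes the squared size of the weights.
dense_l2 = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l2(1e-4))

# L1: penalizes absolute values, pushing weights toward exact zeros.
dense_l1 = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l1(1e-5))
```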
In practice, we tend to use L2 a lot. There are other types of regularization, things like noise robustness: you want your network to not be too sensitive to the exact values of the weights themselves, so you add some noise to the weights; you want the decision to be independent of small perturbations.
Okay, so, yes: early stopping. We mentioned yesterday that early stopping is something that you want to apply all the time. Essentially, you monitor your validation error, and as soon as your validation error starts climbing, that's where you want to stop, because this is where your model has the best performance. And it is a type of regularization: if you look at the Goodfellow book, you'll see a connection, in some simplified setup, between early stopping and L2 regularization. Essentially, the way that you can think about early stopping is that, like L2, it says: don't wander off too far from the initial parameters. So this is a type of regularization that we use in practice.
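A minimal sketch of early stopping as a Keras callback, assuming a compiled model and a validation set (the patience value is an example choice):

```python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",          # stop when the validation error starts climbing
    patience=10,                 # epochs to wait past the best value
    restore_best_weights=True)   # roll back to the best-performing model

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=[early_stop])
```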
Dropout: you have seen this in the networks yesterday. The basic idea of dropout is the following. We have a fully connected network, and this is what we are training.
The idea of dropout is to randomly drop connections while you're doing the training of your network, and the basic intuition is that maybe, if I randomly drop these connections, I will discourage the network from co-adaptation, where some neurons fire only when other neurons are firing. Again, these are all intuitive pictures that are helpful, but they break sometimes. So another way of thinking about it...
it.
Is
that
your
forcing
the
network
to
not
to
rely
too
much
on
certain
representations
to
be
able
to
make
its
decision
right?
A
So
you're
forcing
it
to
rely
on
multiple
sort
of
representations
to
be
able
to
reach
the
same
same
decision,
and
what
you
do
is
that
in
at
inference
time
you
use
the
full
network
and
then
you
can
look
at
the
matter
a
little
bit
and
then
you
will.
You
can
Reba
rebalance
or
renormalize
the
output
so
that
you
essentially
cancel
out
the
property,
the
this
probability
of
dropping
the
network
connections.
Another way to think about this is that you're training, instead of one network, an exponentially large ensemble of networks, right? When I'm randomly dropping out connections all the way through my network, I have a random sample every time; I'm training a different subnetwork.
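A minimal NumPy sketch of "inverted" dropout under these ideas, with an assumed keep probability p. Rescaling the surviving activations by 1/p at training time is one common way to implement the renormalization mentioned above, so the full network can be used unchanged at inference:

```python
import numpy as np

def dropout_forward(h, p=0.5, training=True):
    """h: activations; p: probability of keeping each unit."""
    if not training:
        return h                                # inference: full network, no mask
    mask = (np.random.rand(*h.shape) < p) / p   # drop units, rescale survivors
    return h * mask                             # a different subnetwork each call
```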
In practice, this tends to work really well. You can see: this is the loss of training a certain network without dropout, and this is with dropout; you can see how it reduces the classification error. Where do you insert dropout?
[Question from the audience] Yes, that's exactly how you should do it. Essentially, you're training these subnetworks, right, but at inference time you want to average the decision of this ensemble of subnetworks.
You want to pay attention to that when you're building your network: you can sometimes get different results with dropout if you don't pay attention to what you're doing. If you're using dropout in training mode at inference time, you will get a different result than when you're using it correctly, and the rescaling factor is essentially the dropout probability. Yeah, thank you.
It could very well be that if you train a network without dropout, more neurons would die, but if you train it with dropout, more neurons would actually stay alive, and then your pruned network would be smaller. It could very well be; I see your point, there might be a connection. That's a good question.
So, this was my next slide: I do think that dropout is not necessary in between the convolutional layers, because there we're trying to learn filters and feature extractors, and those usually have a much smaller number of parameters than the dense layers.
So I think that it doesn't make a lot of sense to apply dropout there, and I put here an example of one possible way of placing dropout: look only at the classifier, the last dense layers, and put some dropout in there. That's it. It might be that if you put dropout elsewhere in your network it performs better; I haven't seen a lot of results on that.
Is this idea clear, where to put the dropout? Not clear? Okay. So another form of regularization is data augmentation. We mentioned yesterday that the best way to improve the performance of your network is to collect more data, right? If you can do that, that's awesome. If you can't do that, or even if you can, you might still want to also apply some sort of data augmentation. Let's think of the context of object recognition, when we are applying convolutional neural networks to object recognition.
We know that the decision of the network should be independent of, or invariant to, for example the orientation of the objects, or the color or hue of the objects, or mirroring the objects, whatever sort of transformation: the cat should still be the same cat, right? So one way of forcing the network to learn that is to augment the data set by applying these transformations randomly to the original data set as I am training. Of course, you don't want to apply that during test or validation, but during training...
...this is what you do. This is something that we do a lot in practice, and it improves the performance of all models. You want to make sure that whatever transformations you're applying actually make sense for your data and genuinely increase the size of the training dataset. Oh, you can't see it; this is a Z.
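A minimal sketch of random augmentation during training with Keras; the specific transformations here are assumptions, and they are only valid if your labels really are invariant to them:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_gen = ImageDataGenerator(
    rotation_range=15,       # small random rotations
    horizontal_flip=True,    # mirroring
    width_shift_range=0.1,   # small translations
    height_shift_range=0.1)

# Augmentation applies only to training batches; validation and test
# data go through unaugmented:
# model.fit(train_gen.flow(x_train, y_train, batch_size=32),
#           validation_data=(x_val, y_val), epochs=50)
```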
There are a lot of things that we do to help make the optimization process easier while we're training neural networks, and some of them end up being implicit regularizers. There is some contention about the word "implicit" here, but they are not explicit; they're not just added to the loss function. One of them is called batch normalization.
The essential idea of batch normalization: look at this network. You have a hidden layer, the hidden layer gives some activations, and the activations go into the next layer. The main idea was that if I'm updating both layers at the same time, this layer is being updated to respond to the current gradients, but at the same time I'm also updating the previous layer.
Once I update the previous layer, the activations of the previous layer shift, and once they shift, this layer has to relearn how to respond to the new activations. This picture is basically the idea that people had in mind when they came up with batch normalization. There are a lot of reasons, from recent work, to believe that that is not exactly how batch normalization helps; I'll talk a little bit about this, but that was the initial idea. You can also think about it...
...from a different point of view. We talked about normalizing the input to the entire network; batch normalization tries to normalize the activations themselves, so that every one of these layers receives the same distribution of data, rather than just the very first layer receiving normalized data. The way that you do this, and this is directly from the original batch normalization paper, is that you take the mean along the batch...
...and the standard deviation along the batch, and then you normalize the activations of the layer by subtracting the mean and dividing by the standard deviation plus an epsilon, for numerical stability. This is great; it normalizes the outputs of every layer. However, that severely restricts the capacity of the network, and maybe the network doesn't really want a completely normalized distribution, so you allow for those shifts by having learnable parameters, gamma and beta, for rescaling.
You essentially rescale the activations by gamma and shift them by beta, and gamma and beta are learnable parameters, so you actually update those with gradient descent.
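A minimal NumPy sketch of the batch-norm forward pass just described (training mode), including the running statistics that are kept for test time, which comes up again below:

```python
import numpy as np

def batchnorm_train(x, gamma, beta, run_mean, run_var,
                    momentum=0.9, eps=1e-5):
    """x: (batch, features) activations; gamma, beta: learnable."""
    mu = x.mean(axis=0)                     # mean along the batch
    var = x.var(axis=0)                     # variance along the batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize (eps for stability)
    out = gamma * x_hat + beta              # learnable rescale and shift
    # running averages, used instead of batch statistics at test time
    run_mean = momentum * run_mean + (1 - momentum) * mu
    run_var = momentum * run_var + (1 - momentum) * var
    return out, run_mean, run_var
```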
[In answer to a question about augmentation] I think it will depend on the problem that you're looking at. A lot of people do augmentation by taking random crops and then generating texture around the random crops, and that works for the problem that they're looking at, yeah.
If it doesn't make sense for your data, you definitely shouldn't do it: you're teaching the model the wrong thing. In your case, you want it to make sense of the order; you actually want it to use that information as a clue to what is going on. If you're removing that, you're removing an important feature from your data. You don't want to break essential features or essential correlations in your data.
Okay, so batch normalization: in practice it helps with a lot of things. First of all, remember the problem of vanishing gradients that we talked about when we build deeper networks: when we multiply gradients along the layers, the gradients tend to vanish by the time they get to the very early part of the network. We'll talk about this in a little bit more detail, but the important thing here is that batch normalization helps with this, simply because the distributions are now normalized.
It also, empirically, seems to be that batch normalization makes the network less sensitive to some hyperparameters, so we can use higher learning rates, and that leads to faster convergence; we're taking larger steps as we're doing the gradient descent. It also tends to make the networks less sensitive to the initialization: you don't have to do so many restarts to find a good initialization for the network. I talked briefly about this idea of the shifts of the distributions between the layers, and that was the original motivation for batch normalization.
However, there is work from the end of last year, it was at NeurIPS last year, which shows that this idea of shifting distributions, of internal covariate shift, might not be accurate. They show that empirically, and then they try to argue that batch normalization accelerates the optimization process by making the loss landscape essentially smoother.
They look at different measures of essentially the smoothness of the gradients and how much noise there is, and they show that the landscape is a bit smoother when you use batch normalization. I'm not sure if there is an update on this picture, but that's one of the ideas. Two things to remember. The first one is that batch normalization is an implicit regularizer, so it does affect the capacity of your network. I don't think that there is an explicit way of measuring the impact of batch normalization on the capacity of the network, but it does affect it.
does.
The
other
thing
to
remember
is
that
bachelor
ization
behaves
differently
during
training
and
test
time,
so
during
training
we're
using
the
batch
statistics,
the
mean
and
the
variance
of
the
batch
to
do
the
normalization,
but
during
test
time
we
don't
want
to
use
that,
because
the
test
dataset
might
have
different
statistics
than
what
we
trained
on.
So what we tend to do is accumulate running averages of the training-batch means and standard deviations, and use those during test time. This tends to be one of the very common bugs that you hit when you're dealing with batch normalization in code. Batch normalization normalizes along the batch dimension: if you're looking at a four-dimensional tensor input to the network, this N is the number of examples that you have in your mini-batch, and batch normalization normalizes along that batch dimension.
There are all sorts of other types of normalization that don't normalize along the batch dimension: layer norm, instance norm, and group norm. You can see more of this in the group normalization paper. The basic idea is that these other types of normalization do not depend on the batch, and that's nice, because you have the same normalization during training and test time, and it also helps when you're doing distributed training; Thorsten might talk about this on Friday. So batch normalization is for the activations in the network, and data normalization is for the data itself.
[Question from the audience] Yeah, that's a good question. The mini-batch size is related more to stochastic gradient descent. We mentioned yesterday that the noise in stochastic gradient descent tends to help, so most of the time, using a small batch size gives you a model that generalizes well.
However, and the common practice is not to use a batch size larger than 32, it could be that the learning process is extremely slow with that batch size, because you have to take a small learning rate and there's very large noise. To get beyond that, you want to increase the batch size. There are all sorts of things that you want to look at here, and I think Thorsten will go through this on Friday, but generally: a smaller batch size gives you better generalizability, and a larger batch size...
...is what you want when you want to train your model faster, but there are caveats on this. [On where batch normalization and dropout go] I think I've seen things like this on Stack Overflow or somewhere, but I'm not sure; I didn't do any experiments myself, and I haven't read any concrete, solid papers on this topic. But you can think of one thing:
batch normalization is applied in the layers in between the convolutional layers, or after a block of convolutional layers, while dropout is applied in the classifier, on the dense layers. So they live in different domains; they're not usually applied right after each other, at least in the networks that I have looked at. They're in different parts of the network.
It really depends. There are people who swear by putting it before the activation, and people who swear by after the activation; whatever works for you.
Okay, so another thing that improves the performance of your network is the idea of ensembles. The basic idea is that, instead of using one version of our model at inference time, we use multiple versions of it, an ensemble of models, and then average the results of that ensemble. It tends to give about two percent extra performance in practice.
This is an empirical result. It's easy to do this when you have a shallow model in traditional machine learning: you actually train multiple versions of your model and then average the predictions. But it can be really expensive in deep learning, especially if you're building a very big model. So another way to do it is to use multiple snapshots of the same model during the training process.
You can do this in various ways: save different checkpoints after you get to the region where you start getting satisfactory performance; you can save multiple checkpoints of the same model and use an ensemble of checkpoints to do the averaging. Another way is to keep a moving average of the actual network parameters, which is called Polyak averaging; it's another way of doing ensembling. If you're applying networks in practice, you might really want to test this: it gives two percent, and that's not trivial.
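A minimal sketch of the checkpoint-ensemble idea in Keras; the checkpoint file names are placeholders for snapshots you saved late in training:

```python
import numpy as np
from tensorflow.keras.models import load_model

checkpoint_paths = ["ckpt_epoch080.h5", "ckpt_epoch090.h5", "ckpt_epoch100.h5"]

def ensemble_predict(x):
    """Average the predictions of several snapshots of the same model."""
    preds = [load_model(path).predict(x) for path in checkpoint_paths]
    return np.mean(preds, axis=0)
```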
Okay, so now, in the last 20 minutes, I want to move on to the idea of depth, another issue that we have. If you look at the winners of the ImageNet competition since 2012, and they're all deep learning networks anyway, you will see a common pattern: the number of layers used in the winner's network has been increasing. 2012 was the first deep-learning winner.
This idea is related to what we talked about yesterday: at least right now, when we look at what these networks are learning, they tend to learn a hierarchy of features, or a hierarchy of filters, where the very early layers learn edges and blobs and very simple motifs, and then, as you go to deeper and deeper models, the layers start to activate on more abstract sorts of concepts.
So it seems that building deeper networks tends to encourage the network to build a longer hierarchy of filters. That said, take all of that with a grain of salt; these are hand-wavy sorts of arguments. But the important result is that deeper networks tend to perform better in practice.
You can see this with simple examples, maybe these two plots. For example, this is a test of the performance of a network as the number of layers increases, and you can see that as the number of layers increases, you get better and better performance. If you don't believe this, you might want to argue that it's performing better simply because the number of parameters is increasing.
That's not the whole story, because if you look at the accuracy on the y-axis versus the number of parameters, maybe looking only at the blue and red curves, then at the same number of parameters, about 200 million parameters, if you put them in 11 layers you tend to get 2 percent or more better performance than if you put them in 3 layers.
So you can do these sorts of experiments and see that, with the same number of parameters, if you reconfigure them into a deeper network, the network tends to perform better. Now, that's the result, but in practice, training deeper networks tends to be more challenging. You can look to the left here: these are the training errors of a 56-layer network and a 20-layer network.
The 20-layer network tends to get a much better training error than the 56-layer network, and if you think this is overfitting, that's not actually true, because you can see the same thing on the validation error. One of the reasons that training deep networks is difficult is what we mentioned yesterday: vanishing gradients. We have to use backpropagation, and then we're multiplying a very long chain of gradients, and if the activation distributions along the way get narrower and narrower...
...that seems to kill the gradient, the gradient flow back to the early layers; and sometimes you also get exploding gradients, in recurrent networks especially. This idea of making the gradient updates travel all the way back to the early layers is extremely important for being able to train deeper networks.
The optimization process becomes difficult, and part of it is the vanishing gradient that we talked about earlier, so the idea people came up with is to use something like a gradient highway. There was a paper on highway networks before, and this is a follow-up paper: the residual network, ResNet, that we talked about. They came up with this idea:
why don't we have the identity mapping as part of the construction of the actual network? So instead of just having layers stacked after each other, instead of trying to learn the full output of a block, we try to learn the residual of the output. What we are trying to get is H(x), and we compute it as F(x), the output of this block, plus x.
So this block will not try to pass the full x through and learn everything useful about it, which is H(x), what we really want; it will only try to learn the residual, which is F(x). And this way, all of a sudden, we have a highway for the gradient to flow, right? The gradient can flow from the loss function, which is here, all the way back to much earlier layers, and immediately people started training networks of hundreds to thousands of layers, just sidestepping that vanishing gradient problem.
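A minimal sketch of a residual block in Keras, in the spirit of (though not copied from) the ResNet paper; it assumes the input x already has `filters` channels so the element-wise addition is shape-compatible:

```python
from tensorflow.keras import layers

def residual_block(x, filters=64):
    f = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    f = layers.Conv2D(filters, 3, padding="same")(f)  # this branch learns F(x)
    out = layers.Add()([f, x])             # the highway: H(x) = F(x) + x
    return layers.Activation("relu")(out)
```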
[Question from the audience] Yes, that's a very good question. I think in this case it was an element-wise addition, but there are two closely related concepts, skip connections and residual blocks, and one of them, I can't actually remember the terminology for which one, concatenates the filters instead.
When this paper came out, they won all the competitions, all the sub-problems of the ImageNet competition in 2015, by an extremely large margin: ImageNet detection, 16 percent better than the second-place winner; 27 percent better on object localization; and then the other competitions, COCO detection and COCO segmentation, by 11 and 12 percent. This is ridiculous. Okay.
So with this idea, in 2015, ImageNet classification with ResNet achieved better-than-human error: this ResNet performs better on an unseen data set than a group of humans.
Mind you, the images are fairly low resolution, and they are taken with real lighting and all sorts of real-life conditions, so they can actually be confusing to people. I'm not sure how exactly the human test was performed, whether there's a time constraint or something, but there is an irreducible human error here.
Has ResNet resolved the entire problem of vanishing gradients? There is another way to look at it: it has actually shortened the effective path from the loss function all the way to the early layers. If you are interested in this, you can look at this paper: essentially, you can see that the effective path length from the loss function to the early layers is about nineteen layers, instead of the entire stack of layers that you have. I want to touch on a couple of topics...
...actually, maybe one topic, before we finish: transfer learning. Unfortunately, we don't have a talk on transfer learning; the speaker couldn't make it, and this is a very important topic in practice. When you're training neural networks, you don't always have millions of images, millions of labeled data points, to be able to train a ResNet from scratch, and you also don't have the time or the computational resources to be able to do this from scratch.
So we use something called transfer learning, and there is a closely related concept called domain adaptation: you might be training on a certain data set, but in reality you are applying your network on a slightly different domain, or a slightly different data set, than the training data set. How do you actually deal with that problem of domain adaptation? It's a closely related topic, and it's very important in practice.
Maybe the main thing to say about this is that in traditional machine learning, we do feature extraction by hand, and then we apply a classifier, whatever model that is, an SVM or maybe a shallow neural network, and then we get our output. In deep learning, we're doing this end to end: our neural network does both the feature extraction and the classification. So all of those convolutional layers that we had in our network are feature extractors, and then the dense layers are the classifiers.
So the idea of transfer learning is that, when you want to train on a small data set, or even a somewhat big data set that is still not large enough to train an entire ResNet or VGG network, you reuse those feature extractors. Essentially, you get a pre-trained neural network and you keep the convolutional layers, because these are feature extractors; you think they're useful features, they build a hierarchy of concepts and abstractions. And then you retrain the classifier.
You retrain the last layers, which are these three layers here. If you don't have a lot of data, you could train only the very last one; if you have a little bit more data, you can train two of them, or you might be able to train all three of them. If you have a larger data set, then, starting from the pre-trained network, you can also fine-tune your feature extractors, the convolutional layers, if you want and you have enough data to do that. This is very useful in practice; in fact, there's this slide in CS231n:
transfer learning is pervasive; it's actually the norm, not the exception. In reality, you have an idea, you're sitting down with your friend, and you want to test it after lunch. You're not going to spend ten days to test a very simple idea, right?
What you want to do is get results by the end of the day, right? So you use transfer learning. This is done everywhere in practice; this is what people do, and the frameworks encourage it. You looked at that yesterday: it's very easy to do this. It's just one line to get a pre-trained network, any of these networks, you can get them in one line, and then removing parts of the networks is also super easy in practice.
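A minimal sketch of the whole recipe in Keras: one line for the pre-trained feature extractor, freeze it, and bolt on a new classifier (the input shape, head sizes, and 10-class output are placeholder assumptions):

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# One line to get the pre-trained convolutional feature extractors.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False          # freeze the feature extractors

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(10, activation="softmax")])  # new classifier for your classes

model.compile(optimizer="adam", loss="categorical_crossentropy")
# With more data, set base.trainable = True (and a small learning rate)
# to also fine-tune the convolutional layers.
```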
A
There
is
a
paper
that
showed
last
year,
that
is
this
extremely
necessary,
and
the
answer
is
no.
If
you
have
enough
data
set,
you
don't
really
need
to
get
to
the
same
accuracy
that
you
would
get
with
a
pre
trained
network.
You
don't
need
to
to
use
transfer
learning.
So
essentially
these
the
two
curves
here.
This
is
the
the
accuracy
of
a
ma
of
resident
model
that
the
two
curves
are
the
in
magenta
is.
The
random
initialization
in
gray,
is
with
retraining
pre
training,
which
is
the
transfer
learning
you
can
see.
...that if you have enough time, you know, nothing else to do, and you have a large enough data set, you will get the same accuracy. However, if you don't have enough time, the pre-trained network tends to get to a better accuracy much faster than the randomly initialized one. And if you don't have a big data set, you don't have an option anyway: you have to use transfer learning. Okay, yeah, the gray curve: I actually don't remember; I read this paper at the end of last year. But that's a good question.
I think they retrain only the last dense layer, and I think in this particular ResNet there's only one last dense layer, and they don't fine-tune the convolutional layers. Or it could be that whatever batch size you use is still much less than the maximum batch size that you could use. This is related to the paper that was mentioned this morning...
...by OpenAI: as the training goes on and you get to smoother parts of the loss function, you can use much, much larger batch sizes, although there is a ceiling to that. They do some derivations for the maximum batch size, but it tends to be a very, very high ceiling. Oh, that's the learning rate decay; that's a good point. Remember, we were talking about learning rate decay very early on. These are the points where they decay the learning rate, and then the curve jumps.
Sorry, I shouldn't present papers that I read six months ago. Okay, so, two other topics that are important in practice. Hyperparameter optimization: we've mentioned so many hyperparameters that we have in these neural networks, and you might want a principled way to do that type of hyperparameter optimization. There's a talk tomorrow by Ben Albrecht, I think he's here, and hopefully you'll be there for that.
He'll talk about the difference between doing grid search and random search, and maybe also some other optimization methods, like Bayesian optimization, and he will also show how to do that with a particular framework; there are frameworks to do this that you might want to look at. If you don't have a lot of time to do hyperparameter optimization, there are a few parameters that are extremely important to tune. The first one is the learning rate.
If you're using Adam, you're probably less sensitive to the exact value of the learning rate, but if you're not using Adam, you probably want to do at least learning rate tuning, and you want to do that using random search rather than grid search, because grid search is wasteful. You will hear more about this tomorrow.
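A minimal sketch of random search over the learning rate, sampling on a log scale; `train_and_evaluate` is a placeholder for your own training run returning a validation score:

```python
import numpy as np

def random_search_lr(train_and_evaluate, n_trials=20, low=1e-5, high=1e-1):
    best_lr, best_score = None, -np.inf
    for _ in range(n_trials):
        # sample log-uniformly: uniform in the exponent, not in the value
        lr = 10 ** np.random.uniform(np.log10(low), np.log10(high))
        score = train_and_evaluate(lr)
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr, best_score
```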
A couple of training tips. We talked about things like the initialization and how the initialization can prevent the learning.
If you have the wrong initialization, and the way that we at least illustrated that was by showing the distributions of the different activations, it turns out that in practice, when you're actually debugging your model and trying to find out why it's not working, this is a very good way of finding out whether there is a learning impediment somewhere.
Yeah, I didn't have time to actually find plots for this, so sorry. Another thing that you might want to look at is to watch the update scales for your weights. (The wording on the slide is not quite right; what you monitor is the update scales, the gradient updates divided by the weights.) You want your updates divided by the weights to be somewhere between one part in a thousand and one percent of the weight.
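A minimal sketch of that check, assuming you have access to a weight matrix W, its gradient dW, and the learning rate:

```python
import numpy as np

def update_ratio(W, dW, lr):
    """Norm of the parameter update relative to the norm of the weights."""
    return np.linalg.norm(lr * dW) / (np.linalg.norm(W) + 1e-12)

# Rule of thumb from the talk: aim for roughly 1e-3 to 1e-2. Much larger
# and the steps overwhelm the weights; much smaller and learning stalls.
```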
That's just a rule of thumb, but you really want to see that the updates are not much larger than that, not so extremely large that they are overwhelming the weights and throwing the whole thing off. That's another thing that is very useful to monitor in practice.
Another check, towards the end of your training, is to see if your network is good enough, and this has worked as one way of inspecting the quality of the final network: you visualize the weights of the first layer. If your weights are like this, crisp and clear, with very nice edges, recognizable filters, this is where you want to be.
If instead you have this kind of noisy visual for your first-layer weights, it could be an indication that you either haven't converged, or that there might be a problem with your weight regularization.
So that's another way to inspect problems in your models. There are a lot of such tips that I think are really, really useful to look at. Andrej Karpathy compiled a large number of them last April, and I encourage you to look at this blog post and actually think about each one of them: why it makes sense to do that, and why it is useful to do that.