From YouTube: 07 - Practicalities in Deep Learning - Joel Hestness
Description
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
So up next we've got Joel Hestness, who is going to talk about some practical things. Joel is a senior research scientist at Cerebras Systems, a startup that does specialized AI hardware, pretty cool stuff; I'm not sure how much we can say publicly, but it's cool stuff. At Cerebras he helps formulate strategy to support practitioners and users on the hardware. He also leads some natural language research; you can ask him about that later. I don't think he's going to cover anything related to that, but maybe he will. Before that he worked at Baidu at the Silicon Valley AI Lab on scaling deep learning; we're going to be talking about that later this week as well, very interesting stuff. He has a PhD from UW-Madison and generally has broad experience with a lot of things relevant to us: computing applications, numerical methods, graph analytics, and machine learning. So let's thank Joel for coming today.
So, as Stephen said, I recently moved from Baidu Research to Cerebras Systems. I have a background in heterogeneous system design and optimization, and part of my past life was in performance optimization for large-scale applications. I've been doing deep learning research over the last three years or so. I want to set up the objective of this lecture and hopefully get some nods if this is interesting to people, so I can get a sense for what I should spend the most time on.
So hopefully that's what everybody's here for. Like I said, I'm going to frame this in the context of a research study that we did, so that you can get some ideas about learning algorithm dynamics and build up some intuition, so that when you sit down and have to start training models you have ideas for directions to go. I'm also going to be highlighting some vocabulary throughout the talk that will be useful for you to refer to.
There are some practicalities that I've noted that I think are valuable to people coming into this, just in training, the first things to note when you're trying to do deep learning training. I'll go through those, and then what I'd really like to get at, kind of the meat of this talk, is that we are going to have datasets.
Hopefully the datasets are going to be telling us things, and we want to design our models such that they can find information in the data. Model architecture search will be the topic, but really what we're trying to do is figure out what the data is telling us. Then I want to get at how accuracy scales, and how you go about training these things on large-scale systems.
How best to do that? There are some rules of thumb and helpful tips there. So let me ask a few polls. I'm going to assume as table stakes that people in here are scientists; I think I saw a poll. You have datasets, you probably have some compute power, and you probably have existing models. Are those safe assumptions?
Okay, cool. So I want to know what we are aiming for. We're here learning about deep learning, but why is it that we're trying to learn about it? Is it because people are looking for novel research insights? Maybe raise your hand if you're looking for that. Okay, so lots of researchers here. How about validating or verifying existing models?
Maybe, okay. And how about product-oriented improvements? Is anybody working on product things? A couple, okay. Are there other things that people are planning or finding to use deep learning for? Thorsten might have the answers for some people; I've done a little bit of that, and Thorsten has definitely taken it to the extreme. I will cover a little bit of it. And then can people tell me what tools they currently use, so that I know what to gloss over if necessary?
So let me jump in. I'll describe this research study that we ran to set up the context for some of the tips and tricks that I'll describe later. When I started at Baidu, I was trying to get my hands dirty with some of these existing models. This was after the craze of computer vision, where there were nice computer vision models that you could take off the shelf, like ResNets, and train them on large datasets.
I was speaking with a bunch of the machine learning researchers there, and we had this common observation across a lot of different application domains: there's this view that if you increase your training dataset size, you get improved accuracy. Here's an example of one of those plots, from Banko and Brill's paper in 2001. This is on a task called word sense disambiguation: if you have a word, what is the meaning of that word in the context it's in?
It felt like some techniques that we had didn't really work, but the techniques that they tested here did follow these trends. Because of the history of this, it wasn't clear which we should go after: is dataset scaling the thing that really matters, or is it actually the models? The plot here suggested to them that there's a different model that is the best model at each data size; so, for instance, the winnow learner.
So, given these observations, we had a few hypotheses that we wanted to test. Maybe all applications have the same or similar trends. If they had the same trends, that would be kind of nice: then I could say, in this application domain I have this trend, and in this other application domain I can expect it to be the same.
Maybe all models also follow the same trends. I think the characteristics you find in deep learning actually suggest that as you get more sophisticated models, you can eliminate some of the impediments to learning that we've had with traditional machine learning techniques; some of that has probably been covered here already. And then for a machine learning researcher or engineer you have this sort of conflict, I guess: if more data gets me better accuracy and that's all I need to do, that kind of sucks.
B
Like
surely
surely,
there's
something
more
I
can
do
as
a
machine
learning
researcher
engineer.
So
we
went
in
and
kind
of
dug
around
to
look
at
what
the
what
what
is
theory
tell
us
about
how
accuracy
should
scale
with
data,
and
so
I
should
note
here
that
what
I'm
referring
to
is
generalization
error.
Everybody
turn
familiar
with
the
term
generalization
error.
It's
how
well
a
model
generalizes
to
unseen
data
elements.
Epsilon is the error, m is the number of training samples, and alpha, beta, and gamma are some constants. The basic idea is that error should scale as some constant times your dataset size to some power, plus some other constant, roughly epsilon(m) = alpha * m^beta + gamma, and the exponent beta in these papers is generally minus 1/2.
Alpha and gamma are constants that are related to the problem. Alpha captures things like the model characteristics, things like inductive biases, and gamma captures a sort of minimum error: if you have stochasticity in your data, there's a minimum error that I can reach. I will go through these in detail with some plots going forward, but anyway, theory is telling us that we should expect this minus 1/2 exponent in this power law.
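As a concrete illustration of that power-law form, here is a minimal sketch of fitting epsilon(m) = alpha * m^beta + gamma to measured learning-curve points with SciPy; the sample points and starting guesses below are made up for illustration, not taken from the study.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(m, alpha, beta, gamma):
    # Generalization-error model: eps(m) = alpha * m^beta + gamma
    return alpha * np.power(m, beta) + gamma

# Hypothetical (dataset size, validation error) measurements, not real data.
m = np.array([1e4, 2e4, 4e4, 8e4, 1.6e5, 3.2e5])
err = np.array([0.42, 0.36, 0.31, 0.27, 0.24, 0.22])

# Fit alpha, beta, gamma; beta should come out negative (error falls with more data).
(alpha, beta, gamma), _ = curve_fit(power_law, m, err, p0=[10.0, -0.5, 0.05], maxfev=10000)
print(f"alpha={alpha:.3f}, beta={beta:.3f}, gamma={gamma:.3f}")
```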
B
So
the
large-scale
study
that
we
were
going
to
run
was
two
tests.
How
does
error
scale
for
a
bunch
of
different
applications
using
a
bunch
of
different
models,
and
so
we
started
with
didn't
start
with
machine
translation,
but
this
is
the
one
where
we
finally
I
think
hid
the
methodology
correctly
in
nmt.
So
the
the
task
with
neural
machine
translation
is
it's
a
text
to
text
task
I'm,
taking
in
strings
in
one
language
and
I'm
generating
strings
in
another
language.
So
this
is
like
Google
Translate.
One model has 208 hidden dimensions, hidden weights in each layer, so it's a sort of narrow graph, a narrow model, and then we also increased the model size to 512 hidden dimensions. What we see is this predictable power law plus some constant. The example here shows that this small model has a high gamma, suggesting that there's something going on that's impeding its ability to train any further than this error. But what was interesting to find here is the beta.
B
The
thing
that
we
were
sort
of
relying
on
for
scaling
accuracy
with
data
is
not
equal
to
this
minus
1/2.
That
was
expected
in
the
theory.
Instead,
it's
it's
a
smaller
exponent,
smaller
in
magnitude,
and
so
what
does
that
mean?
It
basically
means
that
I'm
not
scaling
as
well
as
theoretical
and
as
I
increase
my
dataset
size,
any
questions
on
this
so
far.
So
there
are
a
lot
of
factors
in
this
whole
learning
curve
collection
that
are
sort
of
tricky
to
to
decouple
we've
done.
B
We
did
a
bunch
of
ablation
studies,
sort
of
after
the
paper
that
we
put
out
about
it
and
I
do
have
some
of
that.
Insight
in
here
hopefully
get
at
some
of
those
questions,
and
maybe
ask
that
again
later,
all
right.
So
in
practice,
though,
I'm
not
going
to
just
train
a
single
model
size
on
every
dataset
size
as
I
grow,
the
dataset
size
I
probably
want
to
grow
the
models
also,
and
so,
and
when
I'm
going
to
deploy
a
model
say
I
was
going
to
take
a
model
to
production,
a
production
scenario.
B
What
I'm
going
to
do
is
I'm
going
to
train
a
bunch
of
models
and
I'm
gonna
choose
the
ones
that
generalize
the
best
and
so
without
changing
sort
of
the
model.
Family
I'm
just
changing
hyper
parameters,
tuning
the
learning
rates,
things
like
that.
We
get
a
different
curve,
so
here
I'm
picking
the
best
the
best
model
at
each
data
set
size.
So
here's
a
small
data
set
I'll
still
natural,
sorry
machine
translation,
the
the
other
models
that
I
was
showing
were
head
scores
that
were
above
this.
This is the point that was the best model at this size. What we see now is that if we compose these curves together, we actually do see a power law, even without a constant; it fits very well for small dataset sizes. I'll describe this in a little more detail, but with m being the number of training samples, the exponent beta here is actually much worse than what we'd expect from theory. So this is somewhat disappointing.
The best I can do is guess the most likely output based on what the output distribution of my data looks like. So if the most common word in the English language is the word "the", then if I'm trying to predict the next word, "the" is a good choice. That's what I mean by best-guess error. As you continue increasing the dataset size, I can start lifting out nuance in the data, I can start finding relationships from the inputs for predicting the outputs, and I can fall into this power-law region.
B
The
power
law
region
is,
it
turns
out,
is
predictable
and
unfortunately,
I
have
to
collect
this
and
predict
it
empirically,
but
once
I
get
in
that
region,
I
can
sort
of
predict
what
my
accuracy
should
be
at
a
larger,
a
set
size
and
then,
if
you
have
so
this
is
for
all
real
applications.
All
real
applications
are
going
to
have
some
irreducible
error,
that's
nonzero!
B
What
I
mean
here.
Is
there
some
stochasticity
in
the
data
things
that
I
can't
predict,
no
matter
how
much
data,
no
matter,
how
many
input
features
I
have
and
so
there's
sort
of
a
lower
bound
here.
If,
if
we're
in
a
kind
of
vacuum,
you
could
expect
this
to
go
to
zero
error.
If
your
error
loss
function
is,
has
the
ability
to
go
to
zero,
this
power
law
region
is
what
we're
going
to
sort
of
leverage
and
I'll
talk
about
it
in
more
detail
in
the
rest
of
the
talk,
but
it's
it's
predictable.
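As a sketch of that extrapolation step, here is a small helper that inverts the fitted power law to estimate the dataset size needed to reach a target error; the constants shown are hypothetical fitted values, not numbers from the talk.

```python
def required_data_size(target_err, alpha, beta, gamma):
    # Invert eps = alpha * m^beta + gamma for m.
    # Only valid when target_err > gamma (you can never beat the irreducible error)
    # and beta < 0 (error falls as data grows).
    if target_err <= gamma:
        raise ValueError("Target error is below the irreducible error; unreachable.")
    return ((target_err - gamma) / alpha) ** (1.0 / beta)

# Example with hypothetical fitted constants.
print(required_data_size(0.18, alpha=10.0, beta=-0.3, gamma=0.05))  # ~2e6 samples
```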
B
We
can
collect
this
empirically.
Yes,
so
the
the
list
of
papers
that
I
put
up
there
has
details
on
each
of
these
and
the
proofs
end
up
being
very
complicated.
There.
Isn't
there
isn't
a
good
way
to
sort
of
summarize
this
and
describe
it
in
as
simple
a
there
are
and
then
the
one
of
the
problems
is.
There
are
different
assumptions
that
you
put
into
calculating.
So
if
you
look
at
nonparametric,
estimators
nonparametric
estimators
can
have
different
exponents
in
the
denominator.
B
Anything
that
you're
going
to
be
targeting
probably
with
deep
learning,
will
have
this
minus
1/2
as
the
as
the
best
case
that
you
can
get
yes,
so
this
is.
This
is
tricky.
It's
tricky
to
decompose
this.
The
major
reasons
that
we've
seen
are
things
like
visibility
of
all
of
the
information
in
your
inputs.
So,
for
example,
if
you
were
to
take
a
task
where
I
have
full
visibility
of
the
inputs,
the
the
irreducible
error
should
be
0.
B
There
I'm
gonna
try
to
integrate
across
the
other
inputs
to
make
a
prediction
for
the
dimensions
that
I
can't
see,
and
so
then
that's
sort
of
an
averaging
function
where
I'm
now
averaging
between
a
model
that
would
have
perfect
visibility
of
all
the
inputs
and
the
error
is
going
to
decline
or
it's
gonna.
It's
gonna,
get
worse
sorry,
increase.
B
Okay,
so
the
methodology
that
we
we
followed
for
this
is
nice.
In
a
vacuum
we
shard
the
data
set
in
to
power
to
sizes
we
for
each
data
set
size,
we
train
a
bunch
of
models
looking
for
kind
of
the
best
model,
a
best
fit
model
plot.
The
results
write
the
paper.
This
is
rife
with
the
spherical
cow
in
a
vacuum
and
so
hopefully
like
the
stuff
that
will
come
after
this
will
show
you
a
lot
of
the
practicalities
we
ran
into
when
we
were
trying
to
do
this
at
scale.
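A rough sketch of that "spherical cow" methodology loop, with placeholder train and evaluate functions standing in for whatever framework and model family you actually use; the shard count and sampling are illustrative assumptions.

```python
import random

def learning_curve_study(dataset, hyperparam_grid, train_fn, eval_fn, num_shards=6):
    """Shard a dataset (a list of samples) into power-of-two sizes and record the
    best validation error found at each size, for fitting a learning curve later."""
    results = []
    for k in range(num_shards):
        size = len(dataset) // (2 ** (num_shards - 1 - k))  # grow sizes by powers of two
        shard = random.sample(dataset, size)
        # Hyperparameter search at this dataset size; keep the best-fit model's error.
        best = min(eval_fn(train_fn(shard, hp)) for hp in hyperparam_grid)
        results.append((size, best))
    return results  # (dataset size, best validation error) pairs
```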
B
When
you're
setting
up
your
tools,
there
are
some
things
in
training
you've
seen
a
lot
of
this
probably
experienced
a
lot
of
this
already
so
I'm
gonna
try
to
hit
things
that
are
maybe
unique
from
what
other
people
have
said,
and
also
things
that
have
been
very
valuable
in
our
testing.
I
can
skip
over
code
because
a
lot
of
you
have
seen
the
kind
of
code
that
exists
in
frameworks.
B
Obviously,
there
are
many
useful
tools
available
online,
many
pre-trained
models
and
things.
My
maybe
short
story
on
this
is
when
we
were
doing
this
study.
I
had
a
young
engineer
who
was
he
was
familiar
with
cafe
from
some
work
during
school.
He
was
trying
to
set
up
cafe,
to
run
some
tests
for
this
large
scale
study
and
had
been
finding
finding
it
difficult
to
set
up
the
data
pipeline
and
other
things
a
couple
days
into
it.
B
We
sat
down
together
and
he
said
you
know
I'm
struggling
with
these
things
and
I
said
well,
maybe
maybe
we
could
find
another
tool
that
would
work
online
and
in
about
45
minutes
we
had
a
tensor
flow
model
that
was
training
and
running
at
the
scale
that
we
were
hoping
for
so
be
sort
of.
Mindful
of
this,
you
can
go
find
nice
tools.
B
Do
things
from
off-the-shelf
models
as
you're
setting
up
to
train
the
storage
that
you'll
run
into
storage
can
be
a
challenge
also,
datasets
that
we
were
training
on
were
actually
small
relative
to
a
lot
of
things
that
people
in
here
might
be
dealing
with
some
of
the
language
tasks.
That
I
was
looking
at
where
only
gigabytes
of
data
set
speech.
Recognition
were
up
two
terabytes,
but
people
in
here
maybe
have
have
tens.
Hundreds
of
terabytes
of
data
and
so
storage
is.
These
can
be
massive
trying
to
get
good
input
pipelines
and
things
is
tricky.
B
If,
if
possible,
just
use
off-the-shelf
data
sets
for
setting
up
to
set
baselines
and
match
what
previous
people
have
done,
and
then
this
is
also
probably
goes
without
saying,
make
sure
you're
taking
checkpoints,
while
you're
training,
if
you're,
if
you're
a
machine,
happens
to
die,
you
don't
want
to
have
to
rerun
everything
from
scratch
and
then
one
thing
that
I've
found
is
is
useful
for
people
who
have
maybe
seen
a
little
bit
of
they
haven't
had
a
whole
lot
of
experience
with
frameworks.
Yet
is
theirs.
B
They're
sort
of
two
different
ways
that
frameworks
are
structured,
one
is,
is
called
eager
evaluation
frameworks.
The
other
is
lazy,
evaluation
framework,
so
I'm,
given
that
a
lot
of
people
in
here
have
used
things
like
MATLAB
and
are
you
know
that
and
every
time
you
execute
a
line
of
code,
the
data
is
ready
for
you.
You
can
inspect
the
data,
that's
really
nice
for
what
mustafa
was
describing.
Trickier frameworks use a thing called lazy evaluation. This is where you construct the graph first and the framework builds an internal representation. So instead of something like this, where the gray boxes actually hold my data, what I get is a graph like this as an internal representation: I have references to the weights, I have references to the inputs, and I need to tell the framework to propagate, to put this data into these locations, these handles, and give me back results.
One benefit of doing lazy evaluation is that you can do some optimization before running the session, so sometimes this can be faster.
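A small illustration of the two styles: eager with NumPy-like immediate values versus lazy with a TensorFlow 1.x-style graph and session. This is a sketch of the contrast, not a full training setup.

```python
# Eager style (NumPy, PyTorch, TF eager): each line produces a concrete value you can inspect.
import numpy as np
x = np.random.randn(4, 3)
w = np.random.randn(3, 2)
y = x @ w  # y holds real data right now; print(y) works immediately.

# Lazy style (TensorFlow 1.x graph mode): build symbolic handles first, then run the graph.
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()
x_ph = tf.placeholder(tf.float32, shape=[None, 3])   # handle for future inputs
w_var = tf.Variable(tf.random_normal([3, 2]))        # handle for the weights
y_op = tf.matmul(x_ph, w_var)                        # a node in the graph, no data yet
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    result = sess.run(y_op, feed_dict={x_ph: np.random.randn(4, 3).astype(np.float32)})
```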
Okay. Debugging is something that is always a challenge in these things; you want to be able to probe tensors, you want to be able to print values. I'm not going to go into much detail, but there are some great references online for ways to do this.
One of the challenges that we had when we were setting up these large-scale experiments, because we were running a bunch of experiments in parallel doing a lot of hyperparameter search, was that we needed to be able to go back and identify which runs were giving me which data points in my set. It's maybe obvious, but keeping your configuration files with your checkpoints is a really handy thing to do. It's very painful when you find yourself in the situation where I have a checkpoint, I know that it was the good model, I need to reproduce some results or run on a different validation set, but I don't have the parameters that I used for training, so I can't do fine-tuning or something. So keep your configurations around.
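One simple way to do that is to write the run configuration next to every checkpoint. Here is a minimal sketch; the save callable, directory layout, and config fields are hypothetical placeholders.

```python
import json, os, time

def save_checkpoint_with_config(save_model_fn, config, ckpt_dir, step):
    """Save a checkpoint and the exact config used to produce it, side by side."""
    os.makedirs(ckpt_dir, exist_ok=True)
    ckpt_path = os.path.join(ckpt_dir, f"model_step{step}.ckpt")
    save_model_fn(ckpt_path)  # framework-specific save call (e.g., saver.save / torch.save)
    with open(os.path.join(ckpt_dir, f"model_step{step}.config.json"), "w") as f:
        json.dump({"config": config, "step": step, "saved_at": time.time()}, f, indent=2)
    return ckpt_path

# Hypothetical usage:
# save_checkpoint_with_config(lambda p: torch.save(model.state_dict(), p),
#                             {"lr": 1e-3, "hidden_dim": 512}, "runs/nmt_512", step=10000)
```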
Mustafa was talking about looking at our training curves and comparing them to validation curves, and about wanting to stop our training early, before the model starts overfitting.
This is where we start to see the validation curve diverge from the training curve. One practicality that you'll run into when you're running a large hyperparameter search is that different model sizes are going to train at different rates, the training curves are going to look different, and setting how frequently you run validation is a tricky thing. The practical thing you're working with here is that your validation set has to be some size; it has to be large enough to give you a statistically significant estimate of the divergence between the training and validation curves.
B
So
I
want
it
to
be
large
enough,
but
I
don't
want
it
to
be
so
large
that
I'm
actually
spending
more
time
doing,
validation
than
actually
training
and
so
picking
the
validation
periods.
How
frequently
you
run
validation
is
a
little
bit
tricky
I,
don't
want
it
to
be
too
frequent
to
cause
unnecessary
computation,
but
I
want
to
sort
of
pick
a
frequency
where
I'm
sure
that
I
won't
miss
the
best
model
in
there
somewhere.
I
want
to
be
very
close
to
that
best
model
that
I've
found.
B
This,
unfortunately,
is
something
you
have
to
kind
of
do
empirically
and
test
a
few
models.
Once
you
get
the
hang
of
it
with
some
models,
you
know
it
will
extend,
so
you
can
use
this
practically
speaking
it
later.
If
you
increase
data,
set
sizes
and
increase
models,
you
can
kind
of
estimate
what
this
should
be.
Okay,.
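Here is a minimal sketch of that pattern: run validation every so many steps, keep the best checkpoint, and stop once the validation loss has not improved for a while. The period and patience values are placeholders you would tune empirically, as he describes.

```python
def train_with_early_stopping(train_step, validate, save_ckpt,
                              max_steps=100_000, val_period=1_000, patience=10):
    """Validate every `val_period` steps; stop after `patience` validations with no improvement."""
    best_val, bad_evals = float("inf"), 0
    for step in range(1, max_steps + 1):
        train_step()
        if step % val_period == 0:
            val_loss = validate()
            if val_loss < best_val:
                best_val, bad_evals = val_loss, 0
                save_ckpt(step)            # keep the best model seen so far
            else:
                bad_evals += 1
                if bad_evals >= patience:  # validation has diverged from training
                    break
    return best_val
```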
B
After
training,
there's
there's
a
challenge,
I'd
say:
I've
selected
a
model
on
my
validation
set
and
now
I
want
to
test
it
on
some
sort
of
real-world
data.
My
test
set
and
there's
there's
a
problem
here
called
test
train
distribution
mismatch.
This
is
where
the
the
test
set
has
a
different
data
distribution
from
the
training
set.
Everybody
familiar
with
these
challenges
already
some,
maybe
some,
not
okay.
B
So
when
you're
going
into
these
problems,
you
want
to
have
a
strategy
for
how
you
would
address
something
like
test,
train
mismatch
and,
as
an
example,
I
heard
this
from
some
Amazon
Alexa
engineers
that
they
had
set
up
this
nice
training
set.
That
was
that
was
using
data
from
like
audio
books
and
recorded
audio
in
a
sound
booth,
and
so
there
was
it
was
very
clean
audio
and
it
was
people
reading
from
something
it
wasn't
like
a
speaker
up
front
telling
you
things.
B
So
not
only
was
it
very
clean
audio,
but
it
was
kind
of
a
weird
contrived
sample
where,
if
I'm
talking
to
alexa
the
vocabulary
and
things
I'm
going
to
use
are
different
than
if
I'm
just
reading
from
from
a
book,
and
so
when
they
sort
of
deployed
this
in
a
in
a
practical
setting
with
a
device
that
had
a
microphone
in
it.
The
microphone
was
catching
ambient.
B
Noise
echoes
all
kinds
of
weird
things
and
the
phrases
that
people
are
using
to
speak
to
this
thing
we're
wrong,
and
so
there
was
a
really
bad
mismatch
between
their
to
their
training
distribution
and
the
actual
scenario
deployment.
I
so
Amazon
picked
a
strategy
here.
It
was
kind
of
unique.
They
went
off
and
tried
to
remove
all
the
ambient
noise
and
echoes
by
designing
this
really
incredible
microphone
array
into
the
device,
and
so
that's
something
you
can
consider
doing
like
punt.
B
Sometimes
tricky
to
pick
out
evaluation
metrics
that
make
the
most
sense
for
the
problems
that
we're
working
with.
So
as
an
example
the
when
we
were
working
on
our
speech
generation
systems
that
by
do
these,
these
were
sort
of
a
new
set
of
systems
where
we're
generating
trying
to
generate
audio
waveform
from
a
text
string,
and
we
want
that
audio
waveform
to
sound
natural.
Sorry, the problem with the L2 norm is that because it's a distance measure over two images, like the difference of two spectrograms, it permitted small perturbations or variance that caused things like static noise or robotic voices; the edges of things were hard in the spectrograms. To fix this, we ended up normalizing a few different loss functions, summing them together, and weighting them, and that helped with some of the issues.
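A sketch of that kind of composite objective: combine a few loss terms over the predicted and target spectrograms with weights. The specific terms and weights below are hypothetical, not the ones used at Baidu.

```python
import torch
import torch.nn.functional as F

def composite_loss(pred_spec, target_spec, w_l1=1.0, w_l2=0.5):
    """Weighted sum of two loss terms over predicted vs. target spectrograms."""
    loss_l1 = F.l1_loss(pred_spec, target_spec)   # tolerant of small per-bin offsets
    loss_l2 = F.mse_loss(pred_spec, target_spec)  # penalizes large deviations
    # The weights act as the normalization; tune them so neither term dominates.
    return w_l1 * loss_l1 + w_l2 * loss_l2
```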
Then eventually, what we decided to do, instead of just measuring error as something I can calculate in a framework, was to generate some samples and kick those samples off to something like Mechanical Turk. We used Mechanical Turk, posted our generated audio waveforms, and then asked users: what's your preference among these different versions, which one is the best? So there are evaluation metrics, like mean opinion score, which you can't actually calculate inside of your framework during a training run, and these crop up in a lot of practical applications.
B
Questions
on
these
okay,
all
right.
So
let
me
get
at
kind
of
the
meat
here
now
that
we're,
through
some
of
the
practical
training
set
up
I,
wanted
to
call
this
section
model,
architecture
search,
but
really
what
I
think
I
want
to
the
point
I
want
to
get
across
is
that
what
we're
doing
is
we're
using
models
to
give
us
information
about
data
and
data
is
really
the
thing
that
we're
trying
to
analyze.
We could go and build deep learning models that can model arbitrary structure. For example, there is a thing called the neural Turing machine, and technically it's Turing complete; it should be able to compute any function, any continuous, or sorry, any real-valued function, something that doesn't have decidability issues.
B
Unfortunately,
the
challenge
that
we
have
when
we're
doing
when
we're
doing
model
architecture
search,
is
it's
hard
to
optimize
some
of
these
models,
so
the
neural
Turing
machine
was
a
nice
theoretical
nicety,
something
that
was
interesting
to
try
play
around
with
certain
tasks.
It
was
very
hard
to
optimize,
so
it
was
it's
we've
drawn
ideas,
inspiration
from
it
since
then,
but
it's
not
a
model
that
that
we
generally
use
in
practice.
B
All
of
what
I'm
going
to
be
going
through
here
then,
is
focusing
on
how
we,
how
do
we
limit
the
the
challenges
to
training
these
models
and
the
key
insight
is
something
about.
What
does
the
error
surface?
Look
like
how
do
I
make
my
training
process
look
like
a
convex
process
at
least
locally,
and
so
I'm,
going
to
flip
the
perspective
start
from
the
data
and
look
at
what
the
data
is
telling
us
design
models
to
match,
and
so
here
the
here's,
an
overview
kind
of
the
pieces
that
I'm
going
to
describe
here.
The models that you're training are going to need sufficient capacity to capture the information content that's in the dataset, and I'll talk about that. Those learning curves also gave us some best-fit model sizes, so now we can analyze what the size of the best-fit model is and how capacity grows as we increase dataset size. And what you're going to be doing, if you're modifying models to improve their accuracy on your dataset, is engineering biases into them.
B
Ok,
so
data
sets
are
usually
usually
contain,
hierarchical
relationships,
so
these
are
bottom-up
structures.
You're
gonna
be
finding
small
bits
in
the
early
stages
of
the
models.
We've
seen
a
little
bit
of
this
already
and
later
deeper
in
the
model,
we're
going
to
be
capturing
sort
of
higher-order
concepts,
and
so
a
good
mental
model
for
what
your
data
set
looks
like
is
something
like
an
ontology
and
unfortunately,
what
deep
learning
results
in
what?
What
deep
learning
models
learn
is
not
always
the
same.
Ontology
is
our
intuition.
B
The
big
takeaway
here
is,
if
you
have
sort
of
poor
representations
early
in
the
model,
capturing
the
little
bits
it's
hard
to
to
use
that
information
later
in
the
model,
and
so
the
model
have
sort
of
instability
or
error
that
gets
introduced
early
on
so
Mustafa
showed
an
example
here.
I'll
spend
a
little
bit
more
time
on
this.
This is from a very interesting talk that I'd highly recommend watching, because it breaks down the hierarchy of data in the context of computer vision models, Zeiler and Fergus, a great talk. They designed a thing called a deconvolutional network that allows them to back out what filters later in the model are perceiving from an image. So here are some examples: this is what the network is perceiving after layer 1.
These are patches of the images that are matched by these different filters. The top corner here looks like an edge detector for something that's north-south oriented, and so you see a bunch of patches from an image that are north-south oriented; we're capturing information about this edge here. Some of these are color also, so color is captured in these patches, like the green here. As we move farther into the network, we see that the network is starting to compose these things.
B
We
also
see
things
like
barcodes
in
the
middle
of
the
network,
so
these
are
things
that
have
still
have
fairly
regular
structure,
but
at
a
higher
granularity
than
earlier
in
the
network,
as
we
get
very
deep
in
the
model.
So
this
was
a
I
think
used
something
like
vgg
Nets
at
towards
the
end
of
the
model,
we're
capturing
things
that
are
more
organic
in
nature,
their
compositions
of
a
lot
of
different
bits
and
pieces.
So,
for
instance,
capturing
eyes
of
owls.
B
You've
got
a
little
bit
of
symmetry,
but
it's
not
as
regular
as
things
like
lattices.
So
it
takes
a
little
bit
more
hierarchy
to
extract
those
things
out.
This
makes
sense.
Okay,
there's
another
example.
Here,
that's
that's
useful
or
helpful.
One!
Isn't
it
to
kind
of
see
how
this
works
and
that's
in
language
modeling?
So
this
is
sort
of
my
main
area
of
research.
The first layer of language models is an embedding layer, and here what you're trying to do is map from a vocabulary of some sort into an internal representation that the model can use, so it's going to be looking up vectors. Here's a low-dimensional projection of a few of these vectors; let's say the corner here is zero-zero, so "man" is positioned here, "woman" is positioned here, etc. Given this low-dimensional representation, I can find the eigenvalues or something like that.
B
The
embedding
layers
of
these
models
captures
sort
of
low
dimensional
relationships
like
this
like
gender,
so
other
things
that
you
might
consider
would
be
you
know,
taking
a
verb
and
making
it
an
adverb
that
might
be
a
projection
in
this
low
dimensional
space.
If
we
look
at
models
like
elmo,
Burt
and
GPT
to
some
of
the
more
recent
transformer
models,
more
complicated
models,
they're
going
to
use
attention
mechanisms
to
now
combine
these
embedding
representations
and
they'll.
Do
it
somewhat
hierarchically,
and
so
this
is
a
simplified
example
of
what
these
things
are
doing.
B
But
you
could
imagine
if
we
had
this
sequence
that
looks
like
this.
The
King
asks
the
blank
if
she
would
attend
a
party
or
something
given
the
representation
of
King
here
I'm
going
to
attend
heavily.
So
the
darker
arrow
here
means
a
model
is
probably
attending
heavily
on
the
word
King,
probably
attending
heavily
on
she
its
it
needs
to
it's
recognizing
that
the
position
of
this
object
probably
has
a
gender
female.
The picture that I grabbed here was just from online, so that's the only reason; I could have gone the other way, I could have. One other interesting thing is, if you look at the position of the word "human" here, or "person", I think one of these words, it will land sort of in between these, and there will be a vector to the female gender and a vector to the male gender. These are the sort of analogy tasks that embeddings can take on. Yeah.
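That analogy structure is easy to poke at in code. Here is a minimal sketch with tiny hand-made vectors, just to show the vector-offset idea (king - man + woman lands near queen); real embeddings would come from a trained model, and these toy numbers are made up.

```python
import numpy as np

# Toy 3-d embeddings, invented for illustration; real ones are learned.
emb = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([2.0, 0.0, 0.9]),
    "queen": np.array([2.0, 1.0, 0.9]),
}

def nearest(vec, vocab):
    # Cosine similarity against every word in the toy vocabulary.
    sims = {w: vec @ v / (np.linalg.norm(vec) * np.linalg.norm(v)) for w, v in vocab.items()}
    return max(sims, key=sims.get)

analogy = emb["king"] - emb["man"] + emb["woman"]
print(nearest(analogy, emb))  # "queen" with these toy vectors
```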
Yes, right, so that's exactly how attention mechanisms are designed. You basically take a query; the query in this case is this blank position here, this question mark. I have embeddings for each of these, except this one, where I'm embedding kind of an empty character, and then on the output of this, what I'm doing is some relationship analysis, clustering, between these vectors. So I would take the vector for "king" and the article here, this vector.
B
These
are
probably
related,
and
the
attention
mechanism
would
look
at
the
those
two
vectors
they
do.
Like
a
you
know,
a
dot
product
between
them.
The
dot
product
is
gonna,
say
how
closely
related
these
things
are,
how
important
is
Queen
in
this
representation
and
then
from
the
dot
product.
I'll
do
a
softmax,
so
this
will
be
turn
it
into
a
probability
distribution.
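That dot-product-then-softmax recipe is the core of attention. Here is a minimal NumPy sketch of single-head scaled dot-product attention over a short sequence of embeddings; the dimensions and random vectors are placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_attention(query, keys, values):
    """query: (d,), keys/values: (seq_len, d). Returns a weighted mix of the values."""
    scores = keys @ query / np.sqrt(query.shape[-1])  # how related each position is to the query
    weights = softmax(scores)                         # turn scores into a probability distribution
    return weights @ values, weights

# Toy example: the blank position queries the other token embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))      # embeddings for "The king asks the ___ ..."
blank_query = rng.normal(size=(8,))
context, attn = dot_product_attention(blank_query, tokens, tokens)
print(attn)  # attention weights over the five tokens, summing to 1
```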
That's maybe something I should have expanded on here, because we saw, I think, a lot of content about computer vision models. Oh, I will say a little bit more about new models. Some of the new models that are being used very widely right now are called transformers, and they're basically just sets of these attention mechanisms, without other things, without convolution.
B
Generally,
the
it's
worth
saying:
generally
attention
mechanisms
work
well
in
things
like
sequences
and
things
like
text,
you
can
use
them
in
computer
vision,
applications
or
certain
things
in
computer
vision
that
are
important
like
if
I'm
trying
to
steer
a
camera.
For
instance,
I'd
like
to
figure
out
where
I
should
be
attending
and
then
start
to
try
to
set
my
reticle
of
the
camera
on
that
position.
So
you
can
use
them
in
computer
vision,
but
the
applications
aren't,
as
aren't
as
clear
obvious.
The idea is that as I propagate through a network, I can mess up the information content that I have at any stage by narrowing my model. If you consider a model that was just a bunch of fully connected layers, and I reduce the dimension of one of these fully connected layers, I'm squashing my information down into a low-dimensional representation, and because of that I can lose information.
B
Generally,
we
don't
design
models
like
that.
We
keep
all
the
layers
kind
of
the
same
width
so
that
we
don't
squish
squish
out
information,
but
the
bottleneck
is,
is
a
really
important
concept.
So,
if
I
squished
the
inner
dimension
of
the
model
at
some
points,
the
the
I
am
bottlenecking
the
amount
of
information
that
can
traverse
through
it.
So
in
recent
years,
we've
been
looking
at
the
information
bottleneck
in
the
context
of
deep
learning,
and
we
get
some
very
interesting
plots
that
look
like
this.
They also echo this argument that data is hierarchical. This plot is showing, on the horizontal axis, the mutual information between my input X to the network and the activations at layer T of the model, and each one of these lines is the training process for one of the layers: this is the first layer, the second layer, the third layer, going in depth through the model. So as I'm training, the information is changing.
Why is it small? It's small because what it's doing is detecting small bits of the information in the input, and then later on the further layers are going to combine that information with more information that they're extracting from the input, and so the information content relative to the input is increasing; it moves to the right as you get deeper in the model.
B
The
vertical
axis
here:
I
have
the
mutual
information
between
the
current
layer,
activations
and
the
output
of
the
model,
and
at
the
end,
hopefully,
these
activations
are
capable
of
telling
me
what
the
function
should
be.
What
function
I'm
calculating
it
should
be
able
to
predict
the
outputs
of
the
model,
so
hopefully
the
mutual
information
should
the
outputs
is
perfect
in
this
case
one.
So if I need to reduce the dimension of my input down to make a classification, say I'm just saying whether something is a cat or a dog in an ImageNet image, at some point I have to reduce the dimension of my model down to a prediction, cat or dog. I can do that right at the end, but that might not be the best approach; it might actually be smarter to enforce that I'm slowly eliminating information by narrowing the model as it gets deeper. Does that make sense?
Okay, so, finally, to wrap back to the tests that we were running for this research study, what we found is sort of well-known, but the systematic study we did gives us some ways to decompose this, to break it apart. Finite datasets have a finite amount of information, which is actually convenient, because that means I should be able to represent them with a finite model.
B
Models
with
sufficient
capacity,
which
I'll
define
in
a
little
bit
we'll
fit
a
full
data
set,
and
so
what
I'm
plotting
here
is
examples
of
what
is
sufficient
capacity
or
what
is
sufficient
number
of
parameters
in
a
model
to
fit
a
data
set.
So
here
what
I
have
is
this
is
for
word,
language
modeling.
We
had
particular
data
sets
and
we
were
looking
at
training
sets
that
were
chunks
of
this,
so
this
is
sort
of
powers
of
two
roughly
different
sizes,
and
this
is
the
validation
loss.
B
So
this
is
how
well
these
models
generalizes
I
trained
them.
This
small
data
set
has
sort
of
a
finite
amount
of
information
that
a
small
models
of
model.
That's
this
is
model
size
number
Prem
here
model
size
of
what
maybe
200,000
parameters
is
sufficient
to
fit
the
training
set
here
and
anything
that
any
larger
model
here
would
have
capacity
to
over
fit
so
I
don't
actually
need
a
larger
model.
This gets at something that Mustafa pointed out early on: it's possible to grow your dataset size by just taking another copy of your dataset and concatenating it to the one you currently have. There's no new information there, because I already have all of that information in previous samples, so you have to be careful about what it means to grow a dataset.
B
What
we
really
want
to
do
is
make
sure
that
we're
growing
the
information
content
in
our
data
set,
and
so
this
this
is
sort
of
captured
here
or
subdividing
our
data
set
into
different
chunks
sizes
and
then
looking
at
the
model
size
that
fits
and
I'll
get
to
in
a
little
bit.
How
should
model
size
grow
as
we
increase
data
set
size,
larger
data
sets
require
larger
models
and
there
is
sort
of
a
theoretical
background,
for
that
would
help
us
predict
how
large
models
should
be
for
a
given
data
set
size.
B
If
we
have
some
point
of
reference
and
a
smaller
data
set
and
that's
model
capacity,
measures
like
the
VC
dimension,
so
VC
dimension
is
an
upper
bound
measure
of
capacity.
The
rigorous
definition
is
a
little
clunky.
It's
it's
given
the
model,
it's
the
largest
set
of
points
that
can
have
arbitrary
labeling
that
you
can
shatter,
meaning
you
can
split
and
distinguish
all
of
the
separate
points.
B
The
definition
here
is
not
so
important
as
kind
of
what
this
means
conceptually
when
we're
changing
models.
If
you
go,
look
at
some
prior
techniques,
try
like
sort
of
traditional
machine
learning
techniques
things
so
traditional
machine
learning
techniques
have
some
limitations
on
how
capacity
grows.
It's
it's
hard
to
grow
the
capacity
of
some
of
these
techniques,
and
so,
for
example,
decision
trees.
Have
capacity
is
something
like
order.
B
N
log
D,
where
n
is
the
number
of
leaves
in
the
tree
and
D
is
the
number
of
dimensions
of
each
data
sample,
and
so
what
does
this
mean?
It
means
that
sort
of
the
size
of
the
tree,
the
size
of
the
tree,
is
actually
probably
n,
plus
a
log
of
n.
The
size
of
the
tree
grows
roughly
linearly
with
the
capacity
of
the
model,
and
so
this
isn't
a
very
this.
B
Isn't
that
we'd
like
to
have
some
compression
factor
on
our
data
we'd
like
we'd
like
to
be
able
to
represent
it
with
less
with
a
structure?
That's
smaller,
it
grows
more
slowly
than
the
data
and
that's
where
deep
learning
is
really
deep.
Learning
is
really
nice
here,
so
Bartlett
in
2017
showed
that
deep
neural
networks
with
nonlinearities
have
capacity
WL,
log,
W
and
here
W
is
the
number
of
weights
in
your
model,
and
so
this
is
actually
proportional
to
the
size.
B
In
this
case,
and
L
is
roughly
the
depth
of
the
model,
it's
an
interesting
way
of
characterizing
depth
of
the
model,
and
so
the
the
what's
really
interesting
about
this
is
now
I.
Have
this
sort
of
trade
off
between
capacity
of
the
model
and
depth
but
I
also
have
this
really
nice
factor?
I
have
W
log
W
here,
meaning
the
the
capacity
of
deep
neural
networks?
Actually
grows
super
linearly
with
the
number
of
parameters
the
amount
of
storage
I
require
to
sorry
the
they
grow
super
linearly
in
the
storage.
This is not exactly what you're seeing if you're looking at computer vision models, which have a lot of convolution; you're doing some interesting ablation of the fully-connectedness of this kind of model when you're doing computer vision things. Okay, so now if I have this nice trend, if I can maybe rely on this, and these are tight bounds on this type of network, if I can rely on this trend that capacity grows super-linearly in the number of weights, I can start to predict, perhaps, how the model size should grow with my dataset size.
B
So,
let's
start
with
the
sort
of
simplifying
assumption
suppose
the
information
content
of
my
data
set
grows
linearly
in
its
size,
which
is
sort
of
a
reasonable
assumption.
I'm
going
to
pick
out
new
samples,
I'm
not
going
to
try
to
repeat
old
samples,
and
so
hopefully
I'm
giving
them
the
data
set
more
information
as
I
grow
it
and
so
model.
Given
that
capacity
grows
super
linearly
in
weights
model,
our
model
sizes
should
need
to
grow
sub
linearly.
B
In
the
data
set
size,
which
is
a
nice
property,
it
should
hopefully
be
less
than
linear,
and
if
you,
if
you
do
some
back
of
the
envelope
calculations
and
some
approximations
here,
it
should
probably
grow
greater
than
square
root
in
size
and
in
fact,
what
we
found
when
we
were
doing
our
studies.
This
was
true
across
all
of
the
applications
that
we
tested
five
different
domains.
This is a power-law trend fit of the curve showing that ResNets scale approximately with a 0.57 exponent. So if I am going to double my dataset size, I should take about a square-root-of-two increase in my model size, which is a really nice property when you want to start scaling.
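A quick back-of-the-envelope helper for that rule: project the required model size from a measured (data size, model size) reference point and a sub-linear exponent. The 0.57 value is the ResNet exponent he quotes; the reference numbers below are placeholders, and erring toward an exponent of 1.0 is the safer choice when you don't know it.

```python
def projected_model_size(ref_data, ref_params, new_data, exponent=0.57):
    """Sub-linear growth rule: params ~ data^exponent, anchored at a known point."""
    return ref_params * (new_data / ref_data) ** exponent

# Doubling the data with the ResNet-like exponent gives roughly a sqrt(2)x larger model.
print(projected_model_size(ref_data=1e6, ref_params=25e6, new_data=2e6))            # ~37M params
print(projected_model_size(ref_data=1e6, ref_params=25e6, new_data=2e6, exponent=1.0))  # 50M, conservative linear growth
```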
B
B
B
B
If
you
want
a
scale-
and
you
don't
know
exactly
what
exponent
to
use
here,
you
can
err
on
the
side
of
more
linear
growth,
so
we've
seen
the
bias-variance
tradeoff
I'm
gonna
reflect
on
it
again
because
it
gets
into
kind
of
some
of
the
things
that
deep
learning
does
like
I
just
said,
deep
learning
is
a
little
bit
fungible
on
on
parameters.
How
you
grow
models
so
given
mean
square
error,
is
the
bias
squared
plus
the
variance
plus
some
noise,
historically
and
sort
of
traditional
machine
learning
domains.
B
People
said
we
need
to
optimize
the
bias,
because
that's
the
squared
term,
that's
the
thing
I
want
to
minimize
in
practice
with
deep
learning.
There
isn't
really
always
a
trade-off
here.
There's
it
isn't.
The
bias-variance
tradeoff
doesn't
doesn't
really
come
into
play.
It's
so
as
an
example,
we
can
over
parameterize
our
model
and
a
lot
of
models
are
good
kind
of
ignoring
weights.
That
aren't
important
and
they
they're
still
there
but
they're
not
contributing
much
to
the
calculation,
and
we
can
do
things
like
regularization.
B
Regularization
techniques
can
force
those
weights
to
not
do
anything,
and
so
there
is
kind
of
a
heated
debate
in
the
deep
learning
community,
beginning
in
2017
until
early
2018
about
why
do
deep,
neural
networks
actually
generalize
when
they're
so
over
parameterised-
and
this
is
something
that
is
if
you
want
to
go-
read
some
heavy
research
papers.
This
is
a
place
where
you
can
get
some
very
deep
insights,
very
deep
understanding
of
generalization.
B
Ok,
so
what
we
want
to
do
is
we
want
to
engineer
biases
into
our
model
and
so
I'm
going
to
talk
through
a
few
of
these
different
biases.
So
inductive
bias
is
sort
of
conceptually
speaking.
I
want
my
model
to
be
able
to
deduce
something
I
want
it.
I
want
to
tell
it
how
to
make
deductions,
and
so
we
call
that
inductive
bias
and
we're
gonna
bake
in
these
assumptions
into
our
models.
This
is
a
sort
of
general
process,
so
there
are
some
examples.
You've
already
seen
and
I'm
gonna
try
to
extend
those
examples.
B
So
you
have
a
bunch
of
different
keys
and
sort
of
triangulation
points
to
go
back
to.
We
talked
about
how
computer
vision
applications
probably
should
be
translationally
invariant,
and
so
we
we
build
our
models
out
of
convolutions,
where
I'm
doing
applying
the
filter
at
every
pixel
in
the
in
the
image,
so
I
should
be
able
to
detect
an
edge
anywhere.
B
The
second
example
is
languages,
structured
sequentially.
So
what
does
that
mean?
The
words
that
are
coming
out
of
my
mouth
right
are
dependent
on
things
that
I've
previously
been
saying,
and
so
I
want
to
condition
what
I'm
saying
now.
On
the
previous
words.
Basically,
all
language
is
structured
like
this,
their
previous
conditions,
and
so
we
want
to
use
models
that
sort
of
integrate
this
knowledge.
We
want
to
use
sort
of
Bayes,
rule
and
kind
of
decompose.
B
These
conditional
probabilities
through
time
in
a
sequence
and
so
I
think
there
will
be
a
talk,
maybe
later
about
sequential
models,
recurrent
models
and
so
you'll
see
the
the
details
of
those
models
later.
But
we
put
this
inductive
bias
into
those
models
and
then
the
more
recent
thing
that
one
of
the
recent
set
of
models
that
people
are
using
our
transformer
attention
models,
and
this
is
based
on
so
we
now
have
this
conceptual
view
that
everything
we're
building
all
of
the
data
sets
were
going
after
or
have
hierarchy
hierarchical
structure
in
them.
B
One
way
to
do
that
is
actually
human
memory
is
structured
as
a
highly
associative
memory,
where
I
have
some
concept
in
my
head
and
it
Prime's
me
to
think
of
some
other
concept.
That's
highly
related,
so
I'd
like
to
put
in
associative
structures
in
my
models,
and
so
things
like
attention
mechanisms
can
do
that
and
in
particular,
hash
maps
are
or
key
value
stores
are
associative
memories
that
look
like
that
and
they
can
be
used
to
cache
these
models,
and
so
I
can
use
attention
mechanisms
as
a
probabilistic
associative
memory.
B
That
makes
sense,
and
so
I
want
to
make
sort
of
a
bold
claim
here,
but
and
and
take
this
with
a
grain
of
salt.
If,
if
you
are
a
data
data
structures,
expert,
there's,
probably
a
need
for
your
data
structures
in
probabilistic
models
like
deep
learning,
and
so
if
you--if
there's
a
data
structure
that
you
think
would
help
transforming
from
one
representation
in
your
model
to
the
next
one.
If
you
can
come
up
with
a
probabilistic
version
of
that
data
structure
that
can
be
used
in
your
model
and
it
might
actually
help
with
training.
B
I'm
going
to
touch
on
this
just
briefly:
inductive
bias
effects
the
these
learning
curves.
That
I
was
talking
about
previously,
and
so
we
tested
a
handful
of
different
models
to
look
at.
How
does
how
do
different
models
with
their
different
inductive
biases?
How
does
that
affect
their
learning
curve?
So
maybe
this
one
would
be
the
interesting
one
to
talk
about
in
speech
recognition.
B
We
train
models
called
deep
speech
to
which
is
a
recurrent
model
at
from
baidu,
and
then
we
also
put
together
an
attention
model
also
recurrent,
but
it
uses
attention
to
attend
to
different
positions
in
the
in
the
sequence
and
we
looked
at
the
learning
curves.
So
this
again
shows
this
power
law
characteristic
speech.
Recognition
happens
to
be
one
with
a
fairly
good
exponent
here
word:
language
modeling
is
really
bad.
This
one's
hard.
B
You
have
to
use
a
lot
of
data
in
the
case
of
speech
recognition,
if
you
talk
to
machine
learning,
researchers
that
do
speech,
recognition,
work
and
they
have
experience
with
deep
speech,
and
you
ask
them
how
our
attention
models.
How
well
do
they
do
every
one
of
them
will
complain
about
how
hard
it
is
to
train
them.
B
It
requires
more
data,
so,
in
practice,
what
we
did
is
we
actually
just
use
a
different
model,
a
language
model
to
help
the
speech
recognition,
attention
model,
formulate
language,
the
language
model.
It
gives
it
some
predictions
about
what
words
should
come
next
and
if
you
actually
plotted
this,
we
took
out
the
the
language
model.
If
you
put
the
language
model
back
in
you
see
this
come
down
and
it
does
better
than
the
deep
speech
to
model.
Mustafa talked about data augmentation, but maybe it's worth noting that data augmentation is another form of inductive bias that you're trying to impose on the model you're training. If we think that our model should have invariances like cropping, translating, and rotating, and we're building components of the model that should be able to recognize that, it would be nice to have samples in my dataset that encourage it to learn that, and that can be helpful for extracting out these hierarchies.
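A small sketch of encoding those invariances as augmentation in the input pipeline, using torchvision transforms; the specific crop size, flip probability, and rotation range are placeholders you would tune for your data.

```python
from torchvision import transforms

# Each transform encodes an invariance we want the model to learn:
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),        # translation / scale invariance
    transforms.RandomHorizontalFlip(p=0.5),   # left-right symmetry
    transforms.RandomRotation(degrees=10),    # small rotation invariance
    transforms.ToTensor(),
])
# Applied on the fly in the data pipeline, e.g. ImageFolder(root, transform=augment).
```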
B
One
other
one:
that's
that's
really
important
and
I
want
to
show
practically
speaking.
What
what
happens
in
this
scenario
is
how
you
initialize
models
when
you
so
initialization
is,
is
a
bias
on
what
you
think
the
models
prior
should
be
so
prior
probability
is:
if
I
was
just
predicting
the
outputs
and
I
just
had
the
output
data
set.
How
likely
are
these
things?
That's
the
the
probability
I'm
going
to
choose
is
the
you
know
how
frequent
the
one
of
the
outputs
is
seem.
That kind of gives me this characteristic that most of the embeddings are going to be orthogonal. I say most because in this case my vocabulary size is very large relative to the hidden dimension, and I can only have 256 orthogonal vectors in a latent space that has 256 dimensions. So when I uniformly randomly initialize, I get a lot of orthogonality, which is good, but my model now is going to have spurious relationships between these words.
B
Some
of
the
vectors
are
actually
going
to
have
a
nonzero
cosign
between
them
and
so
the
model
when
I'm
training
it
actually
kind
of
needs
to
unlearn
these
problems
these
these
relationships.
So
it's
worth
kind
of
thinking
through
this
in
in
examples
like
this,
when
you're
doing
initialization.
What
is
it
that
this
sort
of
initialization
causes,
and
is
this
kind
of
the
prior?
That
I
think
is
important
at
this
part
of
the
network
and
then
in
language
modeling.
We
do
something
that
Mustafa
was
mentioning
and
transfer
learning.
B
We
we
initialize
with
pre
trained
word
embeddings,
because
this
already
tells
us
that
you
know
king
and
queen
are
related
through
a
gender
relationship,
just
really
nice.
So
if
you
initialize
with
pre
trained
vectors
training
is
faster
you
you
could
train
the
embeddings
on
a
much
larger
data
set
than
than
the
thing
you're
training
on
later
and
so
training
the
full
model.
You
might
not
be
able
to
learn
all
of
those
embeddings
very
accurately.
That's
a
that's!
A
phenomenally,
interesting
research
direction.
B
There
are
people
are
that
are
continually
working
on
embeddings
right
now
and
specifically
for
large
vocabularies,
but
yeah.
So
the
I
mean
kind
of
a
rule
of
thumb
in
the
in
the
community.
Right
now
is
you're.
Embedding
dimension
in
language
modeling
you're,
embedding
dimension
could
be
like
roughly
the
square
root
of
your
vocabulary,
size
and
the
maybe
the
intuition
there
is
in
language,
modeling
and
using
multiple
vectors
to
represent
things,
and
often
there
are
combinations
or
sets
of
words.
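A one-line sketch of that rule of thumb, nothing more than the square-root heuristic he states:

```python
import math

def embedding_dim_for_vocab(vocab_size):
    # Rule of thumb from the talk: embedding dimension ~ sqrt(vocabulary size).
    return int(round(math.sqrt(vocab_size)))

print(embedding_dim_for_vocab(10_000))   # ~100
print(embedding_dim_for_vocab(250_000))  # ~500
```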
All right, so are there any questions on getting at what the data is telling us and seeing how we design our models for the data? All right, so I'll go through these things quickly. In the large research study that we ran, the trickiest thing was this: it was easy to scale back our datasets and run on small training sets; it was much harder to scale that back up and get to scale. I'm going to let future speakers cover some of this material, I think.
B
But
there
are
a
couple
things
practical
challenges
that
you
it's
useful
to
be
aware
of,
because
it's
nice
to
train
on
small
data
sets
okay,
so
as
an
example
for
image
classification,
when
we
were
training
our
models
for
this
research
study,
we
started
with
a
model
and
a
so
we
started
with
a
code
base
data
set
and
a
pre
trained
model
which
could
give
us
accuracy
that
was
state
of
the
art
at
the
time.
So
this
is
a
learning
curve.
Again,
this
is
the
dotted
line.
B
Here
is
the
learning
curve
I'm
not
sure
if
I
showed
the
learning
curve
for
image
classification,
but
this
is
what
it
looks
like
here's,
the
small
data
region,
here's
the
power
law
region,
the
small
data
sets
are
sorry.
The
large
data
set
the
large
model
we
had
trained
correctly
to
the
right
accuracy,
and
now
we
wanted
to
start
kind
of
backing
this
off
ablating
the
model,
reducing
the
data
set
size
and
see
if
we
could
to
sketch
out
this
curve,
and
the
reason
that
you'd
want
to
do
this
in
practice.
The reason that you'd want to do this in practice, for a lot of us in here, is that I'd prefer to pick a dataset size toward the beginning of this power-law region and do all my iteration there, because with a small dataset and small model sizes I can run lots of training runs very quickly down here. We would really like to be working in this region, and then at a later time, when everything works here, run down this curve, right?
B
So
as
we
were
kind
of
filling
in
points
going
backwards
here,
we
actually
saw
a
curve
that
looked
like
this,
which
was
very
problematic
this
this
guy,
so
that
the
procedure
that
we
were
using
was
decrease.
The
data
set
size
in
steps
decrease,
the
model
sizes
were
going
to
smaller
sizes
and
then
I
think
we
were
being
conservative
here
and
keep
everything
else
fixed,
and
so
we
got
this
thing
that
we
we
termed
as
the
small
data
gap
I
haven't
I
haven't
written
about
this
publicly.
Okay, so neither did we, so fair point. It turned out the batch sizes were too large for these small models. If you think about it, a smaller model has a smaller latent dimension; I have fewer parameters to represent a transformation that I'm making at any point in my optimization process, and what that leads to is that smaller models have lumpier error surfaces.
B
I
have
to
sort
of
compete
with
the
things
that
I've
already
learned,
and
so
what
I
mean
by
a
lumpy
error
surface
is:
if
I
continue
training
in
one
direction.
On
my
air
surface,
it's
likely
I
will
run
into
non
convexity.
My
error
will
get
worse,
and
this
is
what
exactly
what
was
happening
and
especially
when
we
were
using
batch
sizes
that
were
too
large
batch
gradient
descent.
B
So
because
we're
averaging
over
a
bunch
of
of
samples
with
a
large
batch,
the
the
averaged
gradient,
is
something
that's
not
actually
a
gradient
from
any
one
of
the
samples
and
so
going
in
that
direction
might
do
more
harm
than
good.
It
might
cause
the
model
to
unlearn
things
that
it
had
previously
learned.
B
So
there's
some
work
on
this,
and
maybe
Thorsten
can
cover
this
a
little
bit
more
detail,
but
there's
a
paper
that
describes
that
batch
size
can
grow
linearly
roughly
linearly
with
the
data
set
size
under
a
handful
of
assumptions
about
how
your
data
is
distributed.
This
turned
out
to
be
very
handy,
and
we
empirically
vetted
this
result
when
we
were
doing
our
study
that
it
was
it
worked
best
if
we
decreased
the
batch
size
linearly
when
we
started
seeing
something
like
this.
So
starting
at
this
point
decrease
the
batch
size
linearly.
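A minimal sketch of that heuristic: scale the batch size linearly from a reference point that worked, and clamp it to something sane. The reference values and clamps below are placeholders, not numbers from the study.

```python
def scaled_batch_size(ref_data_size, ref_batch_size, data_size, min_batch=16, max_batch=4096):
    """Batch size grows/shrinks roughly linearly with dataset size (per the paper he cites)."""
    batch = int(ref_batch_size * data_size / ref_data_size)
    return max(min_batch, min(batch, max_batch))

# If batch 1024 worked at 1M samples, use ~64 when training on a 1/16th-size shard.
print(scaled_batch_size(ref_data_size=1_000_000, ref_batch_size=1024, data_size=62_500))
```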
B
Because
of
this,
I
can
start
backing
out
a
whole
bunch
of
different
things.
I
can
I
can
make
a
prediction
given
an
accuracy
level.
I
can
predict
my
data
cost.
So
if
I'm
added
particularly
particular
accuracy
level
now
and
I'm
going
to
scale
my
data
set
size
and
it
costs
some
amount
per
sample,
I
can
estimate
how
much
it
would
cost
to
grow.
My
data
set
size
public
data
sets
are
very
helpful
and
there
are
ways
to
maybe
get
through
this
without
having
so
like.
B
Dataset
size
grows,
a
certain
amount
model
size
grows,
a
certain
amount.
My
computer
operations,
compute
operations
per
parameter
in
models
is
roughly
linear
and
the
number
of
training
steps
that
I
need
to
Train
on
a
larger
data
set
grows
roughly
linearly
in
the
data
set
size,
and
so
now
I
can
estimate
that
the
compute
time
that
I
have
grows
a
little
bit
less
than
the
square
of
the
data
set
size.
Growth
to
larger
sets.
Okay
and
I.
We
have
a
paper
on
this
also.
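Putting those rough scaling rules together in one hedged back-of-the-envelope estimate (model size grows sub-linearly with data, training steps grow roughly linearly, and compute is roughly model size times steps); the exponent here is a placeholder, and exponents closer to 1 push the result closer to the square of the data growth.

```python
def relative_compute(data_growth, model_exponent=0.57):
    """Compute ~ (model size) x (training steps):
    model size ~ data^model_exponent, steps ~ data, so compute ~ data^(1 + model_exponent)."""
    return data_growth ** (1.0 + model_exponent)

# Growing the dataset 10x with a sub-linear model-size exponent:
print(relative_compute(10))  # roughly 37x compute, less than the square (100x)
```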
And I will skip ahead; I'll call it here and jump to my summary.