From YouTube: 11 - Sequential Models - Luke de Oliveira
Description
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
And hopefully you'll stay through the rest of the week. It's my great pleasure to introduce Luke, who is a familiar face here at Berkeley Lab; he's worked with several people here in the physics division. He currently leads AI and engineering — well, he was at Vai Technologies, which was then acquired by Twilio. He's a mathematician from Yale, with a master's from Stanford, and he's also published a ton of work in generative modeling. Today he's going to talk about sequential models. Thank you.
So, what we're going to talk about today — my goal in today's lecture is pretty high-level. I want to set up sequence learning in the general sense. There are lots of tutorials flying around on Medium, random Jupyter notebooks you'll find; let's try to set up what sequence learning is from a more systematic perspective. Then I'm going to give a taxonomy of sequential models.
So: how can we set up the inputs and outputs for the different types of sequential models that you'll encounter, either in papers or in the different problem environments you're going to want to work in? Then, finally, we'll get into the deep learning building blocks that constitute these individual components. This is going to be kind of a whirlwind of information, so my goal at the end of the day is to give you the keywords to Google later, or to recognize when you read a paper.
B
You
know
kind
of
you
can
kind
of
tie
the
thread
together
to
what
we
learned
about
in
lecture
today.
So
I
don't
expect
people
to
kind
of
leave
knowing
how
to
go:
train
the
fanciest
sequence,
a
sequence
model
of
attention,
but
just
kind
of
you've
heard
all
of
these
terms:
kind
of
stitch
them
together
and
start
to
build
a
framework
for
understanding
sequence
models
so
particularly
were
gonna
go
through
our
ends,
I'm
going
to
go
through
cnn's
and
we're
gonna
go
through
transformers
at
a
super
high
level.
B
There's
a
lot
of
kind
of
details
and
nitty
gritty
that
goes
into
a
lot
of
these
models.
We're
gonna
cover
them
from
a
little
bit
of
a
higher
level
perspective,
guessing
the
tuition
all
towards
the
goal
of
having
a
good
way
to
grok
sequence,
learning
and
taking
models.
So,
let's
start
with
the
basics
of
sequence
learning,
so
the
typical
I'm
gonna
put
this
in
parentheses,
supervised
learning
setup
that
we
oftentimes
have
is
a
fixed
vector
input
and
a
fixed
vector
output.
B
So
this
is
I
have
readings
from
a
sensor
or
multiple
sensors
or
some
sort
of
sensor
array.
I'm
measuring
attributes
about
my
system,
if
I'm
working
in
sociology
and
maybe
I'm,
measuring
like
age,
race,
sex
income
level,
things
like
that
I
have
a
fixed
set
of
observables
about
my
system
and
then
I
want
to
predict
out
a
fixed
set
of
outcomes
so
signal
or
background
whether
or
not
machine
is
gonna
fail
in
the
next
t.
Time
periods
very
fixed.
B
Okay,
we
have
some
fixed
domain,
some
fixed
Co
domain,
and
we
want
to
learn
in
the
typical
setup.
As
discussed
in
the
earlier
lectures
within
the
summer
school.
We
want
to
learn
a
good
function
that
map's
our
inputs
to
our
outputs
matching
the
state
of
the
world.
So
we
have
some
F,
that's
actually
governing
this
process
going
from
domain
to
Co
domain
and
we
want
to
learn
a
good
mapping
for
that
turns
out.
Deep
neural
networks
tend
to
be
quite
good
at
this
problem.
B
We
learn
this
through
data,
but
this
doesn't
really
extend
to
this
eventual
case
so
where
this
is
fall
apart,
this
falls
apart
when
we
have
what
I
generally
like
to
call
very
attic
size,
so
we're
dealing
with
features
or
an
input
space
that
isn't
constrained
to
a
fixed
dimension.
How
do
we?
Where
do
we
encounter
these
things?
You
encounter
these
a
lot
when
you're
dealing
with
time
series
when
you're
dealing
with
kind
of
general
collections
of
objects,
we
can
measure
any
number
K
observables
about
our
system
on
each
of
these
we
deem
to
be
useful.
B
This
is
a
collection
of
things
it
doesn't
fit
into
a
fixed.
This
doesn't
fit
into
a
fixed,
dimensional,
vector
another
place
where
you
end
up
finding
this
a
lot
of
times
is
an
unbounded
spatial
domains.
So
normally,
when
you're
dealing
with
images,
you've
got
your
32
by
32
128
by
128,
it's
fixed
or
1024.
If
you're
worried,
you're
fixed
there,
you
aren't
dealing
with
kind
of
very
attic
sizes
but
oftentimes
in
a
lot
of
applications.
Your
images
are
going
to
be
coming
in
in
various
sizes.
B
Your
volumetric
measurements
are
coming
various
sizes
from
different
sensor
series
in
when
you're
measuring
some
fisher
process
and
like
a
groundswell
or
something
so
when
you
have
very
etic
size,
a
lot
of
the
standard
techniques
that
you
use
tend
to
fall
apart.
So
what
often
times
ends
up
happening?
Is
you
end
up
building
models
with
what
I'm
gonna
call
summary
features
or
reductions?
B
What
do
we
call
sequence
learning
so
I
defined
sequence,
learning
as
a
problem
domain,
where
at
least
one
of
the
input
or
the
output
is
of
a
sequential
or
very
etic
size
nature,
and
the
other
key
thing
that
we're
going
to
throw
into
this,
especially
for
today,
is
what
I
call
a
naturally
ordered
sequence.
So
we
want
to
have
an
order
to
what
we
have
in
our
sequential
data,
I.
Think
about
order
in
two
different
ways:
there
is
intrinsic
ordering,
which
is
more
commonly
called
strong
ordering.
B
This
is
where
there
is
like
a
natural
order
admitted
by
Nature
events
coming
in,
in
a
sequence
words
in
a
sentence,
different
samples
from
a
spatial
domain
over
time
these
are
have
a
natural
admitted
order
from
nature.
There
are
a
whole
other
set
of
orderings
that
come
might
get
a
lot
of
work
in
high
energy
physics,
and
this
comes
up
a
lot
in
sequence:
learning
in
high
energy
physics,
where
you
have
extrinsic
ordering.
So
this
is
an
ordering
that
is
not
coming
from
nature.
B
We
are
reading
if
you're
familiar
with
high
energy
physics
we're
reading
tracks
in
the
jet.
We
don't
actually
have
an
ordering
over
those
tracks,
so
we
kind
of
come
up
with
one
as
physicists
say.
The
transverse
momentum
should
be
the
the
the
ordering
should
be
an
order
of
transverse
momentum.
We
think
that
that
is
the
way
that
we
should
order
the
sequence
to
give
it
some
sort
of
weak
ordering.
So
you
see
this
a
lot,
especially
in
natural
science
applications.
So when we represent sequences, we have some sequence of vectors, and we index these on a general order — this can be time, or any of the extrinsic orders we talked about. Let's limit ourselves for a quick second to the case where we're doing typical supervised learning: how can we use traditional models to work on sequences? I'm not saying that you should — but how could we? Because this will give us a good inductive bias for how to construct deep learning models.
B
So
the
answer
as
I
was
mentioning
before
is
kind
of
summaries
or
reductions.
So
what
do
these
look
like
you'll
often
times
call
se
reductions
in
the
NLP
literature
refer
to
this
bags.
You
just
kind
of
take
a
bag
of
things
and
reduce
it
into
a
summary
feature,
but
assume
I
have
some
input
sequence.
How
do
I
get
that
into
a
single
number,
so
I
have
kind
of
maybe
a
vector
of
sensor
readouts
over
time.
Maybe
I
want
to
take
the
sensor
readout
from
sensor
I
and
take
the
mean
over
my
entire
time.
Sequence.
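As a minimal sketch of that kind of reduction (my own toy example, not from the talk): each variable-length sequence of readout vectors collapses to one fixed-size vector of per-sensor means.

```python
# Toy sketch: reducing variable-length sequences of sensor readouts to a
# fixed-size "summary feature" vector via a per-sensor mean over time.
def reduce_mean(sequence):
    """sequence: list of equal-length readout vectors, one per time step.
    Returns one fixed-size vector: the mean of each sensor over time."""
    n_steps = len(sequence)
    n_sensors = len(sequence[0])
    return [sum(step[i] for step in sequence) / n_steps
            for i in range(n_sensors)]

# Two events with different sequence lengths both map to 3-dim vectors,
# so a fixed-input model can consume them.
short = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
long_ = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0], [4.0, 4.0, 4.0], [6.0, 6.0, 6.0]]
print(reduce_mean(short))  # [2.0, 2.0, 2.0]
print(reduce_mean(long_))  # [3.0, 3.0, 3.0]
```

Any fixed-input model can now consume the output — at the cost of throwing away the ordering entirely, which is exactly the trade-off discussed next.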
B
You
oftentimes
see
this
done
in
kind
of
older,
older
school
literature,
but
this
has
some
fairly
kind
of
prominent
disadvantages
when
you
think
that
the
ordering
or
the
temporal
component
of
your
problem
actually
has
meaning
for
determining
and
of
the
outcome
of
Y
that
you're
trying
to
predict.
So
you
lose
ordering
that's
a
huge,
huge
issue,
but
in
certain
domains
that
actually
may
be
okay.
B
So
one
of
the
things
that
I'm
not
going
to
get
into
today,
but
is
super
relevant
for
having
these
sorts
of
reductions
or
unordered
sequences,
so
actually
taking
like
some
number
K
of
vectors
associated
with
an
event
with
an
example
and
predicting
something
about
them.
They
don't
have
an
order.
We
don't
even
have
an
extrinsic
ordering
that
we
can
impose
over
it.
B
How
do
we
reason
about
this
big
collection
of
things
there,
at
least
in
the
deep
learning
literature
today,
people
use
still,
because
we
don't
really
have
a
very
good
mental
model
for
how
to
reason
about
unordered
sets
basically
yet,
but
this
is
a
pretty
strong
inductive
bias
as
to
you'll
hear
me
say:
inductive
bias
a
lot
today.
This
is
a
very
strong
inductive
bias
for
what
is
important
to
your
problem.
You're,
basically
saying
I
think
most
of
the
predictive
power
I
don't
need
it
from
the
actual
temporal
component.
B
I
can
just
deal
with
the
features
summarized
in
a
higher
level
form.
So
what
are
the
pros
of
an
approach
like
this?
There
interpretable,
but
sometimes
I-
think
interpretability
is
something
that
it's
hard
to
pin
down.
It's
usually
very
use
case
dependent
what
we
mean
by
interpretable.
People
will
say:
linear
models
are
not
interpret
or
are
interpretable
depending
on
kind
of
their
view
of
the
world,
so
I'm
not
gonna,
get
into
into
arguments
about
interpretability,
but
people
claim
that
reductions
then
using
traditional
machine
learning.
B
Algorithms
that
are
kind
of
working
on
fixed
sized
inputs
and
outputs
are
interpretable,
but
I'll
leave
that
at
that
I
personally,
can't
think
of
many
other
pros
from
a
pure
modeling
standpoint,
sometimes
they're
a
lot
faster
to
train.
That
can
actually
be
a
very
useful
thing
if
you're
dealing
with
absolutely
massive
massive
sequences,
maybe
you'd
rather
run
your
reduction
job
on
Hadoop
spark,
whatever
you're
using
and
then
take
that
and
build
a
traditional
model
on
that
that
actually
might
be
totally
valid.
given the scale of the problem you're working on — but that's a pretty use-case-specific advantage. What are the cons of using these sorts of reductions? These cons will creep up as we start to build deep learning models, which essentially build continuous approximations of a lot of these sum/max-style operations that act as a reduction. First, they are brittle when you create features — as I'm sure everyone who's listened to a deep learning talk has heard, deep learning removes the brittleness of feature design. In general, if you've ever looked at pre-deep-learning, basic sequential learning, you see a lot of crazy features: the value of this variable dips in periods two and five, doesn't dip in period seven, and then is twice as much in period eleven. You see a lot of these pretty brittle features that maybe work — but boy, do they take a lot of time to test and validate.
I think the bigger question here is: are you inherently limiting your performance? And the answer is yes. If your problem has a temporal component, you're expressly limiting the amount of information you can get out of your data by doing these reductions. Very similar to the first point about brittleness, you actually have a very hard time representing what you want to represent when you have a sequence modeling problem. I might have an idea that something 30 time periods back will indicate what will happen 30 time periods from now. It could be right, it could be wrong — but actually building the features that get that into a model can be very time-intensive, and, I would argue, not the best use of time. And I think the biggest problem that arises out of thinking like this is that you have a hard time predicting sequences this way. This works fine when you have a sequence as your input and a fixed thing as your output.
But we also need to be able to predict sequences as an output. Most machine learning, when you think about it, is classifying, doing some regression, doing some sort of survival analysis — time-to-failure prediction. We need to be able to extend to the case where we're predicting a sequence of things into the future.
With a fixed history, I literally just take my sequence of vectors x_i and feed them into whatever model I want, so there's nothing really to do there. Now throw in a temporal component and work with time series: it is definitely not trivial. As I said, if we're dealing with a fixed sampling rate — and we know this because that's our setup: either we're dealing with something like audio, or we actually do have a fixed, metronomic cadence with which we're sampling from our sensors — we're fine.
We can actually ignore time for the most part. But most time series models — excuse me, most time series domains — don't have this fixed-sampling nature, so we have heterogeneous time differentials between samples, and that can be a real problem for deep learning models. I'll caveat this right now: I don't think we've actually solved this properly as a field yet; I think there's still a lot of research in this direction.
If this is something of interest to you, I think there are actually a lot of very high-impact papers to be written in this domain. Some people have thought about this by incorporating a time gate inside an RNN layer, but this is a really, really greenfield problem. People have hacks — I'll show you three hacks, basically, that get these heterogeneous-time-differential sequences to work — but at the end of the day we're actually not there yet as a field, and machine learning has had a hard time dealing with this for a long time. There are traditional spline-based methods that can work when you're dealing with very simple models, but for very expressive models we're just not quite there yet. So let's draw a picture — yes?
Yes — the classic one is time series from the stock market. You maybe have very regular samples during the day; on the weekend you don't have samples; on national holidays you don't have samples. So, constructing some sort of model, you need to interpolate between those. Another pretty classic thing that will happen is there's just a stochastic delay in the system you're sampling from: you're reading out from some sensor, and readings will come in at different times.
These machines can all be out of sync, so even if the sampling rate you're getting from each of them is constant, you get this staggered, weird pattern, and you need to be able to handle that kind of non-uniform nature. There are a lot of other cases that tend to come up.
Actually, audio is a really easy example. In a very clean case you'll be able to just have a constant sampling rate, but in a lot of real-world audio you get cutouts, and you have to stitch together a sequence from potentially different sampling rates — this audio was collected here at six hertz, that one with an elastic scheme — and that can be really, really problematic when you're trying to actually learn.
So let's draw a picture for this quickly. I have some sequence coming in — my x's — and these are all happening at different times: staggered, not uniform. How can we deal with that? There are basically three options if you want to use deep learning — there are a few others, but for intents and purposes there are three. The first one is you just ignore the temporal component. I tend to do this a lot; most people tend to do this a lot.
The second option is you can resample: you can resample your time series to make it uniform. And the third is you can use the time delta as a feature. So these are three very simple ways you can handle this.
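Here is a toy sketch of the three options on a made-up series of `(timestamp, value)` pairs (my own illustration, with hypothetical numbers, not from the slides):

```python
# Toy sketch: three simple ways to handle a non-uniformly sampled
# series of (timestamp, value) pairs.
series = [(0.0, 1.0), (1.0, 2.0), (2.5, 4.0), (6.0, 8.0)]

# Option 1: ignore time -- keep only the values, in order.
ignored = [v for _, v in series]

# Option 2: resample onto a uniform grid (here: carry forward the most
# recent observation at each grid point).
def resample(series, step, n_steps):
    out, j = [], 0
    for k in range(n_steps):
        t = k * step
        while j + 1 < len(series) and series[j + 1][0] <= t:
            j += 1
        out.append(series[j][1])
    return out

# Option 3: keep the values, but append the time delta since the
# previous sample as an extra feature.
deltas = [(v, t - series[i - 1][0] if i else 0.0)
          for i, (t, v) in enumerate(series)]

print(ignored)                   # [1.0, 2.0, 4.0, 8.0]
print(resample(series, 2.0, 4))  # [1.0, 2.0, 4.0, 8.0]
print(deltas)                    # [(1.0, 0.0), (2.0, 1.0), (4.0, 1.5), (8.0, 3.5)]
```

Each option produces something a fixed-cadence model can ingest; the trade-offs of each are discussed below.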
So what do these look like pictorially? If we ignore time, it's exactly what I said: just strip out the timestamps and stick all the vectors together, as if they were coming from a uniform sampling.
B
So
what
can
go
wrong
if
we
do
this
well,
what
could
go
right
first.
Is
that
you
ignore
a
lot
of
the
complexity
behind
the
temporal
component
and
you
can
feed
these
directly
into
whatever
models
you're
using.
So
that's
great,
but
the
big
downside
is.
You
can
actually
lose
a
pretty
critical
feature
of
your
data,
and
that
is
the
time
difference
between
current
time
step
and
the
previous
time
step.
That
can
actually
be
really
really
indicative
of,
for
example,
a
machine
failure,
for
example,
some
future
time
to
event.
B
In
your
sequence,
the
time
differential
can
be
super.
Super
super
important
I
will
say
that
it's
generally
not
a
bad
idea
to
try
as
a
baseline.
It
is
imposing
the
inductive
bias
that
time
differential
is
not
important
to
your
problem,
but
it's
a
reasonable
baseline
to
try
and
if
you
perform
well,
you've
saved
yourself
a
lot
of
data
processing
work.
So
that's
never
a
bad
thing.
B
So
another
option
is
we
can
resample,
so
I
can
kind
of
say:
hey
I
want
a
sample
here.
I'm
sampling
at
24
second
intervals
and
I
want
to
construct
a
sequence
that
sampled
at
individual
24
second
intervals.
This
will
be
perfect.
I
now
have
kind
of
a
uniform
sampling
rate
sequence
to
be
able
to
feed
into
my
model.
All
is
golden
right,
not
quite
what
do
I
do
with
X
3
and
X
4.
B
If
I'm
I
need
to
have
some
sort
of
reduction
to
bring
these
into
the
time
window,
so
people
type
people
oftentimes
do
some
sort
of
mean
averaging
they'll.
Take
the
latest
one,
so
they'll
just
take
X
4
and
call
that
X
3
people
have
a
lot
of
different
packs
in
order
to
get
the
sampling
rate
to
work
out
correctly.
B
What
happens
in
the
gap
that
is
going
to
lead
to
X,
for
how
do
we
get
a
sample
in
for
X,
for
when
we
have
these
and
uniform
time
windows
that
are
going
into
the
sequence
once
again
many
ways
of
doing
this
oftentimes
when
you
are
dealing
with
continuous
inputs,
when
my
exes
are
continuous,
we'll
do
some
sort
of
interpolation
to
get
kind
of
a
value
to
impute
into
X
4.
So,
let's
sort
of
cubic
interpolation
between
all
the
features
and
they'll
impute
that
into
export.
B
You
will
oftentimes
use
kind
of
a
backwards,
filling
approach
where
you
say:
hey,
I,
don't
have
anything
in
my
sampling
window
that
would
produce
my
export.
Let
me
just
take
the
previous
X
3
and
just
copy
that
over
you
can
argue
whether
or
not
that's
a
good
idea
or
not.
You'll
oftentimes
see
this
particularly
done
for
discrete
inputs.
So
if
I
don't
have
the
ability
to
do
some
sort
of
interpolation,
because
I
have
a
discrete
level
set
or
something
they'll,
just
kind
of
copy,
the
previous
level
set
value
over
to
the
next
time
set.
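As a minimal sketch of those two imputation styles (my own toy numbers — linear interpolation stands in for the fancier cubic variant mentioned above): interpolation for a continuous sensor, and carrying a value forward for a discrete one.

```python
# Toy sketch: imputing a missing grid point by linear interpolation
# between the nearest observed samples (continuous inputs), vs. copying
# the previous value forward (discrete inputs).
def lerp_at(t, t0, v0, t1, v1):
    """Linear interpolation of the value at time t between (t0, v0) and (t1, v1)."""
    w = (t - t0) / (t1 - t0)
    return v0 + w * (v1 - v0)

# Continuous sensor: observed at t=2.0 and t=6.0, grid point needed at t=4.0.
print(lerp_at(4.0, 2.0, 10.0, 6.0, 30.0))  # 20.0

# Discrete level set: just carry the previous level forward.
prev_level = "HIGH"
imputed = prev_level
print(imputed)  # HIGH
```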
So what does this give us? It gives us the ability to ingest data in the format models are expecting — boom, boom, boom, boom: discrete, uniform time steps. But, oh man, we are trading a lot for data pipeline and pre-processing complexity. You now also have an additional set of hyperparameters — or, let's say, data processing parameters — to tune. How do I want to aggregate multiple samples within a window?
B
Do
I,
take
the
mean,
do
I,
take
do
I,
take
the
mean
and
then
sample
from
some
sort
of
normal
distribution
around
that
to
absence
of
gas.
This
ax
t
do
I,
take
them
max.
How
do
I
do
that?
You
also
then
have
to
choose
the
sampling
frequency
and
my
little
example
here.
I
just
kind
of
drew
some
triangles
that
looked
approximately
equal,
but
for
a
real
application,
you're
gonna,
say:
okay,
I
have
all
these
things
coming
in
at
random
intervals?
how do I do that kind of forward-filling procedure I was describing before? When I don't have any samples in a time window that I need an x value for, what do I choose to put in there? By filling from the previous time step, you can get really, really bad biases. Maybe a really rare event occurred in my previous time step; I probably don't want that rare event to recur in my next time step, even though that's exactly what filling forward will do.
B
So
these
are
all
kind
of
things
to
consider,
and
yet
another
dimension
that
you're
gonna
need
to
kind
of
tweak
you're
a
data
pre-processing
pipeline,
which
I'm
sure
none
of
us
want
to
do
more
of
so
something
to
consider.
But
it
can
add
a
significant
amount
of
complexity
to
how
you
handle
your
data
before
passing
a
model.
The third option: the time delta for x1 is 0, and from then on we look at the time difference between the current time step and the previous time step — the timestamp differentials — and include that as a numeric feature. You can argue whether including a numeric feature for the time differential is a good or a bad thing; let's assume it's a good thing for now. This actually gives us a sense of how long it has been since our last event — our last sample from the sequence.
What does this give us? Once again, the ability to feed directly into any model that is expecting sample, sample, sample, sample. The one thing it definitely does do is over-index hard on the time differential as a feature. Maybe this is not relevant to your problem, and you just did a bunch of extra work — that's kind of why I was suggesting starting with the baseline that ignores time.
I think the other thing that can happen here is that this assumes I'm only really affected by the time differential from my last step. Should I care about the time difference between my current sample and two time periods ago? Three time periods ago? K time periods before that might be very important — I don't know. This will be yet another thing you'll have to tune: your look-back window for the time differentials.
We'll talk about this a little bit later with RNNs, but there are some really, really pathological compounding effects that can happen when you deal with sequence models, so having one number that's totally outside of the usual range can really throw a wrench in the machine — pardon the pun. Normalize features to comparable ranges. You don't need to normalize everything to zero mean and unit variance — not necessary — but just make sure everything is comparable. If you have a bunch of things between zero and one — 0.1, 0.03, 0.05, something like that — fine; don't also throw in something that can take a value of, like, 3000. That's really going to mess up the training of most of these sequence models.
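A minimal sketch of one way to do that (min-max scaling into [0, 1]; my own illustration — the talk doesn't prescribe a specific scheme):

```python
# Toy sketch: min-max scaling a feature into [0, 1] so one wide-range
# feature (e.g. values up to 3000) doesn't dominate features that live
# between zero and one.
def min_max_scale(column):
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

raw = [0.0, 750.0, 1500.0, 3000.0]
print(min_max_scale(raw))  # [0.0, 0.25, 0.5, 1.0]
```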
A second point, which is relevant for sequence models but also just for general fixed-input, fixed-output models — general supervised learning: use embeddings for discrete data. Using embeddings gives you a lot of pretty key advantages.
Essentially, what using an embedding does is say: all the levels of my discrete feature now map to a vector in a lookup table, and every vector in this lookup table is trained as I'm training my model, such that the vector represents where in semantic space this particular discrete level of my feature sits. The classic example of this comes from NLP — natural language processing — with word vectors, where, by training a model with words indexed to individual vectors,
B
You
get
some
really
really
nice
semantic
properties
in
the
vector
space,
particularly
able
to
kind
of
do
basic
arithmetic
on
word
vectors.
This
is
a
table
that
kind
of
floated
around
on
Twitter
and
medium
and
it's
kind
of
risen
and
fallen
in
popularity,
as
people
have
discovered
and
rediscovered
word
vectors.
But
you
can
do
interesting.
Things
like
take
Paris,
subtract,
France
and
add
Italy
and
you'll
get
Rome,
so
these
are
kind
of
semantic
vector
spaces,
and
the
idea
is,
if
you
take
your
discrete
feature
and
you
embed
it
in
a
vector
space.
B
While
you
train
the
vector
space,
should
emit
kind
of
an
interpretable
kind
of
set
of
arithmetic.
That
will
happen
even
if
it's
not
interpretive.
All
the
amount
of
information
you
can
encode
in
a
vector
of
some
dimension
is
like
definitionally
higher
than
that
of
just
using
a
one,
hot
encoding,
so
just
kind
of
using
a
dummy
variable
on/off,
you
can
get
some
pretty
big
gains
and
models
that
are
using
dummy,
yes/no
flags
for
a
discrete
feature,
but
I
just
using
embeddings.
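The Paris − France + Italy ≈ Rome arithmetic can be sketched with a tiny lookup table (hand-made 2-d vectors chosen so the analogy works — real embeddings are trained, much higher-dimensional parameters):

```python
# Toy sketch: a discrete feature's levels map to dense vectors in a
# lookup table instead of one-hot flags. Vectors here are hand-set;
# in a real model they are trainable parameters.
embedding_table = {
    "paris": [2.0, 9.0], "france": [1.0, 8.0],
    "rome":  [4.0, 3.0], "italy":  [3.0, 2.0],
}

def analogy(a, minus, plus):
    """Compute a - minus + plus in vector space, return the nearest word."""
    va, vm, vp = embedding_table[a], embedding_table[minus], embedding_table[plus]
    q = [va[i] - vm[i] + vp[i] for i in range(2)]
    def dist2(w):
        vw = embedding_table[w]
        return sum((q[i] - vw[i]) ** 2 for i in range(2))
    return min(embedding_table, key=dist2)

print(analogy("paris", "france", "italy"))  # rome
```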
Another cute image of what embeddings can do, just as quick motivation: they learn these kinds of superlative relations, which is quite awesome. You may not be able to see the text from the back, but you get interestingly similar relationships in vector space between "slowest", "slower", and "slow", and between "shortest", "shorter", and "short" — they occupy a similar arc in vector space. So it's a cute anecdote for why there's a lot of information to be stored in vectors, and why you should use them when dealing with discrete features.
So how do we organize the world of sequence learning? So far we've only talked about how we represent the inputs: what are some unique characteristics of the inputs of sequence models, how should we reason about them, how can we deal with time — all these nitty-gritty, data-level questions.
B
So
there
are
at
least
in
kind
of
my
super
reductivist
view
of
the
world.
For
the
purposes
of
this
lecture,
there
are
four
archetypes
of
sequence
models
I
stole
and
modified
this
image
from
Andre
Karpov
II,
as
did
not
have
the
one
too
many
I
added
the
one
too
many,
because
I
was
missing
so
for
kind
of
rough
prototypes
for
how
to
build
a
model.
The
first
I
like
to
call
predictive
so
we're
taking
sequence
of
inputs
and
predicting
a
fixed
output.
B
The
second
I
call
abstracted,
because
essentially,
what
we're
doing
is
we're
taking
a
sequence
as
input,
we're
learning
something
about
that
sequence
and
then
we're
generating
another
sequence:
we're
transducing
another
sequence
from
the
original
sequence.
So
this
became
super
popular
and
machine
translation
where
you're
taking
a
sentence
in
a
source,
language,
English
and
outputting.
The
translation
in
French,
so
you're
actually
transducing.
These
sentence
into
a
new
domain,
the
second,
so
that's
kind
of
many
to
many.
The
second
many
to
many
archetype
with
in
sequence,
learning
is
what
I
call
labeling.
B
So
this
is
saying
for
each
time
step
for
each
element.
In
my
sequence,
let
me
predict
something
this
can
either
be
predict
what
the
next
element
in
the
sequence
should
be,
or
this
could
be
predict
the
pressure.
This
could
be
predict
whether
or
not
this
is
a
signal
or
background
sample
from
my
sequence,
but
just
in
general
for
every
element
in
the
sequence,
we're
predicting
a
thing
and
the
final
a
little
bit
more
obscure,
maybe
archetype
of
sequence.
B
Learning
is
for
a
lack
of
my
creativity,
captioning
and
the
reason
it's
casually
is
because
it's
usually
used
for
image.
Captioning
kind
of
the
idea
here
is
I.
Take
a
fixed
dimensional
input
and
I
am
able
to
decode
a
sequence
off
of
that
fixed
dimensional
input
in
images.
This
is
taking
a
images,
input
and
producing
a
description
of
what's
going
on
in
the
image
or
answering
a
question
about.
What's
going
on
in
the
image,
I
haven't
seen
examples
of
this
in
scientific
domains.
B
Yet
I
think
it
would
be
really
cool
if
one
of
you
could
come
up
with
I
tried
to
Rack
my
brain
for
the
past
few
days
to
think
of
an
interesting
one,
but
I
pretty
narrow,
I've
only
kind
of
worked
in
very
specific
area
at
that.
So
I
think
it
would
be
really
cool
if
we
came
up
with
something
that
was
fixed
input
variable
length
decoding
today,
most
of
kind
of
the
insights
I
think
were
gonna.
Get
are
gonna
come
from
the
first
week.
So — just to repeat the question for those who didn't hear: when dealing with gaps in sequence data, should we have some sort of model that can give a machine learning model the ability to fill in the gaps as we're training? I think there are two schools of thought here.
If you do have some model that can fill in the gaps, fill them in, and then validate — yes or no — whether it actually affects your downstream task performance. Without knowing more details, that's probably the most useful thing I can say; I think it's one of those things to just evaluate from your downstream metric, however you quantify system performance.
B
Actually,
don't
see
why
that
wouldn't
work
I
haven't
seen
any
applications
of
that.
Since
most
of
these
things
are
developed
in
NLP,
you
usually
don't
see
some
sequences,
but
that
would
be
a
really
cool
architecture
to
try
I,
don't
think,
wouldn't
work
yeah,
so
I
would
say,
promote
if
you're
not
dealing
with
language.
B
You
would
learn
the
embeddings
as
part
of
your
training
procedure
I'm.
The
idea
is
you're,
embedding
tool
aligned
such
that
they're
maximizing
the
dimensions
of
information
that
are
most
important
to
your
problem.
If
you're
dealing
with
text,
not
sure
if
you
will
be,
you
can
use
pre-training
betting's,
but
kind
of
the
idea
of
embeddings
extends
well
beyond
kind
of
dealing
with
pre-trained.
Where
to
the
fact
that
comes
from
google,
you
can
use
embeddings
and
they
will
align
with
in
training
to
be
maximally.
So, with many-to-one: I would say this is the archetype you will see the most. There are tons of applications of it, and taking a sequence in and predicting something about the sequence is usually pretty low-hanging fruit: taking a sequence and saying signal or background; taking an input video, for example, and stating whether the video is, I don't know, A or B. But the really key thing to get out of the many-to-one archetype is that a lot of what you learn about fixed-input, fixed-output models
B
You
can
use
to
reason
about
these
as
well.
You're,
probably
going
to
be
using
the
same
loss
functions.
I
doubt
you'll
be
changing
things
there
and
when
you
start
dealing
with
sequential
outputs
your
losses
starting
getting
a
little
bit
different
little
bit
wonky,
but
for
variable
in
fixed
out
a
lot
of
the
mental
models
that
you
develop
for
kind
of
standard,
fixed
and
fixed
out.
Learning
tend
to
work
pretty
well,
so
you
can
kind
of
swap
on
a
classification
layer,
regression
layer.
B
You
can
do
some
sort
of
survival
analysis
thing
tack
it
on
top
and
those
should
all
kind
of
work
out
of
the
box
for
many-to-one.
One
key
thing
about
all
these
archetypes,
and
this
one
in
particular,
is
the
flexibility,
so
I
kind
of
hinted
at
it
just
kind
of
rambling
there
a
second
ago,
but
the
inputs,
the
kind
of
the
red
boxes
that
go
into
this
archetype.
They
can
be
really
really
complex
data.
Each
one
of
these
can
be
an
image.
There's nothing stating that this has to be a boring vector of inputs. Let's say you have some readout from a telescope — some big image — and it's evolving over time: that can be used here. Every input to the sequence can be an image. So there's nothing about this that restricts the modality of the red boxes, which makes all of these frameworks quite flexible.
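A minimal many-to-one sketch (my own toy example with hand-set weights; a real model would learn both the pooling and the head): pool the variable-length sequence into one fixed vector, then apply a standard fixed-input head.

```python
# Toy sketch of a many-to-one model: variable-length sequence -> pooled
# fixed vector -> standard fixed-input head (here a hand-set linear scorer).
def mean_pool(sequence):
    n = len(sequence)
    return [sum(x[i] for x in sequence) / n for i in range(len(sequence[0]))]

def linear_head(vec, weights, bias):
    """Fixed-input scorer; swap in classification/regression as needed."""
    return sum(w * v for w, v in zip(weights, vec)) + bias

seq = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]  # 3 time steps, 2 features each
pooled = mean_pool(seq)                      # [2.0, 2.0] -- fixed size
score = linear_head(pooled, [0.5, -0.5], 0.1)
print(score)  # 0.1
```

The same structure carries over when each element of `seq` is itself an image feature vector; only the pooling and head change.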
So let's talk quickly about the second archetype, which is many-to-many. You'll oftentimes see this referred to in — big scare quotes — deep learning modernity as sequence-to-sequence learning, even when sometimes it's actually not sequence-to-sequence. The general idea, as was alluded to before, is that you're taking some length-K sequence as input and producing a length-L sequence as output, where K and L can vary and can be completely independent.
B
That
can
be
quite
a
hard
problem.
So
in
the
deep
learning
archetype
for
this
you
end
up
breaking
down
your
world
into
an
encoder
and
a
decoder,
the
job
of
the
encoder
and
a
many-to-many
model
that
follows
the
sequence,
a
sequence
paradigm,
but
the
job
of
the
encoder
is
to
take
the
input
sequence
and
summarize
it
I
said:
summarization
was
bad
at
the
beginning.
B
Whether
this
might
be
new
weather
samples
for
the
next
seven
time
steps.
This
could
be
taking
a
source
sentence
and
returning
a
summary
of
the
sentence.
But
I
need
to
have
this
kind
of
summarize
that
this,
like
this
summary
in
the
middle,
this
vector
this
fixed
length,
vector
B,
maximally
informative
for
my
decoder
to
work
well.
B
So a key, key question here is: if you put on your traditional-ML hat, we're normally just predicting a single fixed thing. How do I predict a thing that is variable-length? That can be quite tricky. How do I know that I should stop predicting at time step 7? Is that right? Is that wrong? Who knows? How do we reason about that?
B
So this gives us a differentiable approximation to this stop, and this is commonly referred to as a stop token in NLP, but I think the application is much broader than just NLP. Being able to predict when to stop decoding a sequence is super critical if you have this different length between your input and your output. Most of the problems this has been used on haven't been that long, as you would imagine, but I don't actually imagine this being any different.
B
If
the
signal
is
there
for
the
stop
to
be
predicted,
it
should
be
able
to
learn
that
this.
There
are
other
kind
of
engineering
concerns
that
come
up
with
very
very
long
sequences
they're
just
really
hard
to
fit
on
a
GPU.
But
if
the
signal
is
there
for
you
to
stop,
if
there
is
like
signaling
the
noise
to
learn
when
to
stop
I,
don't
see
why
I
wouldn't
kind
of
be
able
to
perform
that
once
again,
I
would
test
it,
but
there's
nothing
kind
of
theoretically
or
from
a
prior.
B
That
would
tell
me
that
that
wouldn't
work
yeah
so
in
the
in
the
I'll
talk
about
LLP
and
then
non
NLP
and
non
in
NLP.
You
just
kind
of
have
an
additional
word
in
your
vocabulary,
which
is
stop
in
non
NLP.
You
end
up
having
two
losses:
you'll
have
your
loss,
that's
for
your
problem
that
you
actually
care
about,
which
is
predicting
the
next
element
of
the
sequence.
So
this
could
be
predicting
out
the
sensor
read
us:
did
we
predicting
temperature
pressure
or
whatever,
so
that
could
be
at
some
regression
loss?
B
Then
you
have
an
additional
loss,
which
is
like
a
classification
loss
which
will
predict
for
the
next
took
and
whether
or
not
X
or
not,
for
the
next
element
of
the
sequence,
whether
or
not
exile.
So
you
end
up
kind
of
having
this
two
losses.
Of
course,
then,
there's
a
whole
other
set
of
problems
which
arise
and
how
to
balance
these
two
losses,
but
in
principle
there's
kind
of
nothing
that
stops
you
from
combining
a
classification
loss
that
tells
you
when
to
stop
from
a
domain
loss.
That
tells
you
what
your
next
sequence
should
be.
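As a rough sketch of that two-loss setup (all names here are hypothetical, and the fixed loss weighting is just illustrative; in practice balancing the two terms is itself a tuning problem):

```python
import numpy as np

def decoder_step_loss(pred_value, true_value, stop_logit, is_last, stop_weight=1.0):
    """Combined loss for one decoding step: a regression loss on the
    predicted sequence element plus a binary cross-entropy loss on a
    'stop' classifier (target 1 = this is the last element)."""
    # Domain loss for the prediction we actually care about
    # (e.g. temperature, pressure): plain mean squared error.
    reg_loss = float(np.mean((pred_value - true_value) ** 2))
    # Binary cross-entropy for the stop decision, from a raw logit.
    p_stop = 1.0 / (1.0 + np.exp(-stop_logit))
    target = 1.0 if is_last else 0.0
    stop_loss = -(target * np.log(p_stop + 1e-9)
                  + (1.0 - target) * np.log(1.0 - p_stop + 1e-9))
    return reg_loss + stop_weight * stop_loss
```

Summing the per-step losses over the decoded sequence gives the total training objective.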
B
So
the
second
archetype
within
many
to
many
is
what
I
like
to
call
sequence,
labeling,
so
sequence,
labeling
is
super
super
flexible.
Actually,
there
are
kind
of
two
things
that
people
generally
do
with
sequence:
labeling
the
either
are
predicting
the
next
time
step.
What
is
my
value
of
my
sequence
going
to
be
and
time
T,
plus
one
or
they're,
predicting
some
observables
that
they
care
about
from
the
system
at
time
T?
B
So
these
are
kind
of
the
two
approaches
that
people
tend
to
take
you'll
oftentimes
see
for
problems
where
we
care
about
forecasting
just
for
the
next
time
step.
People
will
set
up
their
problem
like
this.
So
if
I'm
doing
a
stock
prediction
problem,
let's
say
I,
you
know,
stock
prediction
is
pretty
trite,
but
it's
kind
of
a
classic.
You
would
feed
in
the
stock
prices
at
every
time
step
and
just
be
predicting
out
the
stock
price
at
the
next
time
step.
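A minimal sketch of how that next-step setup becomes supervised (input, target) pairs; the sliding-window scheme here is one common choice, not the only one:

```python
import numpy as np

def next_step_pairs(series, window):
    """Turn a 1-D series into (input window, next value) training pairs
    for next-step forecasting, as in the stock-price example."""
    X, y = [], []
    for t in range(len(series) - window):
        X.append(series[t:t + window])   # values at steps t .. t+window-1
        y.append(series[t + window])     # value at the next step
    return np.array(X), np.array(y)
```

Any regressor (an RNN, a 1-D ConvNet, or even a linear model) can then be trained on these pairs.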
B
This
will
oftentimes
come
up
when
you
are
dealing
with
and
a
sequential
forecasting
where
you
just
kind
of
care
about
the
next,
the
case
where
you're
predicting
something
about
the
current
time
step
T.
That
also
tends
to
arise
pretty
frequently
in
NLP.
This
is
an
a
predicting
part
of
speech
or
word,
but
more
broadly,
this
can
kind
of
be
whether
or
not
an
element
of
my
sequence
of
machine
readouts
is
a
failure.
Mode
or
not.
B
I
can
be
predicting
whether
or
not
the
machine
is
in
an
arid
state,
at
kind
of
any
point
along
my
time
series
which
is
reading
out
kind
of
something
from
a
sensor
greater.
So
this
is
a
pretty
flexible
framework
for
being
able
to
reason
about
every
state,
so.
B
The
one-to-many
archetype-
this
is
the
one
that
I
said:
I,
don't
have
good
examples
from
kind
of
something:
that's
not
NLP,
so
this
in
in
the
world
of
NLP.
This
is
the
case
where
we
will
take
either
an
image
or
a
fixed
piece
of
data
and
decode
a
description
of
the
data
in
kind
of
natural
language.
So
the
idea
here
is
you,
take
fix
input
and
at
every
time
stuff,
as
you're
decoding
you're,
looking
back
at
the
image
or
you're.
B
Looking
back
at
the
fixed
amount
of
data
and
you're
saying
what
should
I
predict
next
until
I
then
predict
you
to
stop
decoding
my
sequence.
I'm
example
that
maybe
doesn't
fall
into
NLP
is
given
a
given.
The
frame.
1
can
I
predict
the
next
4
24
frames
of
video.
You
could
do
something
like
this
to
kind
of
generate
little
videos
from
a
static
image.
I
could
imagine
that
potentially
being
useful,
but,
like
I
said,
the
canonical
problem
really
is
image
captioning.
B
There are some idiosyncrasies of sequence learning models that we need to have reflected in our deep learning building blocks: temporal invariance, and we need to be able to control the explosion and reduction of gradients. There are a lot of edge cases that come up in sequence learning, so let's take a look through some of the basic building blocks here.
B
So
this
is
a
very,
very
active
area
of
research,
I'm
kind
of
almost
every
day,
some
new
big
company
or
big
research
group
has
a
new
variant
on
an
existing
kind
of
approached
for
modeling
sequences.
We're
gonna
cover
the
main
three,
the
first
two,
a
little
bit
more
in
depth
in
the
third
we're
going
to
cover
rnns
recurrent
neural
networks,
convolutional
neural
nets
and
transformers.
There
are
a
ton
of
other
variants
which
you
will
see
come
up.
There
are
quasi
RI
towns.
B
There
are
like
a
attentive
convolutions
they're,
like
gated
convolutions,
there's
a
whole
kind
of
zoo
of
modifications
of
these
three
building
blocks
that
have
come
up,
but
my
hope
is
that
by
just
kind
of
reasoning
through
the
these
three
basic
ones,
you'll
be
able
to
get
a
sense
of
what
the
field
looks
like
from
high
level
and,
like
I
said
at
the
very
beginning,
be
able
to
Google
and
reason
about
things
that
you
read
in
Ivor.
So,
let's
start
with
Ardennes,
so
the
core
of
an
RNN
is.
B
We
want
to
have
a
neural
network
unit
that
can
learn
kind
of
a
sequence
based
dependency
over
time.
And
what
does
this
look
like?
So,
if
I
have
some
input,
that's
coming
in
X
kind
of
over
time,
sequentially
I
want
to
be
able
to
kind
of
loop
back
through
modify
a
state
and
produce
an
output.
So
this
is
a
hard
diagram
to
reason
about.
Why?
B
Don't
we
unfold
this,
where
it's
a
little
bit
easier
to
reason
about
every
time
step,
I
have
a
vector
X
coming
in
I'm,
basically
taking
a
look
and
saying:
do
I
want
to
modify
my
state
and
the
idea
is
this
internal
state,
as
we
kind
of
chunk
through
the
sequence,
we'll
learn
useful
things
about
my
task
at
hand
predicting
why,
for
example,
so
I
have
this
internal
state?
I
was
like
chunk
through
my
sequence:
I
am
learning
what
to
store.
B
What
not
to
start
a
key
element
here
is
at
the
transformation
that
is
kind
of
going
and
taking
X
and
producing
this.
The
hidden
state
is
invariant
per
time
step.
So
I
basically
have
one
neural
network
unit
that
is
just
getting
stamped
out
four
times
up,
so
that's
pretty
powerful.
That
gives
us
kind
of.
We
don't
have
to
engineer
features
that
kind
take
into
account
the
temporal
component
anymore.
We
have
something
that
just
gets
applied
time
step
by
taxa.
B
So what are some issues that can arise with a model such as this? We're going to take a quick detour into the internals of how a vanilla RNN works to try to shed some light on this. This is the most basic RNN you can draw. Let's walk through what the math is telling us here, quickly, at an intuitive level. We're saying: this is the state that I'm going to next, given my input. So how do I get to that state?
B
So what happens when we try to take a gradient? You don't need to memorize the proof; this is just to get some intuition for why training RNNs is hard. We're going to take the derivative of some loss function with respect to the parameters W_rec. This is the thing that takes the previous state and moves it to the next state. So there's an
B
Element
in
here,
which
is
when
we're
taking
the
derivative
of
current
state
with
respect
to
every
previous
state
that
has
happened,
kind
of
in
my
sequence,
but
every
previously
thought
I
predicted
in
my
sequence,
so
is
actually
a
product
of
jacobians.
So
we
don't
need
to
know
how
we
get
to
this
proof,
but
let's
kind
of
step
through
kind
of
what
the
math
will
tell
us.
B
So
we
have
this
and
an
infinite
product,
this
very
large
product
and
I'm
over
my
entire
length
of
my
sequence
of
multiplying
all
of
these
psi
psi
minus
1
0
o'clock.
So
we
can
decompose
it
and
then
we
can
use
some
tricks
to
try
to
bound
the
norm
of
every
single
element
in
this
very,
very
long
product
about
and
because
we
know
how
to
decompose
this
as
a
product
which
ability
is.
B
We
can
then
split
apart
w
rec
itself
a
diagonalize
component,
and
we
can
bound
this
above
by
the
two
eigenvalues
one
of
w
one
of
diagonal
component.
Here.
We
don't
need
to
worry
how
we
get
there.
But
the
interesting
thing
here
is
since
every
single
element-
WSI
sorry
D-
si
over
D
si
minus
one
is
kind
of
a
product
of
two
numbers
that
are
eigenvalues
when
we
take
this
really
really
long
product
you're,
taking
a
lot
of
things
and
multiplying
them
together.
B
When
you
take
a
lot
of
things
and
multiplying
them
together,
you
kind
of
have
a
point
of
criticality
around
one.
So
if
you
have
small
eigen
values,
you
can
rapidly
go
to
zero
and
the
gradient
can
rapidly
go
to
zero
or
if
you
have
very
big
eigenvalues,
the
gradient
can
explode.
So
you
have
something
that's
getting
multiplied
over
and
over
and
over
again,
and
your
point
of
criticality
ends
up
around
one.
B
You
can
explode
or
you
can
kind
of
shrink
if
it
doesn't
ways
fall
in
your
gradient
calculation,
so
that
can
be
super
super
problematic
when
you're
dealing
with
very
long
sequences.
So
this
is
kind
of
an
intuitive
argument
for
if
we
take
the
norm
of
kind
of
an
element
of
my
of
the
gradient
I
can
reason
about
what
will
happen
when
I
have
very,
very,
very
long
Z.
Yes,.
B
So the question arises: how do we fix this? This is commonly referred to as the vanishing or exploding gradient problem. Intuitively, the solution is to decide what to forget and what to remember. Essentially what ends up happening, if you want to personify an RNN, is that they have really, really vivid memories of things that are totally irrelevant, which is kind of a funny mental model to have about an RNN. So we need to control
B
How
updates
affect
that
little
very,
very
large
product
or
very,
very
small
product
that
happens
in
the
middle
of
the
gradients
of
so
the
solution
is
list
yet
from
hope,
writer
and
Shui
Huber,
so
I'm
just
going
to
talk
about
Alice
Ian's
in
particular,
because
they
have
they
were
kind
of
the
first
to
control
this
mechanism,
but
there
have
been
a
lot
of
other
developments
around
here,
use
other
variants
of
the
state
management
of
our
intent.
That
absolve
this
problem
intuitively.
B
What
an
LS
TM
does
is
it's
allowing
for
a
trainable
decision
function
to
be
embedded
inside
the
Dora
and
what
this
will
do.
Is
you
basically
now
have
a
trainable
differential
ability
to
forget
and
remember
state,
so
at
every
time
step
your
model
is
saying:
do
I
want
to
remember
what
happening
now?
Do
I
want
to
forget
it?
B
Do I want to write it to my memory, or do I read from memory? So you now have four operations that help you systematically process your sequence in a way that won't let you dramatically forget, or over-remember, irrelevant time steps in your sequence. Pictorially (I stole this image from Michaela):
B
Specifically,
they
have
kind
of
three
key
mechanisms
that
are
added,
no
need
to
know
the
math
that
goes
into
this
directly,
just
kind
of
useful
to
take
a
look
at
the
picture
and
understand
the
individual
components
that
are
affecting
this
behavior.
So
there's
a
forget,
eight,
the
forget
gate
is
basically
saying
how
much
of
the
previous
state
should
be
retained
in
the
new
state.
How
much
should
I
forget
the
input
gate
is
deciding
how
much
of
the
current
times
that
actually
matters
for
the
problem.
B
I
might
process
a
time
step
and
look
at
it
and
say
hey.
This
is
totally
irrelevant
for
my
task.
Just
forget
it
and
the
output
gate
is
kind
of
controlling
the
the
mixture
of
all
of
these
gates
in
composing.
The
final
state,
so
you've
got
three
trainable
components.
Basically
that
are
deciding
what
to
keep
and
that's
a
really
really
powerful
mechanism
that
helps
us
control
the
gradient,
but
also
gives
us
a
pretty
interpretable
way
of
reasoning
about
the
like
differentiable
state
of
Merida.
B
Can
we
have
a
state
to
process
a
sequence
in
Reverse
and
this
ends
up
being
actually
quite
important
in
a
lot
of
NLP
applications
and
actually,
even
in
high
energy
physics,
it
was
figured
out
that
the
bi-directional
are
n
ends
which
we'll
talk
about
in
a
second
here
are
very,
very
useful
for
performing
law.
Yes,
not
necessarily,
there
is
kind
of
this
notion
of
memory,
so
you
can
like
take
a
state
and
write
it
into
like
you
have
kind
of
a
matrix
that
gets
passed
in
through
an
LSD
M.
B
That
is
the
memory,
so
you
can
take
a
state
and
like
shove,
it
into
memory
and
then
recall
it
again
in
a
future
subset,
so
you
can,
for
that
would
be
the
H
cool.
So
if
we
kind
of
run,
we
can
run
an
RN
n
forwards
and
we
gonna
run
an
RN
and
backwards,
and
this
gives
us
an
interesting
ability
to
reason
about
what's
happening
in
the
sequence
and
a
forward
or
backward
direction,
which
can
actually
be
super.
B
Super
powerful
you'll
end
up
finding
a
lot
of
different
information
is
stored
in
a
sequence
when
processing
it
in
Reverse
counterintuitive,
but
it
ends
up
happening
a
lot.
There
was
a
really
interesting
paper
where
they
were
actually
able
to
get
very.
This
is
a
while
back.
They
were
able
to
get
reasonable
performance
in
machine
translation,
processing
sequences
entirely
in
Reverse
and
the
places
where
that
model
produced
errors
were
actually
very
different
than
the
model
that
was
run
forwards
so
just
kind
of
an
interesting
exercise
in
the
orthogonality
of
information.
B
That's
learned,
depending
on
how
your
sequence
is
presented
to
the
model.
So
how
does
an
RNN
like
this
fit
with
our
archetypes?
So
in
the
many-to-one
case,
we
can
kind
of
take
this
R
and
n
this
process
and
things
boom-boom-boom,
and
we
can
take
the
last
hidden
state
and
that
last
hidden
state.
Ideally,
if
our
model
is
well
trained,
will
kind
of
be
a
trained
summary
of
everything.
That's
happened
in
the
model
as
far
so.
The
idea
here
is
we
take
this
last
hidden
state.
B
This
is
a
fixed
dimension
and
this
can
then
be
fed
into
whatever
loss
I'm.
Using
for
my
problem,
I
can
then
kind
of
have
whatever
outputs
I
need
to
reach
the
dimensionality
for
my
target,
but
I
can
kind
of
take
the
last
state
of
an
RNN
and
use
that
directly
and
how
many
to
one
problem
oops
in
the
mini
from
any
case
I,
can
kind
of
take
the
last
hidden
state
of
an
RNN.
B
The
very
very
last
time
step
and
I
can
use
that
as
the
initial
state
for
my
decoder,
so
I
can
take
okay
I
process,
my
input
sequence.
What
do
I
think
about
the
full
sequence
now
feed
that
as
the
initial
state
to
my
decoder.
That's
then
making
reason
reasoning
about
how
to
produce
my
output
C's
in
the.
B
Kind
of
already
have
our
work
done
for
us,
we're
predicting
a
hidden
state
at
every
single
time
step
as
we
proceed,
so
why
don't
we
just
tack
a
layer
on
top
of
the
hidden
state
and
be
able
to
predict
something
about
the
current
time,
step
T.
So
every
single
time
step
we
in
our
state
predict
something
about
it.
Get
our
state
predict
something
about,
so
this
kind
of
naturally
falls
out
of
using
our
Don's.
B
The
interesting
thing
about
the
one
to
many
cases,
we're
essentially
taking
let's
say
we're
dealing
with
image
captioning
you
can
kind
of
take
the
full
image,
have
some
feature
Iser
for
the
image.
Take
a
pre
train
ComNet,
for
example,
and
you
can
stick
that
as
the
initial
hidden
state
in
the
decoder.
So
then
you
can
generate
the
sequence
from
kind
of
an
initial
state
which
is
governed
and
conditioned
by
the
fictive,
so
our
n
ends
for
the
summary
level.
B
We
are
processing
time
steps
one
at
a
time
and
we're
modifying
the
internal
state
to
keep
track
of.
What's
useful,
for
the
task
at
hand
are
n
ends
are
governed
by
this
time,
invariant
transformation
that
is
applied
to
every
element
in
the
sequence
kind
of
regardless
of
if
it
comes
first
or
last,
and
you
can
also
traverse
sequences
backwards,
which
can
be
very
useful.
Oftentimes
you'll
learn
relatively
orthogonal
pieces
of
information
to
what
you
would
learn:
processing
the
sequence
forward.
B
So
talk
about
the
second
major
building
block
of
deep
learning,
career
sequence,
earning
we'll
talk
about
convolutional
neural
nets,
so
usually
calm,
Nets,
I
thought
about
for
image
problems
where
we're
dealing
with
kind
of
2d
patches
of
an
input,
input,
image
that
are
having
filters
applied
to
them.
This
can
be
extended,
pretty
trivially
to
deal
with
sequences.
So
this
is
a
standard
convolution
operator.
B
Essentially
what
we
can
do
is
we
can
form
image
if
you
will,
where
we
take
out
every
vector
and
we
kind
of
line
them
all
up
in
a
row,
that's
kind
of
a
vector
image,
and
then
we
can
take
a
convolution
in
one
dimension
across
that
sequence.
So
pictorially
and
of
what
happens
as
we
take
on
the
diagram
on
the
left.
B
Once
again,
that
word
summarizes
summarizes
the
features
that
are,
in
my
sequence,
so
just
to
kind
of
illustrate
how
we
parameterize
continents,
they're,
usually
parametrized
by
the
stride
and
the
filter
size.
The
start
in
the
filter
size
are
controllable
and
end
up
kind
of
imbuing,
a
lot
of
structural
priors
that
you
might
have
about
your
data
on
the
Left
I've
shown
tried
one
filter
size
three,
but
as
I
increased
my
stride,
every
single
feat,
every
single
kind
of
element
that
comes
out
of
my
convolution
operator
is
has
a
wider
like
kind
of
look
around
window.
B
If
you
will
every
element
that
comes
up
knows
more
about
its
context
than
if
I
have
a
smaller
positive
and
Genesis
stride
parameter
tells
us
how
much
we
want
these
filter
windows
to
overlap
so
I
can
control.
Both
of
these
will
affect
my
dimensionality
and
it'll,
also
kind
of
affect
the
degree
to
which
context
is
carried
and
to
my
window.
So
this
is
kind
of
a
pretty
powerful
thing
to
be
able
to
control
both
of
these,
which
can
affect
kind
of
the
learnability
of
the
problem.
B
So
the
simplified
math
behind
and
of
what
will
happen
in
CNN
filter
is
each
of
those
arrows
that
I
had
in
the
previous
slide
or
a
vector,
and
essentially
my
output
is
just
each
of
these
vectors
dotted
with
the
input
vector,
X,
I,
so
fairly
simple,
just
at
the
individual
and
a
time
step
level.
But
once
you
get
bigger,
you
want
to
use
kind
of
official
convolution
operators
that
will
do
this
and
they'll
scale
much
better.
B
So
interesting
trends
in
cnn's
for
sequence
learning
go
back
about
four
years.
Everyone
was
was
doing
only
RN
ends.
Then
there's
kind
of
the
like
CN
n
trend
was
mostly
kind
of
started
by
there
and
a
few
others
that
kind
of
what
you
used
to
think
you
had
to
do
with
Arnaz.
You
can
actually
do
with
pets
which
is
kind
of
an
interesting
trend
away
from
the
complexity.
I
could
controlling
state
and
all
of
these
things
that
are
inherent
to
Arnett,
and
this
kind
of
raises
a
pretty
big
question
about
most
sequence.
B
Modeling
problems,
which
is:
do
we
actually
need
to
retain
state?
So
we
have
all
these
mechanisms
that
are
reading
and
writing
to
state
reasoning
about
state.
Are
they
actually
necessary
because
commnets
don't
know
about
state,
there's
no
kind
of
notion
of
reading
writing
forgetting
as
we
process
a
sequence,
there's
just
kind
of
a
standard
convolution
operator,
yet
they
perform
well
so
I
think
it
kind
of
raises
an
interesting
question
of
if
RN
ends
our
reasoning
about
state.
Is
that
state
really
necessary?
B
So
an
interesting
thing
about
convolutions
is
that
they
don't
know
about
order
in
particular,
they
don't
know
about.
They
don't
know
about
position,
so
they
are
translation
invariant,
but
that
also
means
that
they
don't
know
kind
of
what
came
before
if
I
apply
a
convolution
operator
to
time,
step
17,
it
doesn't
know
about
time,
step
16,
so
kind
of
a
way
that
people
have
gotten
around
this
and
the
deep
learning
world
is
by
using
what's
called
a
position
encoding
and
a
position
encoding.
B
There's
a
very
simple
idea:
I,
basically
train
and
embedding
as
I'm,
discussing
earlier
kind
of
an
additional
feature.
That
is
an
integer
kind
of
1
through
n.
Where
n
is
my
maximum
sequence,
length
and
I
learn
a
vector
for
every
position.
So,
instead
of
just
having
my
features,
X
I
have
my
features:
X,
plus
a
position
vector
P
and
that
P
vector
is
either
trained
or
fixed,
but
that
that
position,
vector
kind
of
tells
me
where
I
am
in
the
sequence.
I
can
form
a
convolution
operator
where
it
is
of
being
applied
in
this
place.
B
So
the
you're
embedding
that
right.
So
every
single
position
has
an
embedding.
So
there's
no
dynamic
range
I'll
come
up
every
single
vector
every
single
position
has
a
vector
and
that
vector
kind
of
gets
added
to
my
type
set.
So
P
isn't
necessarily
the
number
1
through
50,000
P
is
the
look
the
retrieved
embedding
associated
with
that
position.
B
So
I
have
a
trainable
embedding
per
position
and
the
embeddings
you
can
constrain
to
be
between
negative
1
1,
whatever
you
want,
but
the
the
raw
like
position,
ID
isn't
going
in
yeah,
so
padding
padding
is
a
whole
other
topic
which
we
actually
won't
get
into
today.
But
I
will
just
kind
of
say:
padding
is
not
trivial
for
comments.
How
to
do
it
efficiently
is
also
non-trivial.
B
Oh
yeah,
that's
kind
of
a
there's,
a
highly
architecture
depended
question.
If
I
have
a
computer
architecture
that
confused
the
operators
that
are
required
for
a
GRU
more
efficiently
than
an
Alice
TM
GRU
is
gonna
be
mark,
but
that
depends
on
the
target
architecture
deploying
on
to
full
suite
of
other
things.
First
app
would
be
GRU,
I,
don't
actually
know
so
very
quickly.
How
does
this
fit
with
our
archetypes?
B
So
if
I
need
to
get
a
fixed
length
vector
to
be
able
to
use
this
many-to-one
archetype,
how
do
I
get
a
fixed
length
vector
out
of
my
convolution
outputs?
Unfortunately,
I
need
to
do
a
reduction.
The
idea
is,
since
my
kind
of
elements
that
are
going
into
the
reduction
are
trainable,
the
reduction
should
be
more
informative.
This
is
kind
of
a
hot
topic
in
NLP,
but
a
lot
of
people
who
come
from
a
linguistics
background
hate.
B
In the many-to-many case, you end up using what are called dilated convolutions to obtain the sequential dependence, the fact that time step T can't know about time step T plus one. There's a great GIF from DeepMind, if it'll play, of what a dilated convolution looks like, where you're only looking at the previous time steps when constructing your convolution for the next time step to be decoded.
B
So
this
is
kind
of
what's
used
in
many
to
many,
with
kind
of
a
sequence:
the
sequence
transduction
style
approach
in
the
many
to
many
case,
where
we're
doing
sequence,
labeling
CNN's,
are
a
very,
very
natural,
fit
you're
outputting
a
number
per
time
step.
You
might
need
to
do
some
padding,
but
you're
outputting
kind
of
a
thing
per
step
along
kind
of
my
input
sequence.
B
So
a
summary
of
CNN's
we're
essentially
using
kind
of
filters
without
a
notion
of
state.
We
can
use
something
as
positional
encoding
and
we
have
this
time,
invariant
transformation
that
gets
applied
for
time,
step
of
this
kind
of
convolution
operator,
taking
my
filter,
applying
it
to
my
backers
for
the
many
to
one
use
case,
which
is
unfortunate.
We
do
need
this
ability
to
reduce
the
the
sequence
into
kind
of
a
fixed
length,
vector
so
we'll
be
mean
min
max
summary.
B
You
need
to
have
a
summary
statistic.
That's
kind
of
getting
you
your
fixed
length,
Becker
at
the
end,
so
our
Nan's
cnns.
These
are
kind
of
two
of
the
central
building
blocks
of
kind
of
how
people
build
the
individual
components
inside
these
archetypes
we've
talked
about
the
last
thing.
I
want
to
talk
about.
Pretty
briefly,
is
paying
attention,
so
you've
all
probably
heard
if
you
followed
deep
learning
at
all
about
attention.
B
They
originally
gained
popular
popularity
and
machine
translation
where
you're
kind
of
doing
these
soft
alignments,
where
you're
looking
back
through
and
saying
in
my
source
sentence,
I'm
saying
the
word
dog
in
French
I
need
to
say
she
Anna
do
I
need
to
focus
there.
So
you
have
these
kind
of
very
explicit
kind
of
attention,
components
that
are
looking
at
the
sequence.
B
Self
attention,
which
is
kind
of
a
very
google
thing.
They've
done
a
lot
of
interesting
work
around
transformers
which
we'll
talk
about
next
is
basically
the
idea
that
a
sequence
can
attend
to
itself.
So
as
time
step,
T
I
could
look
at
time.
Steps
1
through
t,
minus
1
and
t
plus
1
through
K
or
an
or
whatever
and
I
can
use
that
to
form
context
about
what
I
should
be
thinking
at
times
at
key
the
idea
behind
self
attention
is
you
don't
need
any
recurrence?
You
don't
need
any
convolutions.
B
You
can
just
kind
of
look
elsewhere
in
the
sequence
and
make
it
like
a
a
judgement
about
what
your
time
step
I
would
ship
up,
which
should
be
if
you're
interested
in
kind
of
this
approach.
This
is
the
classic
paper
has
like
it's
only
two
years
old
now
and
has
like
three
thousand
citations
so
do
take.
A
look
at
attention
is
all
you
need,
because
that
was
kind
of
the
first
kind
of
real
large-scale
application
of
self
attention
so
quickly.
B
it's looking at all the other time steps and making a judgment about itself, and the key idea is that you let sequences do key-value lookups with themselves. As the vector at time step T, I do a key-value lookup across all the other time steps in the sequence (maybe I take a dot product to get a score, something like that), and I construct a summary, once again, for myself, given my relationship with all of the other elements of the sequence.
B
The read heads are an interesting construct: for a model to reason about itself is kind of a discrete thing. An individual vector can only take dot products with other vectors and do one summation with itself, so you need multiple of these to have multi-head attention. It's a jargon-filled term, but the key idea is that it allows the individual elements of your sequence to focus on different things at the same time.
B
So
this
is
the
diagram
of
the
transformer
from
the
paper.
It's
a
mess.
It's
actually
a
pretty
simple
idea,
but
the
diagram
makes
it
look
pretty
complicated
just
to
break
it
down
super
quickly
on
the
left.
You've
got
this
encoder.
All
the
encoder
is
doing.
Is
it's
taking
its
input
sequence,
looking
at
all
the
other
sequence
elements
in
the
sequence
and
then
outputting
a
new
sequence,
where
every
time
stuff
has
just
looked
at
all
of
its
neighbors
and
the
decoder
is
basically
doing
the
same
thing.
B
Why are transformers useful? They model long-term dependencies explicitly, so a model is explicitly able to look back and say: hey, this time step, like 300 time steps ago, was important for my problem. It is near state-of-the-art for pretty much every transduction problem that exists today. It's probably not very useful for many-to-one-style problems, unless you've just got a boatload of compute that you want to throw at something;
B
It
was
designed
for
a
sequence
transmission
at
the
end
of
the
day,
but
it's
a
super
super
powerful
model
so
how
to
pick
building
blocks?
People
will
usually
ask
about
like
what's
the
best
thing
to
choose
for
my
problem.
I
have
no
idea
and
I,
don't
think
anyone
who
tells
you
they
know
usually
has
a
hidden
agenda.
The
secret
is
to
just
test
them.
B
If
you
really
do
want
to
know
what's
best,
there
is
no
free
lunch
at
the
end
of
the
day,
and
certain
problem
characteristics
resonate
well
with
particular
model
architectures
if
you've
got
really
complicated
state
that
you
need
to
reason
about.
An
RNN
might
be
good
if
you
just
need
to
kind
of
pick
an
element
in
the
past
of
your
sequence
and
like
use
that
to
predict
maybe
a
an
attention
based
transformer
is
kind
of
the
right
way
to
go,
but
there
really
is
no
free
lunch.
B
It's
impossible
to
know
what
is
best
so
concluding
remarks
before
we
open
up
the
questions.
Sequence
learning
is
super
flexible,
I,
hope
I
gave
you
a
pretty
high
level
overview
of
what
all
the
building
blocks
are.
That
can
be
fit
together.
What
the
different
archetypes
are.
There
are
four
main
flavors
many-to-one
the
many-to-many,
where
we're
doing
a
sequence
sequence,
the
many-to-many,
where
we're
labeling
every
element
in
the
sequence
and
kind
of
this
one-to-many
unicorn
that
I've
talked
about
that
I'm
looking
for
an
interesting
application.
B
There
are
a
lot
of
building
blocks
to
choose
from
with
more
coming
out
every
day
on
archive
the
base,
fundamental
ones
that
you
might
want
to
know
about.
Our
are
n
ends,
CN,
NS
and
transformers,
and
the
really
important
thing
to
remember
when
you're
building
a
sequence
model
that
there's
no
kind
of
best
approach,
should
I
encode
time
as
a
delta
I
don't
know
try,
it
should
I
use.
B
Idea,
try
them
both.
There
really
is
no
free
lunch.
I
can't
emphasize
that
enough.
A
lot
of
kind
of
blog
posts
that
accompany
papers
very
grandiose
claims
that,
like
this
new
thing,
is
better
than
all
of
the
prior
art
that
came
before
it's
usually
not
true.
They
usually
pick
pretty
poor
baselines,
so
do
kind
of
try
out
everything
for
yourself
and
yeah
remember.
There
is
no
free
lunch
for
sequence
learning,
so
thank
you.