From YouTube: Investigation of Recurrent Sequence Memory
Description
Paper review of "Learning distant cause and effect using only local and immediate credit assignment" (https://arxiv.org/abs/1905.11589)
Discuss at https://discourse.numenta.org/t/paper-review-of-recurrent-sequence-memory/6357
They're probably referring to one specific blog post — the one about how to build an artificial general intelligence. That might sound a little crazy to you, but in it they really do try to lay out what the problems are that the company wants to see solved. They really laid that out, and that's what they refer to.
Yeah, they were in touch with us to get feedback and to explore potential collaborations as well. So that's something we decided: we'd take a look at the paper, see how we might interact before doing this here, and think about that for the future.
So what they were trying to do with recurrent sequence memory was design a biologically feasible learning algorithm that allows every stage to do its own learning, to overcome some of the limitations of existing sequence learning, and to do so at much larger scale.
Remember the requirements — the constraints, the rules that they set out for themselves — I put them here. So, local credit assignment: they define that by limiting backpropagation to at most two layers. I think they justified this by talking about the potential of active dendrites, which are sort of like networks in neurons, in and of themselves.
An autoencoder is one that takes an input — like an MNIST digit — and forms a summary representation that goes through a bottleneck, then produces an output that's a decoded prediction of the same input. It tries to predict itself. We have a loss as we compare the difference between the two, so we optimize the model to reduce that loss, so that we're generating high-fidelity versions of the input through a smaller latent space.
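The autoencoder loop described above can be sketched like this (a minimal toy version — the shapes, names, and random initialization are illustrative assumptions, not the paper's code): encode through a bottleneck, decode, and score reconstruction loss.

```python
import random

random.seed(0)

def matvec(W, x):
    # Multiply weight matrix W (rows = output units) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(a, b):
    # Mean squared error between prediction and target.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

input_dim, latent_dim = 8, 2  # bottleneck: 8 -> 2 -> 8
W_enc = [[random.uniform(-0.1, 0.1) for _ in range(input_dim)] for _ in range(latent_dim)]
W_dec = [[random.uniform(-0.1, 0.1) for _ in range(latent_dim)] for _ in range((input_dim))]

x = [random.random() for _ in range(input_dim)]  # stand-in for an input digit
z = matvec(W_enc, x)       # summary representation (latent code)
x_hat = matvec(W_dec, z)   # decoded prediction of the same input
loss = mse(x_hat, x)       # the quantity gradient descent would reduce
print(len(z), len(x_hat))  # 2 8
```

In a real run you would update `W_enc` and `W_dec` by gradient descent on `loss`; the point here is just the encode-bottleneck-decode-compare shape.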
They were actually trying to take the work from HTM, bring it into the deep learning world, and demonstrate it on new problem types. So next they reference the work on Reber grammars, maybe 2016. They don't actually reproduce that themselves, but they have a couple of other demonstration tasks they want to show this working on. Okay.
So they illuminate a couple of issues that they see with this kind of architecture, which is inspired by HTM models. Yeah, so it's generally set up as a predictive autoencoder: we're trying to do next-item prediction. And then, in terms of the architecture, what they have is columns and cells.
That's going to be very familiar. They call these groups, but these are just mini-columns. So we have a proximal weight matrix — the proximal dendrites coming down, taking the input — and they're shared for all cells in the column, just like in HTM. And then we also have, in this case, a single active dendrite on each cell that takes input from the previous memory state. So this is the memory: the kind of activity in the network at t minus one.
For the current memory state, they want to spread this inhibition matrix — and that's everything. So unlike HTM, where each cell has a separate kind of predictive state, they do everything in a single matrix of activity. It's a little bit different in that respect, and that has some implications that I think are important to discuss.
Exactly, yeah — everything it's looking at here they do specify; they give a specification of, I'm guessing, most of the parameters, so we assume some of these too. And then we have this inhibition matrix. You can think of it this way: for every cell we have an inhibition state. This is inspired by the refractory period of a cell — cells don't fire for a certain period of time after firing. So we keep track of when each cell fired most recently, and we decay that matrix over time.
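That refractory bookkeeping can be sketched like this (a toy version; the decay constant and reset-to-one rule are my assumptions, not the paper's exact equations): cells that just fired get full inhibition, and everyone's inhibition decays toward zero each step.

```python
DECAY = 0.5  # assumed per-step decay; the paper's constant may differ

def step_inhibition(inhibition, fired):
    # Decay all inhibition values, then reset cells that just fired to full.
    return [1.0 if f else DECAY * h for h, f in zip(inhibition, fired)]

inhibition = [0.0, 0.0, 0.0]
inhibition = step_inhibition(inhibition, [True, False, False])   # cell 0 fires
inhibition = step_inhibition(inhibition, [False, False, False])  # nobody fires
print(inhibition)  # [0.5, 0.0, 0.0] -> cell 0 is still partially inhibited
```

Subtracting (or gating by) this inhibition before selecting winners is what keeps a recently fired cell from dominating every step.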
Okay, so yes, we inhibit the result of this sum, and then we do k-winners. This produces masks — one that's column-wise, one that's cell-wise. It's going to pick a single cell for each of the winning columns, based on the activity of the cells, sort of like max pooling. So let's say k equals two: we pick two columns, and then one cell per column, so we have this cell and this cell come out.
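The two-stage selection described above can be sketched as follows (hypothetical shapes; the exact k, scoring, and tie-breaking are assumptions): first rank columns by their summed, inhibited activity and keep the top k, then pick the single most active cell within each winning column.

```python
def k_winner_columns(activity, k):
    # activity: list of columns, each a list of per-cell activations.
    # Stage 1 (column-wise mask): rank columns by total activity, keep top k.
    col_scores = [sum(cells) for cells in activity]
    winners = sorted(range(len(activity)),
                     key=lambda c: col_scores[c], reverse=True)[:k]
    # Stage 2 (cell-wise mask): best single cell per winning column (max pooling).
    return {c: max(range(len(activity[c])), key=lambda i: activity[c][i])
            for c in winners}

activity = [
    [0.1, 0.9],  # column 0
    [0.2, 0.1],  # column 1
    [0.8, 0.7],  # column 2
]
result = k_winner_columns(activity, k=2)
print(result)  # columns 2 and 0 win; their best cells are 0 and 1
```

With k=2, columns 2 and 0 have the highest totals, and within them cell 0 (0.8) and cell 1 (0.9) are the single winners.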
It's a very similar representation scheme. I think there's one major difference, which maybe I'll get to in a second, because I think that's the thing to talk about — the weights. But does everything make sense up to here, at a high level, about what this is doing? And then we generate outputs. The primary output is going to be the memory state: this kind of activity is going to become the new memory for the next time step. It's going to be decayed as well — there's an optional decay term.
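The memory update could look something like the following (the exact decay form is my assumption; as noted above, the decay term is described as optional): the winning activity becomes the new memory, with the old memory fading out.

```python
MEMORY_DECAY = 0.9  # assumed constant; the decay term is optional

def update_memory(memory, activity, decay=MEMORY_DECAY):
    # New memory = decayed old memory plus the current winning activity.
    return [decay * m + a for m, a in zip(memory, activity)]

memory = [0.0, 1.0, 0.0]
memory = update_memory(memory, [1.0, 0.0, 0.0])
print(memory)  # [1.0, 0.9, 0.0] -> new activity is fresh, old activity fades
```

This decayed memory is what the active dendrites read at the next time step, so older activity contributes less and less.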
It's not — it's not just this one sequence. Because if I, let's say, trained it on 20 different 12-digit sequences, with multiple sequences that begin with zero, then my prediction for the first digit would be ambiguous. Whereas if I only train on this one sequence — that's it — then after the first zero I can predict the one very accurately.
So that's why I asked where the 97 percent comes from. Because if you were training on multiple sequences, which I assumed was the case, then you would have to either make that measurement at the end, or somewhere far into the sequence — or, like you're saying, they're just training on this one 12-digit sequence, that's it. Yes — and then, therefore, you could measure the prediction.
Because this classifier is trained on the hidden memory state, they can't just pass in a full image. Okay, well, what they do do is take this memory state and train two classifiers on it: once for the current image and once for the next image. And they show that the same exact memory state is able to capture both — the representation of the memory state encodes information for both the current item and the next item. Yes.
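That probing idea can be sketched with a toy stand-in (a nearest-centroid classifier here; the paper's actual read-out model and these memory vectors are hypothetical): fit two read-outs on the same frozen memory vectors, one labeled with the current item and one with the next item.

```python
def centroids(vectors, labels):
    # Average the memory vectors that share a label.
    sums, counts = {}, {}
    for v, y in zip(vectors, labels):
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + vi for s, vi in zip(sums.get(y, [0.0] * len(v)), v)]
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def classify(cents, v):
    # Predict the label of the nearest centroid (squared distance).
    return min(cents, key=lambda y: sum((c - vi) ** 2
                                        for c, vi in zip(cents[y], v)))

# Hypothetical memory states recorded while the network watched "0, 1, 0, 1".
memories = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
current = [0, 1, 0, 1]  # item shown at each step
nxt = [1, 0, 1, 0]      # item about to be shown

probe_now = centroids(memories, current)
probe_next = centroids(memories, nxt)
# The SAME memory vector supports both read-outs:
print(classify(probe_now, [0.95, 0.05]), classify(probe_next, [0.95, 0.05]))
```

One vector, two correct answers: the "current item" probe says 0 and the "next item" probe says 1, which is the sense in which the memory state encodes both.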
They attach the memory to a DQN, and it allows it to work on the problem. But the more interesting and, I think, more useful benchmark they're working on is this Penn Treebank dataset, in which you have a million words and on the order of a 10,000-word vocabulary. You have to do next-word prediction: you get a sentence, and for every word you're trying to predict the next word that's going to come in.
They're training up an embedding from scratch — they're building their own language model, or embedding model, based on the corpus. So this is the embedding matrix, and it's trained based on, I guess, the traditional predict-the-next-word style of training from context. Okay, so you're going to get these distributed vectors for each word, and you're going to pass that into the network, and they produce predictions from that. But that's a question that I had for them that I want to follow up on.
It could be that they're using some pre-trained embedding — that's something to follow up on; they don't mention it. Because I have not perfectly replicated the results from the language modeling yet: the MNIST looks really good, the language modeling doesn't. Okay, so just quickly — we already discussed most of this in terms of comparisons with HTM, or the things that are going to be different. So, continuous.
They have this refractory inhibition, which is parallel to what we do with boosting. Their architecture is explicitly generative — I think we've done some classification, and some generative work as well, I believe, but that's not explicitly built into the network. Whereas they are constantly producing a predicted image for the next time step: even though that's not the only internal representation there, the network is trained on this loss function that is trying to produce images.
We're talking about a different thing now, I think. So I believe that this architecture allows which columns get activated to be influenced by the recurrent input, in a way that HTM doesn't — correct me if I'm wrong. We are summing the two activations and then picking winners from that. So with a really strong recurrent input, we can actually activate columns that have no connections to the input at time zero.
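That point can be made concrete with a toy sum-then-select (the numbers are hypothetical): a column receiving zero feedforward input can still win if its recurrent drive is strong enough.

```python
def winning_columns(feedforward, recurrent, k):
    # Sum the two drives per column, then take the top-k columns.
    total = [f + r for f, r in zip(feedforward, recurrent)]
    top = sorted(range(len(total)), key=lambda c: total[c], reverse=True)[:k]
    return sorted(top)

feedforward = [0.9, 0.8, 0.0]  # column 2 gets NO feedforward input this step
recurrent = [0.0, 0.0, 1.5]    # ...but strong recurrent drive
print(winning_columns(feedforward, recurrent, k=2))  # [0, 2]: column 2 wins anyway
```

In HTM's temporal memory, by contrast, the spatial pooler picks columns from feedforward input alone and the recurrent (distal) input only biases which cells within those columns fire.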
I won't argue — there are some advantages. But I also think you lose a lot for that too. I mean, I think there's a real advantage in separating out these two problems, because sometimes they're just different. That's something we never really — we didn't address it in the temporal memory algorithm, but we think about it a lot in terms of the columns architecture, where those active predictions come in. And here they're saying, hey, we're doing it in one. No, it wasn't that.
They're tied together — so that's why I was saying that the predictive influence can actually produce column-level activity. Yes, it can. And what they're doing is actually decoding the recurrent, column-level activations to produce the prediction again. So how many columns do they have in this? Usually 200 to 600: 200 for MNIST and 600 for language.
D
Using
training,
they're
betting,
but
it's
initially
it's
just
a
vector
of
ID's,
so
Michael
we're
going
high.
So
I'm,
not
my
you
can't
pre-trained
your
met
him
we're
just
training
it
back,
it's
fresh!
So
it
comes
in
yes
morning,
planner,
so
I'm
using
PI
torches
embedding
module,
which
I
think
is
the
same
yeah
I.
Don't
they
don't
specify
with
their
universally
no
need
I?
Suppose,
there's
a
possibility
there
not
doing
any.
Inventing
at
all,
it
seems
unlikely
right
if
they
were
just
using
one.
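The embedding stage being described can be sketched like this (a pure-Python stand-in for a PyTorch `nn.Embedding`-style trainable table; the sizes and names are assumptions): word IDs index rows of a freshly initialized matrix that would be trained along with the rest of the network, with no pre-training.

```python
import random

random.seed(0)

VOCAB_SIZE, EMBED_DIM = 10_000, 100  # assumed sizes, in the ballpark discussed

# Freshly initialized table: one trainable row per word ID (no pre-training).
table = [[random.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
         for _ in range(VOCAB_SIZE)]

def embed(word_ids):
    # Lookup: each ID selects its row; in training, gradients
    # would flow back into exactly those rows.
    return [table[i] for i in word_ids]

sentence = [42, 7, 999]              # token IDs from the corpus
vectors = embed(sentence)
print(len(vectors), len(vectors[0]))  # 3 100
```

In PyTorch this whole block collapses to `torch.nn.Embedding(10_000, 100)` applied to a tensor of IDs; the point is that the distributed vectors start random and are learned from context, not loaded from a pre-trained model.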
The major problem that they reference in the paper is that it's very hard for this kind of architecture to generalize. So they do really well memorizing Penn Treebank — this huge dataset — and they can remember these huge sequences: I think for Penn Treebank their next-word prediction accuracy gets up to 50% on the training set. That demonstrates the immense capacity of the memory, but it's overfitting like crazy, and so once you give it a test set, it doesn't have any idea what to do with it.
I think you really need to have multiple sequences with, you know, high-order confusion in between, to be able to do things like start in the middle — which is what we were trying to get at. And I think Mark's point here is great: with this one, just training on one sequence, you could feed it almost anything and it could just replay the one, the single sequence it knows, and completely ignore the input. That sort of ties to your previous point — the recurrent input can drive the columnar activity.
It's true, yeah — it's not a very naturalistic training set, is it? But maybe that's just for debugging.
Yeah, so the thing they're trying to solve now — or that they mentioned they want to figure out how to deal with — is exposure to novel words and novel sequences, which this does very poorly on because of the overfitting problem. And so they were talking to us on this call about attentional structures and things like that, which might allow looking farther back in time.
Obviously, these are patterns from a pre-sorted corpus, order-dependent, so you're predicting another one or another two. But still, the idea that what's coming in is always somewhat novel — we always had some issues in that regard, which is, you know, noise. Noise would trip up the temporal memory, so we had to rely on some pooling layer or some other artifice to bridge across noise, right.
I like how this uses learned sequences to then essentially put labels on what's being sensed. They can learn the MNIST digits by first learning a sequence and then seeing various versions of that sequence; now that it has learned the sequence, it can label these zeros and ones and such. For us, that's almost like running our temporal memory before training our spatial pooler.
If that's a good example — like I just said, I might write a seven in two different ways, right? And our spatial pooler in those situations might classify them differently, so it would learn those as two separate sequences. Depending which way you go, it might force those two categories together.
Maybe an easy one would be — I don't know — different coffee cups that are slightly different from each other, or the same coffee cup under different lighting. You're getting these different sensory inputs for the same object from the same location, the same viewing location, and you're learning.
We wanted to jump to a totally different one — I always thought about sequences of sequences, which is a formalism we don't have. Yeah, and so we never had that; it always bothered me. I put that in as a requirement early on, but I felt like the solution became clear through displacements: because displacements, which we're learning as a sequence of displacements on the object, are the answer to that problem. So we're not only learning sequences of sensory input — now the elements are probably learned as sequences of displacements — but we haven't really developed that.
Like, what if I write a seven in two ways, or a five, and so on? You can literally just tell it: you can say sixes and nines are the same and train the system on one or two digits, right? Now, the question that came up — it was curious to me — okay, so should the system figure out that sixes and nines represent the same thing?
It's like a rivalry in the predictions — they cross-inhibit each other, because we do have sub-representations of a six: there are two kinds of sixes, the nine kind of six and the six kind of six, and then we can make predictions for both. But maybe something stochastic in the network produces a very nice six or a very nice nine each time, because we've never seen the eight before.
They mentioned this on the call — yeah, and stacking: it doesn't necessarily work. It was designed to be stacked, where you take the memory outputs and feed them into the next layer, but what they saw in experiments was that the higher layers just learned the same sequences all over again. Yeah, that's something I think we'd eventually play with as well. I think there are better ways to tune parameters for the higher networks — to maybe encourage invariant structures, maybe with less inhibition at higher levels.
Say there's a single transition from a zero to a one at this level; then there's another level on top — maybe it's got the zero-to-one, and maybe it's predicting that because it goes to a two, then to a three. But still, with just two layers of hierarchy you only get, like, three transitions. You can't get a really long sequence where a higher level stays active for a while and then transitions, representing the really entire sequence. It's simply a hard problem.