From YouTube: Temporal Memory via RSM-like models - Numenta Research
Description
"Temporal Memory via Recurrent Sparse Memory-like models" - topic from Jeremy Gordon https://twitter.com/onejgordon
Discuss at https://discourse.numenta.org/t/temporal-memory-via-rsm-like-models/6345
Okay, so I'm just going to go through an update on what we've done with RSM, which we covered several weeks ago. Just as a reminder of what the architecture looks like: this is an HTM-inspired recurrent neural network designed for sequence learning tasks. The way it works is we have n groups and n cells per group, just like HTM. Each column, sorry, the authors call them groups, but the groups follow suit with HTM columns: you can think of each column of cells as having shared weights from the proximal input, and this figure shows a digit coming in as that input. Each cell also has recurrent inputs from the entire previous memory state, so the red connections there are from the previous memory state and the blue connections are the feed-forward input, and, in the usual deep-learning way, these are continuous weights.
The real advantage the authors call out is its local credit assignment: there's no backpropagation through time needed to learn connections to previous time steps, because each cell gets direct connections to the previous memory state. That means all of the hidden state of the network has to be encoded in that memory. There's an inhibition, or k-winners, step: they take the top k columns, allow only those columns to activate, and those activations then go through a final layer that produces the image prediction.
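To illustrate the mechanism just described, here is a minimal sketch of one forward step of such a layer. The shapes, names, and the particular way the proximal and recurrent drives are combined are my assumptions, not the authors' code:

```python
# Minimal sketch (not the authors' code): one step of an RSM-like layer with
# m columns x n cells, feed-forward weights shared within a column, per-cell
# recurrent weights from the full previous memory, top-k column inhibition,
# and a max over cells per column feeding the image decoder.
import numpy as np

m, n, d_in, k = 100, 4, 784, 10              # columns, cells per column, input dim, winners
rng = np.random.default_rng(0)
W_ff = rng.normal(0, 0.01, (m, d_in))        # proximal weights, one row per column (shared by its cells)
W_rec = rng.normal(0, 0.01, (m * n, m * n))  # recurrent weights from the full previous memory

def rsm_step(x, memory_prev):
    """x: (d_in,) input image; memory_prev: (m*n,) previous memory state."""
    ff = W_ff @ x                                    # (m,) column-level feed-forward drive
    rec = (W_rec @ memory_prev).reshape(m, n)        # (m, n) per-cell recurrent drive
    act = ff[:, None] * (1.0 + np.maximum(rec, 0))   # one plausible way to combine the two

    col_strength = act.max(axis=1)                   # strongest cell in each column
    winners = np.argsort(col_strength)[-k:]          # top-k column inhibition
    mask = np.zeros(m)
    mask[winners] = 1.0
    act = act * mask[:, None]

    memory = act.reshape(-1)                         # new memory state, (m*n,)
    column_code = act.max(axis=1)                    # max over cells -> column-level code
    return memory, column_code                       # column_code goes to the image decoder
```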
A reminder of what the benchmark looks like: one of the tasks we're testing with is what we're calling stochastic sequence MNIST. The way this works is we define a grammar of subsequences; this is the benchmark grammar we've been using, which the original authors also used. Each row is a subsequence, and you move through a subsequence deterministically, while the next subsequence is chosen at random, and each digit label is then presented as an MNIST image.
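A hypothetical sketch of such a generator might look like the following; the grammar shown here is illustrative only, not the actual benchmark grammar:

```python
# Sketch of a "stochastic sequence MNIST" generator: pick a subsequence from
# the grammar uniformly at random, emit its digits deterministically, and
# render each digit as a randomly chosen MNIST image of that digit.
import random

GRAMMAR = [                       # example grammar; the real benchmark grammar differs
    [0, 1, 2, 3],
    [0, 3, 2, 1],
    [4, 5, 6, 7],
]

def stochastic_digit_stream(mnist_by_digit, rng=random.Random(0)):
    """mnist_by_digit: dict mapping digit -> list of MNIST images of that digit."""
    while True:
        subseq = rng.choice(GRAMMAR)              # uniform random subsequence choice
        for digit in subseq:                      # deterministic within the subsequence
            image = rng.choice(mnist_by_digit[digit])
            yield digit, image
```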
The second task we looked at was the language modeling task on the Penn Treebank dataset, which is about a million words from The Wall Street Journal reduced to a vocabulary of 10,000 words. The goal is to predict the next word based on the entire sequence up to the current word. They're not using any embeddings, because they're interested in seeing whether an RSM-like architecture can develop meaningful representations itself. This is an example of the output, rollouts of predictions from the model, and you can see that it's doing an okay job.
It's something that you might get out of a bigram model as well. So it's able, for example, to predict the words that follow New York Stock Exchange composite. I started with the RSM model, re-implementing the results from the paper. I haven't fully replicated their results yet, and I'm still at an early stage of looking into the details of exactly what the differences are, but I did start to see some results that were interesting enough to adjust the model a little bit and try a few tweaks.
The first thing I was looking at was using two winners per column instead of one, and using k-winners and column boosting instead of what the original RSM model uses. One of the things that was interesting was the column-level representation: the image predictions are decoded from the max-pooled column states, that is, by taking the max over the cells in each column, and that column-level representation is what actually produces the decoded next-image prediction. So what I was hoping to see was differentiation between the different representations of the same input.
So what it's showing right here, 0 to 1 and 0 to 2, is: we've just seen a 0 and we're about to see a 1, or we've just seen a 0 and we're about to see a 2. We need these representations to be differentiated; we need to be able to discriminate between these cases so we can make a correct prediction for the next time step. What these blue boxes on the diagonal are showing is that there is no column-level differentiation: the column representation of each of these, 0 to 1 and 0 to 2, is essentially the same despite the different predictive state.
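A toy numeric illustration of that point, with made-up numbers: two cell-level states that use different cells within the same columns collapse to the same column code after the max.

```python
# Two different cell-level states (e.g. the state after "0 -> 1" vs "0 -> 2")
# that activate different cells in the SAME columns max-pool to identical
# column-level codes, so a decoder reading the column code cannot tell them apart.
import numpy as np

state_0_then_1 = np.array([[1.0, 0.0],    # column 0: cell 0 active
                           [0.0, 0.7]])   # column 1: cell 1 active
state_0_then_2 = np.array([[0.0, 1.0],    # column 0: cell 1 active instead
                           [0.7, 0.0]])   # column 1: cell 0 active instead

print(state_0_then_1.max(axis=1))  # [1.  0.7]
print(state_0_then_2.max(axis=1))  # [1.  0.7]  same column code, different cells
```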
Yeah, we wanted everything nicely separated for sure. Well, the way you separate it, at least the way we do it in temporal memory, is that essentially, instead of single neurons, you have unions of cells that are decoded simultaneously. So you really do end up with multiple predictions based on multiple previous states, and you can just put a classifier, or something of that sort, on the particular subset of cells which represents each case.
Exactly why they did it this way, decoding through the max-pooled column versions, and this was experimental on their part, is still not totally clear to me. I think there was some argument for keeping the parameter count low, but you could decode from the full memory instead, which I think makes more sense. And when they're doing the recurrent iterations they are using the full memory; it's just for the image prediction, which is where the loss metric comes from, that they collapse down the activations. So it's true that one solution to this problem is to do something closer to what HTM does.
If you think about melodies, you have to be able to do this, right? In any melody there are, in some sense, only a few notes or intervals, and every melody is composed of some sequence of these, so you have to get these representations right for all of those transitions. I think the trick is that you don't want to remember every subsequence of all time; it's more like the things that repeat a lot.
I mean, the RSM style is this max pooling over the columns. So I was interested in trying to encourage representations that were more differentiated by the predictive state, that is, by what they're predicting next. So I was looking into a flat version, as well as a flat partitioned version, which is what this picture shows: there's no weight sharing, so every cell has a unique weight matrix from the input and a unique weight matrix from the previous recurrent state. They're all taking the feed-forward input, and potentially this seems like it could be a more flexible memory, allowing more differentiation. So we can have some cells that are sharing the predictive state; it's a bit hard to see there, but, for example, the blue cells indicate that the current digit is an eight, and this might allow us to have the same cell representations in the next predictive state and therefore share between sequence items where that's useful.
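A rough sketch of the flat variant as I understand it, with the shapes, names, and k value assumed for illustration:

```python
# Sketch of the "flat" variant described above: no weight sharing within a
# column; every cell gets its own feed-forward weight vector and its own
# recurrent weight vector from the previous memory state, with a flat k-winners.
import numpy as np

n_cells, d_in = 400, 784
rng = np.random.default_rng(0)
W_ff = rng.normal(0, 0.01, (n_cells, d_in))      # unique per-cell input weights
W_rec = rng.normal(0, 0.01, (n_cells, n_cells))  # unique per-cell recurrent weights

def flat_step(x, memory_prev, k=20):
    drive = W_ff @ x + W_rec @ memory_prev       # every cell sees input plus prior state
    winners = np.argsort(drive)[-k:]             # flat k-winners, no column grouping
    memory = np.zeros(n_cells)
    memory[winners] = np.maximum(drive[winners], 0.0)
    return memory
```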
This performs the best on the stochastic MNIST benchmark. On the dataset I was showing earlier, with eight subsequences of nine digits each, the theoretical accuracy limit is 87.26 percent; that's the best you can do if you're making exactly the right prediction at every time step. The flat model got to 74 percent after about twenty thousand mini-batches, and the flat partitioned model got to about 86 percent after just two thousand, so it learns extremely quickly and it gets much closer to the theoretical limit.
Actually, the sequences used in the paper were fixed, whereas these are stochastically generated sequences. They were just putting in 0-1-2-3 and 0-3-2-1 over and over again, without any uniform random selection of the next subsequence, and that's a much, much easier task. Once you add this stochasticity of uniformly selecting the next subsequence, the original model performed much worse.
So that's why I was doing this kind of architecture exploration: the goal was to try to improve that performance, and that did reasonably well. I was pretty happy with those results, though there is the limitation that the benchmark we were using is pretty much just second-order sequences. There are only a couple of digits that require looking three time steps back to predict the next digit, so it might not be quite a difficult enough benchmark. That's also why I'd like to experiment more with the language modeling task.
For stochastic MNIST, I think we should point out that we're not actually sending in symbols, we're sending in MNIST images. Yes, so in order to actually do this, it essentially has to solve MNIST as well, and that's something that the traditional temporal memory could not do. Yes, it actually is generalizing from a training set to a test set here, while learning the sequences.
You look at the grammar and you figure out what the best thing you could predict is at every time step. It's always impossible to predict the first item of a subsequence, because you don't know which one has been uniformly chosen, but you can predict, for example, in this particular grammar, you can always predict a two and you'll get it right three out of eight times, or whatever it is, yeah, three out of eight. The next digit is going to be similarly difficult to predict, and then after that it becomes deterministic, so you can always get the remaining items correct.
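To make that reasoning concrete, a hypothetical helper like the one below computes that kind of best-achievable accuracy: at each position the optimal guess is the most frequent next digit among the subsequences consistent with what has been seen since the last boundary. The grammar passed in is illustrative only.

```python
# Best achievable next-step accuracy for a grammar of equal-length subsequences
# where the next subsequence is chosen uniformly at random.
from collections import Counter

def theoretical_limit(grammar):
    L = len(grammar[0])                       # assume equal-length subsequences
    correct = 0.0
    for seq in grammar:                       # each chosen with probability 1/len(grammar)
        for t in range(L):
            prefix = tuple(seq[:t])
            nexts = Counter(s[t] for s in grammar if tuple(s[:t]) == prefix)
            best_digit, _ = nexts.most_common(1)[0]
            correct += (seq[t] == best_digit)
    return correct / (len(grammar) * L)

print(theoretical_limit([[0, 1, 2, 3], [0, 3, 2, 1], [4, 5, 6, 7]]))
```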
So language modeling is much harder, and in the original RSM paper they noted that it doesn't get anywhere close to the state of the art from things like LSTMs. Language modeling results are often reported in perplexity, where lower perplexity is better; the best possible score is when you're predicting the next word in the vocabulary with one hundred percent confidence and not predicting anything else, putting everything else at zero.
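As a small reference sketch of the metric, using its standard definition rather than anything specific to this model: perplexity is the exponentiated average negative log probability the model assigns to each actual next word, so a perfect point prediction at every step gives a perplexity of one.

```python
# Perplexity from the probabilities a model assigned to the true next words.
import math

def perplexity(probs_of_true_words):
    """probs_of_true_words: model probability assigned to each actual next word."""
    nll = [-math.log(p) for p in probs_of_true_words]
    return math.exp(sum(nll) / len(nll))

print(perplexity([1.0, 1.0, 1.0]))    # 1.0, the best possible score
print(perplexity([0.1, 0.01, 0.05]))  # higher (worse) perplexity
```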
So it's like a point distribution. The percentage there is the next-word prediction accuracy: they got 20.6 percent in their original paper, and the flat partitioned model gets to 23 percent, and maybe around 150 perplexity, after far fewer mini-batches. So there's still something interesting and potentially useful happening there, but I have some speculations about what it's actually doing. As I said, these are sort of disappointing results overall for language modeling, so I think either this is a fundamental limitation of this architecture, or there are some things that we're not doing as well as we should be here, and I have some thoughts on that.

In general, what the RSM and related models tend to do on language modeling is overfit to the training set very quickly, so regularizing, and figuring out how to generalize to unseen sequences, is the hard part. There are a few things the RSM authors suggested for doing this, and I think there's probably more regularization we can look at. The concern I have about the partitioned model is that it's possible for the recurrent layer to essentially be a pass-through for the previous feed-forward input, which would let it do classification with just t minus 1 and t equals 0. For the stochastic MNIST task that's probably good enough, because these are just second-order sequences, so you have the last digit and the current digit.
From those you can do a pretty good job of predicting the next digit. If we test with third- and fourth-order grammars, I think we'll be able to push on it and see how well it does. That's my explanation for why it learns so incredibly quickly and gets so close to the theoretical limit: it's essentially getting direct access to the previous input. And that would be a problem, because it wouldn't generalize very well, because it's not really using the recurrent state of the network as a long-term memory.
It's just using it as a pass-through. If that's true, then in language modeling this becomes essentially an n-gram model where we're predicting the next word based only on the previous couple of words, and I haven't tested this explicitly, but I think we would get similar perplexity if we just tested a bigram model. So my guess is that there are some adjustments we need to make to force the recurrent inputs to not just be a pass-through, and there are a couple of ways I was thinking of doing that.
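That bigram comparison has not actually been run; a trivial baseline of the kind being described, sketched here with a hypothetical helper and toy data, would just count word-to-next-word transitions and guess the most frequent successor.

```python
# Minimal bigram baseline: count word -> next-word transitions on the training
# text and predict the most frequent successor of the current word.
from collections import Counter, defaultdict

def train_bigram(tokens):
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

counts = train_bigram("the cat sat on the mat the cat ran".split())
print(predict_next(counts, "the"))   # -> "cat"
```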
So this seems like a fundamental problem that I think active dendrites are supposed to help us solve, because with active dendrites, as soon as a predictive state is confirmed, we can deactivate those cells; that's my understanding of the active dendrite model. So I think there's an opportunity to add active dendrites to this model as well, and I've thought of a couple of ways we could potentially do that, for example by decoding the actual image that comes in and then subtracting that from the memory state, essentially saying these cells were successful at predicting this image.
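A speculative sketch of that idea follows; all names, shapes, and the threshold are assumptions for illustration, not something from the talk.

```python
# Decode the prediction each memory cell contributed, compare with the image
# that actually arrived, and zero out the cells whose prediction was confirmed,
# a rough stand-in for active-dendrite style "predicted, therefore deactivate".
import numpy as np

def confirm_and_deactivate(memory, W_decode, actual_image, threshold=0.1):
    """memory: (n_cells,); W_decode: (d_img, n_cells); actual_image: (d_img,)."""
    per_cell_pred = W_decode * memory            # each column is one cell's contribution
    alignment = per_cell_pred.T @ actual_image   # (n_cells,) agreement with the real image
    confirmed = alignment > threshold            # cells whose prediction was confirmed
    memory = memory.copy()
    memory[confirmed] = 0.0                      # deactivate successfully-predicting cells
    return memory
```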
Yeah, that's the potential of active dendrites. I think stacking into a hierarchy is something that the RSM model is designed for. Essentially, if you have a second layer on top of a previous layer, it's doing next-memory-state prediction of the layer below, whereas the final layer, the layer at the bottom, is doing next-image prediction. So this was designed for that kind of hierarchy, and I think the original authors mentioned that they were finding that the higher layers were just learning the exact same sequences. I think there are ways we can get around that as well, potentially by modifying the hyperparameters for the higher layers, maybe adjusting the decay and inhibition rates to encourage more invariant representations. So I think this is part of the solution to some of the repetition, overfitting, and other issues that we're having. But stacking is not trivial, so there's some work that will need to go into that.
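As a sketch of how such a stack might be wired, building on the single-layer step sketched earlier; this wiring is my assumption about the intent, where the top layer treats the bottom layer's memory as its feed-forward input and is trained against the bottom layer's next memory state, while the bottom layer is trained against the next image.

```python
# Speculative two-layer wiring (assumed names; layer1_step and layer2_step would
# be functions like the rsm_step sketch above, each returning (memory, column_code)).
def two_layer_step(image, mem1_prev, mem2_prev, layer1_step, layer2_step):
    mem1, code1 = layer1_step(image, mem1_prev)   # bottom layer: trained on next-image prediction
    mem2, code2 = layer2_step(mem1, mem2_prev)    # top layer: trained to predict layer 1's next memory
    return mem1, mem2, code1, code2
```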
And then, of course, the encodings that we're using are fully non-semantic. It's a binary encoding, so there's no notion of similarity between similar words, and these models would presumably start to do much better with a decent embedding, so I think that's worth trying. The reason they didn't use embeddings is that they wanted to see whether or not representations could be learned from scratch, but if you want to see how this compares to the state of the art, we have to use embeddings.
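A tiny illustration of the difference, with toy vectors and made-up numbers: one-hot codes carry no similarity between related words, while dense embeddings can.

```python
# Non-semantic one-hot encoding vs a dense embedding lookup.
import numpy as np

vocab = {"stock": 0, "share": 1, "banana": 2}
one_hot = np.eye(len(vocab))                        # current binary encoding
embeddings = np.array([[0.9, 0.1],                  # e.g. pretrained dense vectors
                       [0.85, 0.15],
                       [0.0, 1.0]])

print(one_hot[vocab["stock"]] @ one_hot[vocab["share"]])        # 0.0, no similarity at all
print(embeddings[vocab["stock"]] @ embeddings[vocab["share"]])  # > 0, semantically close
```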
Yeah, so I think we've now got a much better feel for the different aspects of all of these variants, partitioning and non-partitioning. We can also bring it back a little bit closer to the temporal memory model, and active dendrites are a step towards that, particularly in the way that inhibition happens and the way we're carrying over state, which we're not even sure can actually do more than second order, or...
That could just be because of something that we're doing wrong with the way the partitioning or the flat models are working. It's like a local minimum where it's just using the feed-forward state, and I think we can get beyond that local minimum by enforcing more reliance on the recurrent input.