Description
In this meeting Subutai discusses three recent papers and models (OML, ANML, and Supermasks) on continual learning. The models exploit sparsity, gating, and sparse sub-networks to achieve impressive results on some standard benchmarks. We discuss some of the relationships to HTM theory and neuroscience.
Papers discussed:
1. Meta-Learning Representations for Continual Learning (http://arxiv.org/abs/1905.12588)
2. Learning to Continually Learn (http://arxiv.org/abs/2002.09571)
3. Supermasks in Superposition (http://arxiv.org/abs/2006.14769)
B: So what I thought I'd talk about today is these three topics. The focus is on continuous learning, and we'll look at several papers. We're not going to go through each one in any detail, but there are some underlying concepts in each one which are quite interesting and relevant. So I'll spend a couple of slides on each one. The papers would be, sort of, MAML, which we already talked about.
B: That's kind of background — MAML is used in the OML paper and the ANML paper, and, if you remember, there's a variant of MAML called Reptile, so there's this whole animal naming theme going on here. And then there's also this new paper on something called Supermasks in Superposition.
B: So we'll talk about a bunch of those. Before we do that, I thought I'd just review a little bit about continuous learning in HTMs and temporal memory. I'm not going to go into this in detail, but the way we often talk about how inference works in HTMs and how continuous learning works uses a particular type of language.
B: These then cause predictions in specific cells in other mini-columns, as shown in the figure on the right here. The way this prediction happens is that you have connections from some of these cells in these active mini-columns onto one dendrite on this red cell, and another dendrite on this red cell, and so on. So when this red cell detects a pattern on one of its dendrites, it becomes depolarized — the predicted state. That's what's shown with these red dots.
B: If these correspond to the B mini-columns, then we say B is predicted, and then, when B actually comes in, this cell — because it was depolarized — will win, and the rest of the cells in this mini-column won't become active. Let me just turn this on here: okay. And if there was a mistake — if there wasn't a correct prediction here, so if these cells were not depolarized — then all of these cells would become active.
B: We'd pick a winner cell and then we'd learn the previous pattern on a specific dendrite on those cells. Okay, so these dendrites bias these cells to win, and if there's no winner, then we pick a random one and learn patterns on that. So this is the language we typically use to describe it. I'm going to show a slightly different way of presenting the exact same thing. One thing to note is that, after learning, in reality this network is very densely, recurrently connected, and we never really show this in our pictures.
B: Once we've learned a whole bunch of sequences, there are tons of connections going back and forth between these cells, so it's a very densely connected recurrent network. What's kind of going on is that when you're showing a particular task or a sequence, you're instantiating a very, very sparse sub-network that's embedded in this densely connected recurrent network. So again, we don't typically talk about it this way.
B: But what's happening is that these dendrites are actually choosing which sub-network to instantiate at any given point. So if you think about this being a full, really densely connected network: for a particular sequence, you instantiate a particular subset — this guy, this guy, this guy, connected to a previous set of cells — and that's for this one transition in this one sequence. The specific dendrites that become active, by biasing the cell and determining who wins, are actually instantiating this particular sub-network.
B: Okay, so that's, in some ways, what a dendrite is doing: it's choosing a sub-network. Of course, we have very sparse distributed representations, so that avoids significant overlap between the tasks. Even if a few of these cells might overlap with other tasks, or a few portions of the sub-network might, the sub-network as a whole, represented by an SDR, because it's super sparse,
B: it avoids significant overlap between tasks, and that's how you can learn new things and instantiate new things without getting confused with other stuff. And then we talk about the weights being binary, and the permanences essentially choose which weights to make active for a particular task or sequence, via learning. Marcus has drawn this connection before, with his variational inference and variational dropout work — so the permanences, during learning, choose which sub-network is going to be part of a particular sequence or task. Okay.
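The claim that super-sparse representations rarely collide is easy to check numerically. Below is a standalone sketch (the sizes are illustrative, not the actual HTM parameters): two independent random SDRs with 40 active bits out of 2048 overlap by well under one bit on average.

```python
import numpy as np

rng = np.random.default_rng(0)
n, w = 2048, 40  # 40 active bits out of 2048 (~2% sparsity)

def random_sdr():
    # A sparse distributed representation: w randomly chosen active bits.
    v = np.zeros(n, dtype=bool)
    v[rng.choice(n, size=w, replace=False)] = True
    return v

# Expected overlap between two independent SDRs is w*w/n, under one bit here.
overlaps = [np.count_nonzero(random_sdr() & random_sdr()) for _ in range(1000)]
mean_overlap = np.mean(overlaps)
```

With these numbers the expected overlap is 40·40/2048 ≈ 0.78 bits, which is why a sub-network chosen by an SDR barely interferes with the sub-networks of other tasks.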
B: So this is kind of background. I'm saying the exact same thing — this is no change in temporal memory, I'm just explaining it in a slightly different language, and you'll see the point of this as I describe some of these papers. Any questions on that? Is this all already obvious to people, or is any of this new?
D: [inaudible]
B: In our algorithm, it's pretty random. In the neocortex, it might not be truly random — each cell is not going to have exactly the same amount of depolarization, so there will be some randomness in there, something that will choose it — but in our algorithm it's completely random. And it's actually good that it's random: that way you can use the full representational space.
B: Okay, so here are the various papers that I'll focus on now. They all came out in rapid succession over the last 12 months or so, and it's sort of a fascinating constellation of papers on continuous learning.
B: What I'll do is go through them in this sequence. I'll first talk about this one, which is the OML paper; then I'll do the ANML paper, which is down here; and then, as a third one, I'll talk about Supermasks in Superposition. For MAML, I'll just briefly show my drawings from when we talked about it a couple of weeks ago — MAML is used in OML and ANML, so it might be useful just to remember what it is — but the core is these three papers.
B: Okay. In the original MAML paper, they described "good" as being how quickly you can learn each task. The idea is that, through this meta-learning process, you're going to find some optimal set of weights — basically an initialization for the network — and from that set of weights it's going to be primed to be really good at this big task that you've defined. The basic terminology and loop was: there's a network with weights theta, and that's the network you're trying to optimize. You initialize theta with some random values, and then you go through this loop where you pick a task out of your collection of tasks, and then for each task you perform the task.
B: In her case, it was training on k samples. Then, for each task, you're going to get a different network — this theta updated by the gradient steps for that task — and then what you do is update the original theta to minimize the loss of each of these individual networks, here using a test set.
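The loop described above can be sketched in a few lines. This is not the paper's code — it's a minimal first-order approximation (closer to FOMAML, ignoring the second-order term), on a toy family of regression tasks with a single scalar weight, just to make the inner/outer structure concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

def task_batch(a, n=10):
    # A toy task: linear regression y = a*x with a task-specific slope a.
    x = rng.uniform(-1, 1, n)
    return x, a * x

def loss_and_grad(w, x, y):
    err = w * x - y
    return np.mean(err**2), np.mean(2 * err * x)

theta = 0.0                  # meta-learned initialization
inner_lr, outer_lr = 0.1, 0.05

for step in range(500):
    meta_grad = 0.0
    for a in rng.uniform(1.0, 3.0, 4):       # sample 4 tasks per meta-step
        x, y = task_batch(a)
        _, g = loss_and_grad(theta, x, y)
        w_task = theta - inner_lr * g        # inner adaptation step
        x2, y2 = task_batch(a)               # held-out data for the meta-loss
        _, g2 = loss_and_grad(w_task, x2, y2)
        meta_grad += g2                      # first-order approximation
    theta -= outer_lr * meta_grad / 4        # update the initialization

# theta should end up near the middle of the task distribution
# (slopes in [1, 3]), so one inner step adapts quickly to any new task.
```

The point of the sketch: the inner loop adapts a copy of theta per task, and the outer loop moves the shared initialization based on post-adaptation loss.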
B: Okay, so now you've changed the initialization of this network based on how well it's doing on all of these different tasks. Then you repeat, and hopefully over time you're going to find a set of weights that does very well on all of these different tasks. And in the meta-learning case, during meta-testing, they test this on a completely new set of tasks that have not been seen during training — but they're all kind of part of the same family. Okay. So what the OML paper does is apply this basic framework to continuous learning: instead of looking at how fast it learns, they're going to look at how well it remembers previous categories.
B: The output of that is then fed to another network called the prediction learning network. What happens is that during the meta-learning phase, the theta you're going to update — the weights you're going to learn — are for the representation learning network, and during the continuous learning portion, you're just going to adapt the weights of the prediction learning network, to see how well it does for each successive task.
B: Okay, so the training loop they have is very similar to the MAML one. Here, instead of a task where you're going to learn things quickly, what you sample is a continuous learning task — I'll explain this in a second. So you sample a particular continuous learning task where you're going to learn multiple categories in a row; then, for each continuous learning task, you go through this sequence of categories, train on each one, and then compute the loss.
B: Here they look mainly at Omniglot — that's their main result — which is this dataset. I forget how many categories; I think it's something like 1600. They first have a subset of all the categories as their training subset, and then there's another subset, the test subset, which is not seen during training at all. What happens during training is that they randomly choose some list of 200 of those training classes. So these are 200 Omniglot classes.
B: Okay, so this inner loop is the standard continuous learning thing, where you have a set of tasks, you learn each task in a row, and then you want to see how well you've learned all of the classes at the end of the day. And then that becomes the inner loop for this whole meta-learning setup. In reality, if they were to actually do this, this would be a thousand steps, and you'd have to accumulate these gradients over a thousand steps.
B: So what they do is only update five of these at a time — I'm sorry, they do five classes at a time. They pick a random point within this list of 200 classes, they train on the next five classes, then they update the weights, and they iterate. So they don't do a full thousand in here.
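The sampling scheme just described — a random starting point, the next five classes as the continual trajectory, plus a handful of random previously seen classes folded into the meta-loss (the "remember set" that comes up later in the discussion) — can be sketched as follows. The names and the remember-set size are hypothetical, not taken from the paper:

```python
import random

random.seed(0)

train_classes = list(range(200))   # the 200 training classes

def sample_meta_task(k=5, remember_size=10):
    # Pick a random starting point and take the next k classes in order,
    # instead of accumulating gradients over the full sequence.
    start = random.randrange(len(train_classes) - k)
    trajectory = train_classes[start:start + k]
    # The meta-loss is computed on the trajectory plus a small random
    # sample of other classes (a hypothetical "remember set").
    remember = random.sample(train_classes, remember_size)
    return trajectory, remember

traj, rem = sample_meta_task()
```

Each meta-step then runs the inner continual-learning loop over `traj` and evaluates the meta-loss on `traj` plus `rem`.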
B: Okay, any questions? So, going back to here: the continuous learning piece is that they freeze these weights, and then this becomes a 200-head continuous learning problem, and they learn 200 categories in sequence.
B: This works really well. They show what it looks like learning 200 categories at a time, and OML does quite well: at the end of the day, it gets about 63% accuracy across all 200 classes, which, in a continuous learning setup, is really impressive.
B: Some of these — I don't remember what all of these are. I think "pre-training" is if you were to train the network completely on the training set of the 200 classes and then try to do continuous learning; it doesn't work very well. But OML performs quite well here.
B: And the one big point they make in this paper is that, by doing this, the representation they learn is super sparse, which is kind of nice. If you look at any particular output, at any particular point, after the whole meta-learning phase is learned — if you look at the output of the representation network — it's extremely sparse. The other thing they mention is that if you look at the average activation across all of the classes, it's fairly high, meaning all of the units are actually used quite well. In our spatial pooler, this is what we try to get to with boosting as well.
E: Do they get this level of sparsity even though they don't explicitly encourage it?
B: Yeah, they don't encourage it. All they're trying to do is do well on continuous learning, and so the meta-learning process must have picked weights in the representation network in such a way that you always get very sparse outputs.
B: It could have just learned weights such that — they're not using k-winners-take-all or anything like that — they might have learned weights such that the average dot product is very close to zero, and for specific types of inputs, only some of them actually get above zero and some don't. It's not clear to me exactly what's going on. Like I said, it's not explicitly designed to be a sparse network, but the meta-learning process learned that sparsity is best for this continuous learning.
B: ANML tries to take this further — they say that work is built on top of OML. It's the exact same kind of training setup and so on, but their network architecture is different.
B: And so this modulatory network is gating the outputs to the classifier — creating the activity that eventually gets to the classifier. The way I kind of think of it is that this network is learning which activations are important for a given input: it's trying to pick out some characteristics of this input and saying, based on these characteristics, specific output units are going to be more important than others. So it's kind of like an attention system — it's gating the outputs here. During meta-learning, they learn all of these weights. The weird thing is that during continuous learning, only the classifier weights are updated.
B: I was not very satisfied with this. I thought: is this really continuous learning, in that case? Well, it is continuous learning, but is it really solving the catastrophic forgetting problem? Because during continuous learning, all of these weights are fixed. So I was left slightly unsatisfied with that. I would have been happier if at least some part of this basic network was also updated.
B: So what do you think is happening in the last layer? It's just memorizing everything, yeah. When it's learning class k, only the head for class k is being updated. I don't remember if they're learning a couple of categories at a time or just one category at a time, but I think it's essentially, like you're saying, memorizing which activations it should attach to that head.
B: So I thought this was a little odd, and not fully satisfying as a continuous learning system, or as solving the catastrophic forgetting problem. But they get really good results. Here they go up to 600 categories in a row instead of 200, and you can see that ANML, during testing, retains something like 63% accuracy on all 600 classes that it was trained on.
B: Just as a reminder again: during meta-training, it's trained on hundreds of classes, where each time it does continuous learning on some subset of those classes. Once meta-learning is finished, they don't change those weights at all. It gets a completely new set of classes, and now it learns 600 of them in a row. It has never seen those classes during training.
B: So it's a completely new set of classes, and it's able to learn 600 of them in a row, where learning involves just learning the output weights. So this is quite impressive — that they were able to learn 600 classes. One really interesting twist to all this, which I just learned recently: I was in touch with Khurram Javed, who did OML, and he just found a PyTorch bug in the scripts.
B: As we know, writing these continuous learning things in PyTorch is really tricky. I think he didn't accumulate the gradients quite correctly. So he just found a bug in the scripts, and after he fixed the bug, if he puts in a single output layer — which sort of mimics the OML and ANML setup — he gets the same results as ANML.
B: At the outputs — before and after the gating here — what they see is that, after the gating, the outputs are super sparse again. So in their case also, the classifier is seeing extremely sparse activations coming in. This is the output of the basic network, this is the output of the gating signal, and wherever there's a coincidence between those two, that's presumably where there's an activation coming through.
F: I asked a question on Slack about ANML using a remember set. I don't know if you had a chance to see it.
B: A little bit of replay — yeah, that's interesting. It's been a while since I read this paper, so that could be.
B
This
one
right,
yeah
line
four
yeah,
yeah
yeah,
exactly
yeah,
so
so
the
training
on
this
new
trajectory,
as
well
as
some
subset
of
the
previous
ones,
yeah
yeah.
So
that's
that's
a
little
weird!
That's
interesting!
You
know
that
that
means
you
know
they
must
have
put
that
in
because
they
needed
it
right
exactly.
F: It's two different ways of setting up the test and, of course, if you put the remember set in the meta-loss, it could be even better.
B: What I do remember is that in OML, instead of training on all thousand things in a row, they just pick five classes at a time, and just train on four or five instances of each class at a time.
B: So it's slightly different from replay, but that information is being used. So maybe that's what this thing here is.
B: Well, it's interesting — we can look at it. I didn't go into so much detail on each one. I don't know, Quran, if you had a chance to look through that at all, if you remember any of that.
B: Yeah — so what you're saying is that in computing the meta-learning loss function, they're incorporating a few random examples. So they might learn on five classes in sequence, but of course they're going to test on those — I'm sorry, they're going to compute the meta-loss on those — but they're also using some other previously learned classes from the full set of 200 or 600.
C: Yeah, in there, yeah. Okay, that's a good point.
E: You said the bug he had was that he was zeroing out the gradient each time — was that it?
B: [inaudible]
F: Oh, I had — well. So you said it's a bit disappointing, but it's not that different — ANML from OML — because in OML, theta is the convolutional part, the first four layers, and then W is the fully connected part, the last two. So, yeah?
F: The convolutional layers are also frozen there during the inner loop, and they're updated in the outer loop. And what they did in ANML, in order to keep the network the same size: they had the four convolutional and two FC layers, and they broke it into two networks, so that the NM network has two conv and one fully connected, and the prediction network has two conv and one fully connected. So what they do is freeze the conv and only train the fully connected — but that's the same thing they do in OML.
B: Yeah, yeah. The disappointing part to me — yes, I totally agree with everything you said — the disappointing part to me is that, to me, this is just a classifier.
B: Yes, it's a linear, fully connected network, but all you're doing is training a subset of the classifier here — you're doing multi-head training, right? So you know what task it is, you're only updating those weights, and the hard work is all done during meta-learning.
B: Yeah — you know, they could have done both. I don't know how well this would work if they had two hidden layers in here, right? Yeah. I guess, in some sense, in that case this could have learned to do nothing.
B: Okay. So there's the gating that was done by the ANML approach, and in both OML and ANML they get really sparse activations, so that's kind of nice. All right — then there was this other paper that just came out, I think just a month or two ago: Supermasks in Superposition.
B: This is slightly different — there's no meta-learning here. In fact, the learning is super simple. So here you have the standard case where you're outputting a probability over all the classes, given a set of weights and an input. This is the output of your neural network, and it's trained with a cross-entropy loss. So this is just the standard setup, and in continuous learning, you have a bunch of different L-way classification tasks.
B: Masks — okay. So that means that if you think about doing continuous learning: for task one, you have a particular subset of the network that you use as your network for task one — that's what they call a supermask — and you have another mask for task two, another mask for task three, and so on. All of these masks are subsets of the same underlying, highly connected network.
B: Okay. So a supermask is a sparse sub-network with essentially binary weights that are never changed. Each supermask is specific to each task, and they make the point that this is really efficient to store: you don't need to store all of the weights, you just need to store the random seed that generated the weights, and then you store each of these masks, which are binary. Okay. So the only question now is: how do you learn these supermasks for each particular task?
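The storage argument can be made concrete with a toy sketch. Since the underlying weights are random and never trained, they can be regenerated from the seed on demand, and the per-task state is just a binary mask. The masks below are random stand-ins (in the paper they are learned per task), and the ±c weight scheme follows the signed-constant idea discussed later in the meeting:

```python
import numpy as np

SEED, SHAPE, C = 42, (8, 8), 0.1  # illustrative sizes

def make_weights(seed=SEED, shape=SHAPE, c=C):
    # Weights are fixed and never trained, so we regenerate them from
    # the seed on demand instead of storing them.
    rng = np.random.default_rng(seed)
    return np.where(rng.random(shape) < 0.5, c, -c)  # signed constants

# Per-task storage is just a binary mask (hypothetical masks here;
# in the paper each is learned for its task).
masks = {
    "task1": np.random.default_rng(1).random(SHAPE) < 0.5,
    "task2": np.random.default_rng(2).random(SHAPE) < 0.5,
}

def task_network(task):
    # The supermask selects the sub-network used for this task.
    return make_weights() * masks[task]

w1 = task_network("task1")
```

So the total storage is one seed plus one bit per weight per task, rather than one float per weight per task.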
B: This part actually was not clear to me in the paper — there are a couple of different possibilities — but they do cite a previous paper by them called "What's Hidden in a Randomly Weighted Neural Network?", and this is how they learn the masks there. So I'll describe that. I'm not 100% sure how they actually learn it in this paper, but in their previous paper, what they do is, for each edge:
B: Each edge has a weight, but it also has this score s_uv. Then, in the forward pass, they only look at the weights which have the top k percent of the scores — so they pick some subset, where k determines how large that subset is. But in the backward pass, they update all the scores with a straight-through estimator. So they update the scores only, not the weight values.
B: So, during the backward pass, a score which was not in the top k percent could suddenly move into the top k percent, and this one could now drop out. So you can actually change the connectivity dynamically through learning, and to me this is very analogous to what Marcus showed last year as well, in terms of learning sparsity through the variational techniques.
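The edge-popup mechanism just described can be sketched for a single linear layer. This is a simplified toy (MSE loss instead of cross-entropy, one layer, hand-derived gradients), not the paper's implementation; the key points are that the forward pass uses only the top-k edges by score, while the straight-through backward pass updates every score:

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(0, 1, (4, 3))          # fixed random weights, never updated
scores = rng.uniform(0, 1, W.shape)   # learned "popup" scores
k, lr = 0.5, 0.1                      # keep the top 50% of edges

def top_k_mask(s, k):
    thresh = np.quantile(s, 1 - k)
    return (s >= thresh).astype(float)

x = rng.normal(0, 1, 4)
target = np.array([1.0, 0.0, 0.0])

for _ in range(100):
    mask = top_k_mask(scores, k)
    y = x @ (W * mask)                  # forward: only top-k edges are active
    grad_y = 2 * (y - target) / y.size  # MSE gradient at the output
    # Straight-through estimator: the gradient flows to EVERY score,
    # including edges that were masked out this step, so masked edges
    # can "pop" back into the top k.
    scores -= lr * np.outer(x, grad_y) * W

final_mask = top_k_mask(scores, k)
```

Because all scores keep moving, the set of active edges can change from step to step, which is the dynamic-connectivity behavior discussed above.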
B: I don't know if it's exactly equal or not — Marcus, maybe you can opine on that — but at least conceptually it's very similar. So you can think of this score as almost being like a permanence, except here they're taking the top k percent of the permanences instead of setting a threshold on them.
A: In this case, I wouldn't describe this as the variational stuff — it's more just... but the permanence idea, and tuning it using backprop with the straight-through estimator — yes, it's exactly the same.
E: Subutai, on the backward pass, since they're updating all the scores, that means on the forward pass they basically have to treat it like a dense network, right? Because you need all the activations.
B: You need all the scores in the forward pass during training. Once training is completely finished and you're just doing inference — you don't care about training anymore — at that point, they can just store the mask for each task.
B: So I think this is uniformly set in the beginning, and then they run a couple of steps of gradient descent to minimize the entropy of the output. They're trying to find, basically, a set of coefficients such that the output is very certain. If you think about it, in the beginning the output might look like this, on the left.
B: So here, what they're doing is: if they've run a couple of steps of this gradient descent and the network is still uncertain — say you're evaluating a new example, and even after doing this, your network is still somewhat uncertain, you don't get low enough entropy — then they just instantiate a new mask and train it on that task.
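The entropy criterion can be illustrated with a toy sketch. The paper optimizes mixing coefficients over the superposed masks by gradient descent; this simplified version skips that and just evaluates each stored mask directly, picking the one whose output distribution is most confident (lowest entropy). Weights, masks, and sizes here are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    # Shannon entropy of the output distribution: low = confident.
    return -np.sum(p * np.log(p + 1e-12))

W = rng.normal(0, 1, (10, 5))                          # fixed random weights
masks = [rng.random(W.shape) < 0.5 for _ in range(3)]  # hypothetical per-task masks
x = rng.normal(0, 1, 10)              # an input whose task identity is unknown

entropies = [entropy(softmax(x @ (W * m))) for m in masks]
best_task = int(np.argmin(entropies))  # most confident mask wins
```

As E points out below, low entropy means confident, not necessarily correct — the inferred task can be confidently wrong.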
B: So here they were able to go to 2500 classes in a row, which is super impressive — though they actually didn't do it on Omniglot; they did it on permuted MNIST, which I believe is potentially a simpler task.
B: I think the impressive thing is that there's no task identity during training or inference in this chart, so they're able to handle a wider set of continuous learning scenarios, and even in this case, where there's no identity during training or inference, they can go for 2500 categories in a row.
D: Well, if they're using cross-entropy here — after the numbers get really high, isn't that kind of a very weak signal?
D: If you have lots and lots and lots of these masks, and you're trying to see which one of them is dominating.
E: So, to infer the mask, they're trying to minimize the entropy of the output distribution — but that doesn't necessarily mean that it matches the correct output distribution, right? It could be confident but wrong.
B: Yeah, absolutely, and that could be a problem, but it looks like they're getting pretty good results on this test. So this is pretty impressive.
G: In that plot, do you know what "UB" means — what that refers to?
B: Yeah, I think their upper bound is probably if you train normally on all the classes in a typical setting — what's the maximum accuracy you can get. That would be the best that this particular network could do, assuming full-on training, no continuous learning. I'm assuming that's what that means. Yeah, so I think this is during inference.
B: Yeah, so this is pretty cool — I like this idea; it's very simple. Here they can get to about 73% accuracy on a wide version of ResNet-50. So, a couple of interesting things here that they mentioned — this is the previous paper.
B: I like this chart here. What they show is that, as you make the networks wider and wider, by doing these sparse subsets you can get closer and closer to the accuracy of the dense version of the model. So this is again a case where dimensionality matters — and here also, the way they're doing it, you have many more combinations possible with a wider network.
F: We've seen another example of this idea of learning masks, coming from the Piggyback paper, in the presentation that Shin gave last year.
B: But I thought I'd just close with the same thing again. I thought there were a lot of connections to HTM temporal memory, as you're seeing with the Supermasks one as well as the neuromodulatory network. They instantiate sub-networks specific to each task. In the case of the Supermasks it's not very sparse, but in the case of ANML, at least the output that's sent to the classifier is very sparse.
B: There's an element-wise multiplication choosing what to send out; in the case of Supermasks it's very dynamic, and they use this entropy measure to decide which ones to send out.
B: I think the dendrites option could be much more flexible and much more powerful here, and if we can get to extremely sparse sub-networks, I think that could be super powerful as well. In almost all of these cases, I think they have sparse representations to avoid significant overlap, and in the Supermask case, the weights are close to binary — it's just two different values. And in their case, instead of permanences, they have this popup score, which chooses which weights to make active during learning.
B: So I thought these were kind of interesting correlations. I think they're very much in the same spirit as what was in HTMs, but hopefully we can do things in a much more flexible way. Each of these still seems rigid in some particular way, and it's not fully satisfying yet. Hopefully, if we can incorporate a lot of these principles, we can get to something that's really nice and flexible and powerful.
B: So 50 — 50%, I think, is what they find. Here's the percentage of weights that are on — this is for ImageNet — and 50 down to 30 percent is the sweet spot for them.
F: But it says "layer-wise budget" from 34 — which one is 34?
F: Is thirty percent right, or even lower? I think it's five percent — no.
B: So I think, if the sparsity was actually something close to five percent, they would have mentioned it, because they make such a big deal in this previous paper that the optimal sparsity is around 30 to 70 — yeah, the optimal density is 30 to 70 percent.
B: Yeah, I feel like here, if it was that low, they would have mentioned it. So I'm guessing it's just how much they change every — well, how much is changed every iteration is based on the backprop. So I don't know if they have a budget. I'm confused — I don't know what this layer-wise budget is.
B: Now, interestingly, they didn't look at the activation sparsity — they're just talking about weight sparsity, whereas the other two, ANML and OML, looked at activation sparsity.
E: For some of the supermasks — I guess they were only working in the regime where the weights are binary, but was there a reason for that?
B: It's either minus c or c, randomly.
F: And I think the reason, Quran, is that if you look at the original supermask paper, they started with random weights and they tried to make it simpler and simpler and simpler, to see how far they could get by just learning the mask. And they got to the point where all you needed was to set a constant, and all that mattered was whether the weights were positive or negative. So that was kind of the endpoint of the original supermask paper.
B: And I think they mentioned that it was actually inspired by Hattie's paper.
E: So in Supermasks, I think they cited two papers for the way they actually learn the supermasks. One was the Ramanujan paper — Subutai, I think you also talked about that; that's the one that uses a straight-through estimator, right? Yeah — and then this one that Michelangelo just sent: is this the other method?
F: Yes. So, just to give an overall view: the idea of this original supermask paper was following the lottery ticket paper, and what they were trying to show here is that you didn't even have to reinitialize the weights. In the lottery ticket paper, they learn the mask, they reinitialize the weights to their original values, and then they retrain from there. What they're showing here is that you don't even have to reinitialize the weights.
F: You just learn the masks, and if you reinitialize all the weights to a constant value with the same sign as the original weight — if it was originally positive, it's a plus one; if it was originally negative, it's a minus one — then you get the same results. So this was following the lottery ticket work; that's the "large final, same sign" mask criterion at the end here. If you go through the blog post later, it's actually very interesting — Hattie actually presented it to us here.
B: Yeah, it was in the Brains@Bay meetup last year.
B: I think another nice thing with this particular paper is that they were actually able to get competitive or better-than-state-of-the-art results on this really long continuous learning task, whereas I think with some of the other ones, the results weren't quite state-of-the-art — they were just doing better than, well,
B: a lot better than chance, and better than what you might expect. Whereas here, I don't think there's any technique so far that comes close to their 2500 categories in a row — and the memory usage and speed are really good; it's such a simple technique. They make a point of explaining how they can do this really fast in PyTorch and on GPUs.
B: Yeah, I guess you still have to have it fully instantiated in memory at runtime, so it still has to fit in the memory of a GPU.
B: Yeah — I keep wondering: what if you had networks that were so wide that they just couldn't fit in memory, but you only ever needed a really sparse subset of them? Could you implement it in such a way that you could run that efficiently?