Description
In our previous research meeting, Subutai reviewed three different papers on continuous learning models. In today's short research meeting, Karan reviews a paper from 1991 that he points out was referenced by all three. The paper, "Using Semi-Distributed Representations to Overcome Catastrophic Forgetting in Connectionist Networks" (http://axon.cs.byu.edu/~martinez/classes/678/Presentations/Dean.pdf), was one of the first papers to reference sparse representations in continuous learning.
B
All right, so I was reviewing this paper. Actually, let's start with Monday's meeting: Subutai presented a few papers, and they all seemed to reference this paper from back in the '90s, which talked about the use of sparse representations in continuous learning. I think this is probably one of the first papers published on that, since all of the ones he talked about used sparsity in some way. So I thought.
C
I mean, because you think about the Thousand Brains Theory, that is a semi-distributed representation of the world. It's a sparse distributed model, so I don't think they meant that here, but those words are very reminiscent of it. But okay.
D
Yeah, it's like another acronym for SDR. Yeah, distributed.
B
But here in this one, when they say semi-distributed representations, it doesn't necessarily have to be sparse. That's what I'm asking.
B
Yeah, it doesn't have to be sparse, but I think it ends up being very close to sparse, and you'll see what I mean in some of the later slides.
B
Yeah, I found it interesting that a lot of the other papers this paper cited, and some of the ones that cited it that were published around the same time, were all published in Cognitive Science. Even though they're all neural network papers, none of them were actually published at NeurIPS, and I guess that's because NeurIPS had only just started taking off at that point.
B
So
this
was
like
the
13th
cogside
conference,
but
new
york's
is
only
at
like
it's
first
or
second
conference
by
then
anyways.
Let's
start
so
so,
there's
there's
two.
I
guess
main
types
of
representations.
B
One is a distributed representation, which is very common today in neural networks: the hidden layers come up with fully distributed representations where all the nodes are being used in some way. And then you also have local representations, which are basically like one-hot encodings or something of that sort. But there's a bit of a trade-off between these two types of representations. Neural networks, with their distributed representations, get really good generalization; they're able to generalize to new inputs.
B
But on the other hand, when you want to train them to do something new, they're not able to retain the knowledge they learned on some of the earlier things, because you're updating the way the whole representation is built. Representations such as one-hot encodings, on the other hand, which are local and not distributed, are generally able to retain knowledge, but they're not very good at generalizing to new inputs.
B
So the whole goal is to get something that sort of balances that trade-off and can combine the idea of local representations with distributed representations, and that's where semi-distributed representations come in. Okay, so this is one claim that the author, Robert French, made in the paper, and it's a big one that drives the motivation for everything: catastrophic forgetting is a direct consequence of the overlap of distributed representations and can be reduced by reducing this overlap.
B
He actually defines a way we can quantify this. Say you have four features represented in a vector, for example, and one input is represented by the orange vector and another by the green. The overlap of these two representations is just the minimum of the values at each of the features. So here the minimum is just 0.2, here there's no overlap at all, here the overlap would be 0.9, and here 0.1.
B
So the total overlap would just be the average of all those values. This would be a relatively high overlap, whereas here the overlap is a lot lower, and so this one, I guess, is a bit closer to sparsity.
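A minimal sketch of this overlap measure in NumPy (the orange and green vectors are illustrative assumptions, chosen so the per-feature minima match the values read off the slide):

```python
import numpy as np

# Two illustrative four-feature representations; the exact slide values are
# assumptions, picked so the minima come out to 0.2, 0.0, 0.9, and 0.1.
orange = np.array([0.2, 0.0, 0.9, 0.1])
green  = np.array([0.8, 0.6, 0.9, 0.9])

# Per-feature overlap is the minimum of the two activations at each feature;
# the total overlap is the average of those minima.
per_feature = np.minimum(orange, green)
total_overlap = per_feature.mean()
print(per_feature)    # [0.2 0.  0.9 0.1]
print(total_overlap)  # 0.3
```

Note that for binary vectors this reduces to counting the positions where both vectors have a one, which is the SDR-style overlap D points out below.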
B
So basically what he's saying in this quote is that when you have high overlap, you're going to end up having catastrophic forgetting occur, and the reason for this is here on the next slide.
D
On the previous one, you know, if those things were binary, then this would be exactly our definition of overlap too. They would be basically equivalent: you have to have a one in both places for there to be an overlap.
B
Yeah, exactly. Okay, so we ultimately want to minimize this, and the motivation for minimizing this overlap between the representations of two different inputs is as follows. Say you have the standard model we see in all neural networks, where a unit u goes into a unit v, and the connection is given by this weight w. When you want to calculate the activation of unit v, it's just a non-linearity f applied to the total input, and that total input includes the term w multiplied by x_u.
B
To
u
x?
U
is
just
the
output
of
unit?
U
which
this
whole
thing
is
that
v.
So
when
we
want
to
back
propagate
the
errors-
and
you
want
to
compute
the
the
how
how
w
is
going
to
be
updated.
It's
essentially
essentially
comes
down
to
this
term,
where
you
have
this
x,
which
is
the
total
input
here,
which
is
the
activation
of
unit?
U,
and
so,
w
is
changing
proportional
to
what
x
is,
which
is
which
is
the
output
of?
U?
B
So say you have different inputs, or different tasks in this case. If x_u is always active, then this particular weight w is going to be changing all the time. But if you have low overlap, which is what Robert French is advocating for, then this unit is probably only going to be active, that is, x_u is only going to be non-zero or large, very few times, and so w is only going to change once in a while. That's the intuition here: if you have sparser representations, then w is not going to change as much.
B
You
know,
and
the
rule
is
that
for
one
specific
task
or
one
type
of
input
w's
only
going
to
activate
for
for
that
one
and
then,
when
you
have
a
different
different
task,
w
x
is
not
going
to
activate
at
all,
so
w
won't
really
be
changed
so
each
time
you're,
just
basically
learning
a
different
sub
network
on
the
forward.
Pass,
which
means
you're
only
updating
that
network
on
the
backward
pass.
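A minimal sketch of that intuition in PyTorch (illustrative values, not from the talk): the gradient of the loss with respect to a weight carries the presynaptic activation x_u as a factor, so a zero activation means that weight is untouched.

```python
import torch

# One unit v receiving four inputs u through weights w; f is a sigmoid.
x = torch.tensor([0.0, 0.0, 0.7, 0.0])   # sparse presynaptic activations
w = torch.randn(4, requires_grad=True)   # weights from each u into v
v = torch.sigmoid(w @ x)                 # activation of unit v

loss = (v - 1.0) ** 2                    # some scalar error to backpropagate
loss.backward()

# dL/dw_i = dL/dv * f'(w.x) * x_i, so weights whose input was zero get zero
# gradient: only the third entry of w.grad is non-zero here.
print(w.grad)
```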
C
Yeah, can I interrupt for a second here? One thing has always puzzled me. This all makes sense, but if you're changing all the w's, you're only changing them a little bit, right? You're not changing them all radically.
C
It's not clear to me why exactly everything is forgotten, why it's catastrophic. You know what I'm saying? If I change all the w's a little bit, does it still work, or does everything fall apart? Is it super sensitive to every little nuance in w? Why would that be? It's not obvious to me.
D
Yeah, that would be true if you just did one little gradient update. But usually they'll do lots and lots of these tiny changes until it works well on the new thing, and by then you've changed w quite a bit; it could be pretty dramatic. Each gradient update is very small, but often they have to do lots of them, because if you only change it a little bit, it's not going to do well on the new task at all.
B
Yeah, okay, I'm going to just move on. So they give a suggestion for how to reduce this overlap, and it's basically almost the same idea as doing k-winners. If you have this as your original input, then what you're doing is taking the features that are really large and making them even bigger, and the ones that are really small and making them even smaller. That's what this is doing.
B
Not exactly. They didn't explicitly say what alpha is, but I would assume it only makes sense that it's between zero and one. So here, say this value, 0.8, is a very high value, right, so it's closer to one. Then whatever is remaining, which it didn't get, you would sort of add on, but reduced by alpha.
C
The whole system is still a bit surprising to me: how you can actually get all these weights to be correct for the entire training set, yet it's not robust to them changing. I get it; all I'm pointing out is that it's not intuitively obvious why it's like that. But it's okay, yeah.
B
Yeah, so as I was saying, this in a sense is kind of like k-winners, because you can choose the number of features that you want to make bigger and how many you want to make smaller. It's almost a softened version of k-winners, where, I guess, if you set alpha to one, then it would just be equivalent to k-winners.
D
But it's not comparing to the other ones, is it? It's just...
B
So to know which ones are sharpened and which ones are reduced, you have to know that based on the relative rank ordering.
D
Okay, so here he's just looking at activations, he's not looking at weights at all, and he's sharpening the activations to emphasize the ones that are already higher and de-emphasize the ones that are lower. That's it, yeah.
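A minimal sketch of that sharpening rule (my reading of French's node sharpening; the exact functional form and the choice of k are assumptions): the k most active units gain a fraction alpha of whatever activation they are missing, and all the others are shrunk by the same fraction.

```python
import numpy as np

def sharpen(a, k=2, alpha=0.8):
    """Node sharpening (assumed form): boost the k winners, dull the rest.

    a_new = a + alpha * (1 - a)  for the k most active units
    a_new = a - alpha * a        for all the others
    """
    a = np.asarray(a, dtype=float)
    winners = np.argsort(a)[-k:]                          # relative rank ordering
    out = a - alpha * a                                   # de-emphasize everything...
    out[winners] = a[winners] + alpha * (1 - a[winners])  # ...then boost the winners
    return out

print(sharpen([0.8, 0.1, 0.6, 0.3]))  # -> [0.96, 0.02, 0.92, 0.06]
```

With alpha set to one this collapses to a hard k-winners selection (winners go to 1, everything else to 0), matching the comparison made above.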
B
Okay, so after seeing this, I just wanted to make the connection back to what Subutai presented on Monday. We talked about this paper, OML, where they claimed to have very sparse activations in one of their intermediate layers. After they passed everything through the representation learning network, they showed that they were getting roughly 3.8% activation, but on average all the units were activating for different inputs, so it's not like there were dead neurons here. And that made me think that in this case, since they are using ReLU here and a lot of the units have actually gone to zero, when they're doing the backward pass it probably looks something like this, where only very few weights are being updated.
B
I
guess
one
one
advantage
here
that,
as
opposed
to
regular
dense
networks,
you
know
a
lot
of
the
weights
aren't
even
going
to
be
touched
here
in
this
case.
So
so
and
then
that
team
to
do
well
for
them,
so
I
guess
there's
it's
worth
investigating
whether
or
not
this.
This
idea
of
just
updating
a
very
few
number
of
weights
at
a
time
would
be
helpful.
B
Okay
and
finally,
also
a
few
weeks
ago.
I
guess
I've
I've
shown
all
these
graphs
before,
but
a
few
weeks
ago
you
know,
I
tried
out
on
a
simple,
continuous
learning
task
at
the
different
dense
network
versus
a
sparse
network.
So
there's
no
there's
no
weight,
consolidation
penalty
or
anything
here.
This
is
just
this
is
just
a
regular
loss,
and
I
found
that
I
guess
the
the
difference
is
this
difference
here.
B
Wasn't
too
big,
but
the
sparsity
could
explain
possibly
why
I
was
able
to
get
higher
accuracy
with
the
sparse
network
on
split
amnest,
but
then
I
couldn't
get
that
on
on
the
gsc.
So.
D
Yeah, that's good. If you go back to the previous slide, what you're showing in the feedforward network is that because the activations are sparse, and the weight updates are going to be proportional to the activation, only a small number of those weights will actually get updated, right?
B
But, Subutai, going back to what you said, if we look at the update rules here: when the activation is zero, that kills the gradient right there. But the weight being zero, does that also kill it? Because it's not going to make this term zero, is it?
D
Well, in our case it's not that the weight is zero, it's that the connection is missing. We enforce that with a mask. So it's not that the weight is zero and we increment it and it becomes non-zero; the connection is actually missing. That's one thing, and then, as Lucas said, that also impacts gradient flow back to previous parts of the network as well.
E
Could
ask
a
quick
question
of
that?
If,
if
we're
applying
the
mask
we're
applying
the
mask
the
gradient,
so
the
gradients
are
being
computed,
then
being
clamped
to
zero
for
those
connections.
D
Yeah, there are a few different ways to do it. I think our current approach is just to multiply the weights with a binary mask, and that will just kill the gradients.
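A minimal sketch of that masking approach in PyTorch (a generic illustration, not their actual code): multiplying the weights by a fixed binary mask inside the forward pass zeroes both the missing connections and their gradients.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer whose missing connections are enforced by a binary mask."""

    def __init__(self, in_features, out_features, density=0.3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        # Fixed binary mask: 1 = connection exists, 0 = connection missing.
        self.register_buffer(
            "mask", (torch.rand(out_features, in_features) < density).float()
        )

    def forward(self, x):
        # Masking in the forward pass means masked weights receive zero
        # gradient, since dLoss/dWeight carries the mask as a factor.
        return x @ (self.weight * self.mask).t()

layer = MaskedLinear(8, 4)
layer(torch.randn(2, 8)).sum().backward()
# Gradients are zero exactly where the mask is zero.
assert torch.all(layer.weight.grad[layer.mask == 0] == 0)
```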
D
Yeah, that's an implementation, but conceptually the way to think about it is that it's not that the weight is zero, it's that the weight is missing; the connection is just not there. In that case there just wouldn't be any gradients, and so we implement it by suppressing them, just to take advantage of PyTorch's vectorized operations and so on. But that's just an implementation detail; you could imagine implementing it five different ways.
D
The
the
the
important
thing
is:
it's
not
that
it's
a
weight
of
zero
that
could
then
become
positive
or
negative.
Is
that
the
connection
is
just
not
there.
A
There are a few dynamic sparse methods that make use of that wasted energy, Kevin, like you said. You just separately keep track of all the gradients, even if you're not updating, and then, if at some point a connection has a lot of accumulated gradient, you let that connection grow and you put it back in the network.
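A minimal sketch of that accumulate-and-regrow idea (in the spirit of dynamic sparse training methods; the bookkeeping and the number of connections grown per step are illustrative):

```python
import torch

def regrow_step(weight, mask, grad_accum, n_grow=2):
    """Regrow the missing connections with the most accumulated gradient.

    weight:     (out, in) parameter tensor
    mask:       (out, in) binary mask, 0 = connection missing
    grad_accum: (out, in) running sum of |gradient|, tracked even for
                connections that are currently masked out
    """
    candidates = grad_accum * (1 - mask)      # only consider missing connections
    idx = torch.topk(candidates.flatten(), n_grow).indices
    mask.view(-1)[idx] = 1.0                  # put the connection back in the network
    weight.data.view(-1)[idx] = 0.0           # new connections start from zero
    grad_accum.view(-1)[idx] = 0.0            # reset their accumulators
    return mask
```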
D
Yeah, and the intuition there from a neuroscience standpoint is that it's kind of like Hebbian learning: if these two things are kind of working together, but there's no connection, then you would want to grow a connection there. That's the neuroscience intuition for it.
B
Yeah, I think what I like about it is that it connects sparsity with continuous learning, so I think it's a promising direction also.
B
Go ahead. I just wanted to mention, it just came to me, that I think there was another method proposed for obtaining this sort of update, where they compute the gradient as usual, but then on the backward pass they're only taking the top-k partial derivative values and zeroing out everything else. That way you're only penalizing the weights that explain the loss the most.
D
Was that just another method? Who was doing that, was French doing that, or...?
B
No,
that
was
that
was
another
paper
quarter.
I
can.
I
can
link
you
to
that,
okay,
but
but
that
that
was
another
technique.
Someone
had.
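A minimal sketch of that top-k gradient technique as described (a generic illustration, not the specific paper's code): compute gradients as usual, then keep only the k largest-magnitude entries per parameter tensor before the optimizer step.

```python
import torch

def keep_topk_grads(params, k=10):
    """Zero out all but the k largest-magnitude gradient entries per tensor."""
    for p in params:
        if p.grad is None:
            continue
        g = p.grad.flatten()
        keep = torch.topk(g.abs(), min(k, g.numel())).indices
        sparse_g = torch.zeros_like(g)
        sparse_g[keep] = g[keep]
        p.grad = sparse_g.view_as(p.grad)

# Usage after loss.backward():
#   keep_topk_grads(model.parameters(), k=10)
#   optimizer.step()
```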
A
I have a question about that slide; I don't know if you know the answer. In your slide you have three methods there, like pre-training, OML, and then the one in the middle, SR-NN, and that one shows a lot of dead neurons, right?
B
Yeah, here they were. Why?
C
To me it's a little confusing the other way around, because you know we found with the spatial pooler that you ended up with dead neurons until you had a boosting function, and so that seems like a natural thing to do. So I was surprised that all the units were active in the OML and the pre-training cases without any kind of boosting. That, to me, was surprising.
D
Yeah,
that's
the
one
of
the
points
they're
making
in
this
chart
is
that
through
oml
they
were
able
to
get
something
like
boosting,
but
the
way
they
did
it
is
they
used
an
evolutionary
approach
to
train
the
sparse
network
so
that
it
worked
well
on
continuous
learning.
So
it's
kind
of
it
was
this
very
sort
of
computationally
intensive
approach
to
come
up
with
the
sparse
representation
such
that
it
worked
well
for
continuous
learning
and.