From YouTube: Review: Superposition of many models into one
Description
For this Friday's journal club, we will be looking at a recently published paper from Bruno Olshausen's group at the Redwood Center. It proposes an algorithm that is more plausible in terms of biological constraints. It also ties in well with the latest discussions on continual learning, providing an ingenious and elegant approach to the catastrophic-forgetting problem in multitask learning.
Here is the link to the paper: https://arxiv.org/abs/1902.05522
We'll spend maybe 10 or 15 minutes on this; it's a relatively straightforward paper. It's a very recent paper from Bruno's group at the Redwood Center on parameter superposition, and they're trying to solve a problem in continual learning called catastrophic forgetting. In the general continual-learning paradigm we have multiple tasks, or tasks changing over time: we're shown, say, a batch from task one, then task two, and so on. What typically happens with traditional neural networks is that you train on task one and get good accuracy, and then, as you move on to the next task, the accuracy on the first one collapses.
That's the problem in continual learning. I think I understand this, but I'll ask it anyway — the assumption here is that there's some sort of simulated-annealing-like process going on, where training has to balance out the examples across the different categories? Otherwise, you know... it didn't have to turn out this way.
The general assumption behind most machine-learning algorithms is that learning isn't a time-based thing — that the order of the data doesn't reflect anything, doesn't reflect a shift in the world's statistics. If I'm doing image classification, almost all formulations are not saying that more recent images matter more to the network than the older ones; at training time the examples are just assumed to be interchangeable.
They talk about two different ways that we get task shifts. One is that the inputs are changing distribution. This is the fashion-MNIST example, where you have, say, pants that are rotating: you're actually rotating the images slightly over time. So you might have zero degrees here, two degrees here, four degrees in task three. That's one type of task change, and the other is changing the labels of the dataset.
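The rotating-inputs setup can be sketched like this. This is my own illustration, not the paper's data pipeline — the function names and the nearest-neighbour rotation are assumptions, kept numpy-only for brevity:

```python
import numpy as np

def rotate_batch(images, angle):
    """Rotate square images about their centre by `angle` radians,
    using nearest-neighbour sampling (illustrative, not anti-aliased)."""
    n, h, w = images.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # inverse rotation: for each target pixel, find the source pixel
    sy = cy + (ys - cy) * np.cos(angle) - (xs - cx) * np.sin(angle)
    sx = cx + (ys - cy) * np.sin(angle) + (xs - cx) * np.cos(angle)
    sy = np.clip(np.round(sy).astype(int), 0, h - 1)
    sx = np.clip(np.round(sx).astype(int), 0, w - 1)
    return images[:, sy, sx]

def rotating_task_stream(images, degrees_per_task, n_tasks):
    """Yield one dataset per task, each rotated a bit more than the last:
    task 0 at 0 degrees, task 1 at `degrees_per_task`, and so on."""
    for t in range(n_tasks):
        yield rotate_batch(images, np.deg2rad(t * degrees_per_task))
```

Each yielded dataset plays the role of one "task" in the continual-learning stream: the input distribution drifts while the labels stay fixed.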
So in the CIFAR case you might have ten labels from CIFAR-10, and then the next task might be another ten labels chosen from CIFAR-100, and so on — there it's actually the outputs being produced that are changing. In both cases there's distributional shift, on the label side versus the input side, and we see this catastrophic-forgetting problem in continual learning in both cases. So that's the general background.
The premise behind this paper is that neural networks in general are over-parameterised: the parameter space is extremely high-dimensional, but the inputs of the things we're trying to model sit on a low-dimensional manifold — a much smaller dimensionality living within the larger parameter space. That means the networks potentially have room to pack multiple manifolds into the same large parameter space, one for each task, without any cross-interference. So at least the possibility is there.
So what they're going to do is choose a particular context matrix for every single task, and this is fixed — these aren't learned parameters — and they're going to use these both to store updates to the weights and to retrieve the weights for a particular task. So when we're on task one, we're going to have C1, and C1 is going to take every input and rotate it, and then the weight updates are going to be, sort of, addressed into the parameters based on this context matrix. And we don't have to infer this — it's given: we know we're on task one, so we're going to be using C1, and we know that as soon as task 2 starts we have a totally different matrix, C2 — a different address, essentially.

It's sort of like an address into a parameter space — that's how I've been thinking about it. We have a huge parameter space, and the context is transforming the parameters (or the inputs) into one section, one subspace of the parameter space, that's going to be used just for that task. We're trying to subdivide the parameter space so the tasks don't interfere.
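Here is a minimal numpy sketch of that store/retrieve idea, assuming the simple ±1 (binary) contexts discussed below. All names are mine, and summing pre-bound weight vectors stands in for the paper's gradient-based storage:

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, n_tasks = 20000, 5

# one fixed random +/-1 context per task -- chosen up front, never learned
contexts = rng.choice([-1.0, 1.0], size=(n_tasks, n_params))
task_weights = rng.standard_normal((n_tasks, n_params))

# "store": bind each task's weights with its context and sum everything
# into a single superposed parameter vector
superposed = (contexts * task_weights).sum(axis=0)

def retrieve(k):
    # "retrieve" task k: unbind with the same context (c * c = 1 elementwise);
    # the other tasks survive only as zero-mean noise that shrinks relative
    # to the signal as n_params grows
    return contexts[k] * superposed

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Retrieving with the right context yields a vector clearly correlated with that task's weights, while a mismatched task's weights are nearly orthogonal to it.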
How do you choose these context matrices? They're just transformations. A really simple version of this is a binary context, where you take every element and you either multiply it by 1 or by negative 1, and you randomly choose where it's 1 and where it's negative 1. What this is doing is a rotation on the inputs — an orthogonal rotation — so it's projecting the inputs through an M-by-M matrix that acts as an address as you're learning task 2.
Right — you might actually end up with sparse representations this way, because, if I understand it correctly, for task one, if you're sending in inputs, some set of dot products will be positive and those values will be active; if you're on task 2, it's an orthogonal rotation of the weights, and it'll be a different set of values that gets activated — a different set of ReLUs.
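That intuition can be checked with a tiny numpy sketch (my own illustration, not code from the paper): the same input bound with two different random ±1 contexts drives largely different sets of ReLUs.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 1024, 512

W = rng.standard_normal((n_hid, n_in)) / np.sqrt(n_in)
x = rng.standard_normal(n_in)

# the same input seen through two fixed +/-1 contexts: h = relu(W (c * x))
c1 = rng.choice([-1.0, 1.0], size=n_in)
c2 = rng.choice([-1.0, 1.0], size=n_in)
h1 = np.maximum(W @ (c1 * x), 0.0)
h2 = np.maximum(W @ (c2 * x), 0.0)

# about half the units fire under each context, but *which* half is nearly
# independent across contexts, so the two tasks use different unit subsets
active1, active2 = h1 > 0, h2 > 0
overlap = (active1 & active2).sum() / active1.sum()
```

With independent contexts the overlap between the two active sets sits near 50%, i.e. chance level, rather than near 100%.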
I've long struggled to understand the nature of high-dimensional spaces — they're really, really weird, right? It's really hard to visualize them or think about them, especially when they're binary or very low resolution. It was only when you actually laid out this sparsity argument — which of course you could reach other ways too — that I could draw the conclusion that this is really the only way to think about these spaces so they don't collapse.
That is, in fact — so in some sense I'm saying, if you're going to solve the problem, you're going to end up there. I'm testing my own understanding; I'm not saying I don't follow your argument — I'm just saying that, coming from a different direction, it feels like it has to end up doing this.
Well, we don't want that — we wouldn't do it that way — but it's mathematically possible, and you still get the same properties. It's just that now, instead of having twenty synapses, you'd have thousands of synapses on each connection, which we wouldn't want. And that's kind of what they're doing here.
Okay, so the merit always depends on this property of neural nets: that they learn a low-dimensional input space — that the input space is low-dimensional. I think this might be a good heuristic for finding the manifold, but I'm not sure it's the only one, right? This is a demonstration of a non-sparse way of actually identifying these manifolds and then rotating them away from each other.
So it's a point in a very high-dimensional space, and you've trained it on task one, and now you're training it on task two, and you have a choice of how to update it. If you look at task 1, it may be that there's a whole bunch of directions you can move it in which would not affect task-one performance, and a whole bunch of directions you can move it in which would.
Is it possible to achieve all these results using point neurons? Basically, if point neurons did not support this — is it just harder, or is it impossible? We know the brain solves it; we know there's an elegant solution, and it's how the brain works. So really the question is: can we achieve this that way? If you can achieve all the same results that way, then fine; if you can't, that tells you something. So it's worth asking these questions and trying to keep them in mind.
And then we forget things based on heuristics — things that have to do with how recently something was used, whether you reinforced it, that sort of thing. We model that with this permanence notion, which really just reflects — yeah, it provides a combination of recency and number of times reinforced. We understand the real biology is a little more complicated, but yes, there's something there. So we just keep learning, and we don't really forget until we kind of max out. Now, obviously the brain actually does forget things on a regular basis.
Some of them are properly rotation matrices — I guess that's a subset — but in general they're just transformations: we're doing some sort of warping or rotation of the space to get each task's parameters into distant parts of the parameter space. The problem with this one is that it's essentially a dense rotation matrix, so every one of these entries is populated, meaning that you have to add n-squared parameters for every additional task you learn — and so this was inefficient.
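The arithmetic behind that complaint, as a quick sketch (the layer size is an arbitrary example):

```python
# per-task memory cost of the context for one n x n weight layer
n = 1024
dense_rotation_entries = n * n   # a full orthogonal context matrix: O(n^2)
diagonal_context_entries = n     # binary/complex diagonal context:  O(n)
ratio = dense_rotation_entries // diagonal_context_entries
# the dense rotation needs n times more stored values per task
```

That n-fold gap per task is what motivates the diagonal (binary and complex) contexts discussed next.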
We were a little bit confused by why this is so problematic, since these aren't learned parameters — they're hyperparameters that are fixed before training starts — so it still seems like this could be potentially useful; but they only briefly reported results for this one. They tried the complex and the binary versions — the three kinds of context they talk about — and studied ways of producing contexts.
I don't have a great intuition for this. The way it works is that it's a diagonal matrix, like this, and every one of the diagonal entries is a different rotation on the unit circle, so it ends up being a big transformation that is really a set of single rotations, each a phase from negative pi to pi. And then the binary version is just the same thing where you're sampling uniformly at random from negative one and one, so you just get a binary value on each axis.
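A small numpy illustration of those two context types (my own sketch; the paper's notation differs): each complex component is a unit-circle phase, binding is elementwise multiplication, and unbinding uses the conjugate.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2048

# complex context: each diagonal entry is a point on the unit circle,
# with phase drawn uniformly from (-pi, pi]
phi = rng.uniform(-np.pi, np.pi, size=n)
c = np.exp(1j * phi)

# the binary context is the special case phi in {0, pi}, i.e. c in {+1, -1}
phi_bin = rng.choice([0.0, np.pi], size=n)
c_bin = np.exp(1j * phi_bin)

# bind an input with the context, then unbind with the conjugate:
# conj(c) * c = 1 on every component, so the input comes back exactly
x = rng.standard_normal(n).astype(complex)
recovered = np.conj(c) * (c * x)
```

Because every entry has unit magnitude, the transformation is norm-preserving, which is what makes it a rotation of each axis rather than a scaling.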
Yeah, I think this is it — they were seeing pretty good, state-of-the-art results with the complex and the binary context selections. As for limitations: to me, one of the big limitations — and this is outside the scope of this paper — was that these contexts have to be predefined, and in the real world, of course, we have no idea what the context is. We don't even want to need a label for it. So it would be nice to be able to just say we're never told the task. You know.
There's nothing of their work here, nothing to share — this was just me trying to wrap my mind around how these transformations are working. The thing about this is that, instead of thinking of it as doing different transformations on the parameter space, you can look at it as: every time a training sample x comes in at time t, you actually transform x itself, and it's the same thing.
The thing I liked most was this idea of context collapse: an input x, each task with its own context — and, for example, in the tasks we're training on, T1 and T2 are actually identical, yet we're still going to get these sparse representations in totally different subspaces; it's an artifact. Yeah — instead of giving it zero degrees and two degrees, I just gave it two degrees and zero degrees, so it's actually the same exact distribution, but I told it that it needs to choose a different context matrix per task.
Think about it in terms of some real-world stream: in some cases some of the inputs change, and only the memories of what happened — the temporal memories — would get altered; the other stuff would not get affected. So again, no catastrophic forgetting, and the learning is contextual to the stuff that actually changed. You don't need context matrices, you don't need anything — it just does it automatically. And if the context drifts, similarly, it's automatically going to be that only the learning of the related stuff gets updated.
I won't explain it myself — I'll just say: okay, you take A and mix it with B, and that's it. You could think of taking context A and context B and blending them together, and then I have this new context and I can start to learn using this new context. It's like a merge — I'm using that analogy, it looks like that. So I have a starting point to learn this new task.
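One way that blending could be sketched — purely my speculation, since the paper doesn't define a merge operation — is to sum two binary contexts and snap the result back onto ±1:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4096

cA = rng.choice([-1.0, 1.0], size=n)
cB = rng.choice([-1.0, 1.0], size=n)

# merge: add the parents and snap back onto {-1, +1}; where they disagree
# the sum is zero, so a tiny random tie-breaker picks a side at random
tie = 0.1 * rng.choice([-1.0, 1.0], size=n)
cNew = np.sign(cA + cB + tie)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
# cNew agrees with each parent on about 75% of components (cosine ~ 0.5),
# so a model starting from it sits partway between the two tasks' subspaces
```

The merged context stays a valid ±1 context while remaining correlated with both parents, which is the "starting point between two tasks" intuition above.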
How would it automatically figure out the context? That's not in this paper, right — but say we use heuristics like: this part of the distribution feels like it's in a totally new area that we haven't seen before, but that part of the distribution is very similar; so we say, let's take part of this context and part of that context. But how exactly that happens is the open question.