From YouTube: Selfless Sequential Learning
Description
A very interesting and recently published paper at ICLR 2019 studying the impact of sparsity in the context of continual learning:
https://openreview.net/forum?id=Bkxbrn0cYX
Related: Continual Learning via Neural Pruning
Siavash Golkar, Michael Kagan, Kyunghyun Cho https://arxiv.org/abs/1903.04476
A
Okay, folks, we're going to start in just a few minutes. We're transitioning from our stand-up meeting, which just ended, to the research meeting, so just give me a couple of minutes. I just wanted to come live and let you know that we're about to start. I don't think you can hear anything except me right now. I can see the office, but I'm not going to switch to it until they're ready, so you just get to stare at me in my office for a few minutes.
A
This is a series of paper reviews we're doing, a journal club. I hope you're enjoying them. If you are, please be sure to like the video and subscribe to our channel. We're doing this about once a week, or at least trying to. That's my agenda; you can ignore it, but I might be live-streaming some of this stuff later today on Twitch.
A
Just bear with me a minute here, we're almost there. We're just trying to get a version of the document that's marked up already, so it can be displayed for you and we can add emphasis to whatever we're talking about. Hello, Cody; hello, Chris; thanks for joining. I don't know why my chat only sometimes works; it seems to be working now, yeah.
E
Thank you, Marcus. Well, I hope you read it too. On this, just as a disclaimer, I'm not one of the authors of these papers. I do my best to try to summarize the basic ideas of this work, but feel free to interrupt and join the discussion; we'll see where it goes.
E
Essentially, this paper is studying the idea of sparsity, and the importance of sparsity, in the context of continual learning. This idea of sparsity of the connections, or of the activations, so that the representations are sparse, has been around deep learning and machine learning for many years now, and many people think it is important for several reasons, for generalization and to avoid overfitting.
E
But I think this is one of the first works, published at ICLR 2019, that tries to go more in depth about sparsity in the context of continual learning, that is, learning continuously. The basic idea here is that if you have a neural network with a fixed capacity, gradient descent optimization often leads to a saturation of the entire network when you learn from a specific training set or batch.
E
Looking at the contribution of each weight: it's not that you learn from a few examples, and use just a portion of the network; essentially you always use all that you can of the network. Another interesting idea that has been around is that if you increase the number of parameters, even though the problem is the same, you actually get better results, just because of this overparameterization of the network.
E
This is something that has been explored a bit, and it's also known that from a network that is overparameterized you can distill the knowledge into another, smaller network and still be able to reach almost the same level of accuracy. You can find this idea in the pruning works as well, etc. I think this overparameterization is inherent in the concept of gradient descent and the optimization that comes with it.
E
Okay, so essentially the paper is a comparison of how different regularizers and activation functions impact this process. They propose a new regularizer, a new sparsity regularizer, let's say, that is already designed for continual learning, and it's very loosely inspired by biology, in the sense that they account for neurons that activate at the same time in nearby locations of the activation space. So I thought it was a very good paper.
E
I think it's very easy to follow, and I really enjoyed the introduction. Some people, especially on the OpenReview platform (you can find the reviews online), said the introduction was maybe too long or something, but I think it's quite nice and interesting. You have a lot of references here pointing out different concepts in the sparsity literature, I would say. And something I would also like to discuss with you is the idea of sparsifying the representations.
E
I don't remember my own comments. "Sparsity is key for the ideal representation", yeah. That's also very interesting, the idea that with standard gradient descent you get very entangled representations. Essentially, if you look at the importance distribution of the weights, you can see that some weights are more important than others, but if you count them, you see that even for small tasks, small datasets, you use more than 50% of the weights; they are important for that particular task.
E
So this figure is just a cartoon image, I would say, of the ideas of sparsifying the parameters, so the connections between neurons, versus sparsifying the representations, and they argue that sparsity of the representation is more important for continual learning in this case, to disentangle the knowledge that we are encoding into the network.
D
They basically were using ray tracing and lowered the number of samples they had; you could use the sparsity to deduce what the rest of that structure would be. So, if you take it from a pure signal-processing point of view, it would be enough to ask: what is enough of a structure that you recognize it as legit?
E
So I think there are a lot of different proposals in this area, I mean sparsity in general for continual learning. The other paper we also suggested reading for today is using that concept as well, but I don't remember any other continual learning paper, at least to the best of my knowledge, working specifically on sparsity for continual learning, even though I guess it was something many people were thinking about, because the capacity saturation is well known.
F
I would say that in general, neural activations are sparse, for whatever reason, but there are also brain areas that specifically do organize sparse codes: for example, the hippocampus, which is really important for memory acquisition, particularly long-term memory acquisition, and then driving the consolidation process. It has an input area called the dentate gyrus, which has been studied a lot with respect to its input-output dynamics, and it's known that its function is really sort of a pattern orthogonalizer: it makes highly separable...
F
...representations of the input activations from entorhinal cortex, which might have a lot more overlap. Then what comes out of there for the actual memory-encoding process, if you think of the deep structures of the hippocampus, is a lot more pattern-separated, allowing you to learn independent episodes and for memories to not interfere with each other. But that is in part because it's a very small brain area, right; it's like 2.5 million pyramidal neurons in the main auto-associative field.
F
There you learn patterns through the auto-associative recurrent connectivity with highly plastic synapses, which then encode the pattern, and even then that's a transient process, right; the hippocampus does forget, and through the consolidation process it eventually becomes irrelevant. So there are some brain areas where people have looked at the input-output characteristics and concluded: yes, this brain area mostly does orthogonalization and pattern separation. Those two exist; I don't know how widespread that is.
C
Yeah, so Marcus worked on something called lateral pooling, which was taking a spatial pooler and giving it this extra property where the individual mini-columns try to decorrelate with each other. So he did work on that project for a while, assuming that would be better.
E
So f of theta n, where n here is the current task you want to solve. You have the standard loss function trying to minimize the prediction error, and then you add another term that is going to regularize your learning process. And again, as we have seen before, this is weighted with a static real value here.
E
The current weight is the same weight you're trying to modify, to move in the direction that reduces this loss, right. And the idea is that as you move far away from the optimal weights of the previous task, and this weight was very important for the previous task, this penalty is going to grow, to explode, right.
E
So that's the concept, and it was shown in 2016 by elastic weight consolidation from DeepMind, and by synaptic intelligence from Ganguli's group at Stanford, and our own work later on; many people have put some effort into this idea of regularizing learning over multiple tasks or batches.
E
So this is an easy fix, right. The lambda term here just says how much you want your weights to remember the past versus just learning the current task. So if this lambda-Omega term is high, you're saying: okay, I really want to remember stuff.
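As a rough sketch (not the authors' exact code), the importance-weighted quadratic penalty being described, in the style of EWC and synaptic intelligence, might look like this; all names here are illustrative:

```python
import numpy as np

def regularized_loss(task_loss, theta, theta_prev, omega, lam):
    """Importance-weighted quadratic penalty in the style of EWC /
    synaptic intelligence: parameters that were important for previous
    tasks (large omega) are pulled back toward their old values.

    task_loss  -- scalar loss of the current task
    theta      -- current parameter vector
    theta_prev -- parameters at the end of the previous task
    omega      -- per-parameter importance estimates (>= 0)
    lam        -- trade-off between remembering and learning
    """
    penalty = np.sum(omega * (theta - theta_prev) ** 2)
    return task_loss + lam * penalty
```

A large `lam` makes the network prioritize remembering; `lam = 0` recovers plain single-task training.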
E
Yeah, I guess it depends: if that importance term is taking into account all the previous tasks, then you need just one regularization term; otherwise you need multiple of them. I don't remember exactly how they do it, but the original proposal from DeepMind was to have a regularization term for each task.
E
Essentially, this is for the second task, let's say; then when you move on to the third task, you need to add another regularization term that is relative to it, yes. Or you can compress the importance values: as you said, you can say, okay, let's say I'm in a situation in which I consolidated the weights for all the previous tasks; then I can say, okay, now all these weights are preserved already.
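The compression being described, folding per-task importance estimates into one running estimate so a single regularization term stands in for all previous tasks, could be sketched like this (the function name and the choice of merge rule are assumptions for illustration):

```python
import numpy as np

def consolidate_importance(omega_running, omega_new, mode="sum"):
    """Merge the importance estimates of the task just finished into a
    single running estimate, so one regularization term can replace
    keeping one term per previous task."""
    if mode == "sum":
        # A weight important for any past task stays protected.
        return omega_running + omega_new
    if mode == "max":
        # Protect each weight at the level of its most demanding task.
        return np.maximum(omega_running, omega_new)
    raise ValueError(f"unknown mode: {mode}")
```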
E
You preserve as much as possible the weights that have been important in the past, and they cannot change; it's like freezing those parameters, right? So the idea is that if they have been important in at least a few of the previous tasks, then you want them to not change, right.
D
As opposed to identifying the structure of these tasks and preserving that, I'm going to assume it all kind of falls out that, on average, this thing is important. But that assumes these can be treated as independent attributes, as opposed to there being something distinctive about this particular combination, this set of parameters, such that if I start building things on it...
D
What I'm saying is that I think there's a road to examine that assumption and see how much you lose by making it. I don't know, maybe you partition the parameters in some arbitrary way, keep them separate, and then see how much loss you get when you repartition these things, yeah.
E
It's a trade-off, in a sense: how much structure you want to impose on the network versus how much you want it to find whatever structure by itself, yeah.
E
And this proposal itself, especially from a math background and perspective, is very nice, because you have just a loss function, end to end, to optimize, without imposing any structure.
E
Okay, so this was the basic idea of controlling forgetting through regularization, pure regularization. But then their proposal was to add an additional regularization term controlling, let's say, the sparsity of the representations at all possible layers; it was called SSL. And then they say: okay, now let's go deep into this, because it can be implemented in different ways. So the first proposal is this one.
E
Here, essentially, we are looking at a specific layer. You have several neurons; H are the activations. This term is just, well, the number of units we are considering, so we can ignore it, but the idea is that you have several activations and you look at all the possible pairwise combinations of activations in the layer.
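A minimal sketch of this kind of pairwise activation penalty (not the paper's exact formula; names are illustrative): each pair of distinct units is penalized by the product of their average activations, so units that tend to be active for the same inputs inhibit each other, pushing the layer toward a sparse, decorrelated code.

```python
import numpy as np

def neural_inhibition_penalty(H):
    """Sparsity penalty over all pairs of units in one layer.

    H has shape (batch, units): activations of a hidden layer over a
    mini-batch. Sums, over every pair of different units (i, j), the
    product of their batch-averaged activations.
    """
    mean_act = H.mean(axis=0)                    # expected activation per unit
    total = np.sum(np.outer(mean_act, mean_act)) # all (i, j) products
    diag = np.sum(mean_act ** 2)                 # i == j terms
    return total - diag                          # keep only i != j pairs
```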
E
I think Rob's approach is much more structural in doing this, right. Instead of finding compromises between constraints that you posed, the structure of constraints is already there, in the sense that it's not like you're putting in constraints and then saying: okay, now solve the problem globally with these constraints. It's much more that the constraints are already embedded into the local parts.
F
If the difference is zero, you're guaranteed to have the full penalty, right, and then it decays off with an exponential term. Which is obviously not exactly how a real neuron works, because at, like, a meter out, the connection probability is not some very small number, it's actually zero; there's a physical limit to how big a neuron can be. But it's a decent approximation of the axonal and dendritic overlap over space: the connection probability roughly decays exponentially in the brain.
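The locality weighting being described, full inhibition at distance zero, decaying with index distance, can be sketched as a mask for the pairwise penalty above; a Gaussian decay is one common choice, used here as an assumption rather than the paper's exact form:

```python
import numpy as np

def local_inhibition_weights(n_units, sigma):
    """Locality mask for a pairwise activation penalty: inhibition
    between units i and j is 1 at distance zero and decays with their
    index distance, loosely mimicking connection probability falling
    off with physical distance in cortex."""
    idx = np.arange(n_units)
    d = idx[:, None] - idx[None, :]              # pairwise index distances
    return np.exp(-(d ** 2) / (2.0 * sigma ** 2))
```

Multiplying this mask elementwise into the pairwise penalty makes only *nearby* co-active units inhibit each other, so distant units can form separate, non-competing groups.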
D
This analogy basically says: I'll include the sample if it is close, distance-wise, to the value that I started with. So you basically have this way of saying what the neighborhood is, but this is a neighborhood at an arbitrary cut along these things, as opposed to a two-dimensional one where you say: yeah, I've got this cluster here, and there is some reason to believe that these things behave similarly, right. So that's what bothers me.
E
It said somewhere, I didn't notice where, but maybe they are just doing that for simplicity; they only talk about hidden layers, so I don't know what they mean by hidden layers in this scheme, because I've seen that in their experiments on Tiny ImageNet they also had convolutions, and, well, I don't know.
E
And then, obviously the results, but essentially they want, of course, to reuse the knowledge that the network already formed in the past, and they also want to form new knowledge. So you possibly want new neurons recruited to recognize new patterns, right, and they've seen that with this particular implementation of the loss function, as it is for now...
E
...if you move on to new patterns you want to learn, essentially the weights that have already been trained, these neurons, tend to fire again and to somehow inhibit all the other weights. So you have the same problem, and you know this as well; I think your paper deals with this idea.
F
In your brain that doesn't happen, because the resources are finite: you have synaptic depression, which depletes transmitter, for example, and you have neural adaptation, which means neurons that are too active are going to be inactive for a while; you have the refractory period: a neuron that has just fired cannot immediately fire again. There are tons of mechanisms in the brain that make sure no resource can be overly important and overused, and there's a biological limit to how fast you can fire. Yeah, like I said.
E
To give a bit of introduction: I think the first version of the paper was not that focused on the empirical evaluation, and then, as the ICLR reviewers asked, it became much more substantial, and now we have here three different benchmarks, so at least CIFAR and Tiny ImageNet. So now, if you work with them... but I guess they are still considering a maximum...
E
Then you have a few classes you want to distinguish, and 100 means there are 100.
E
Because it's RGB, it has to be, right. So I guess I would prefer them to use something much more complex, to see if the idea really scales; you'd want to scale this experiment up a bit, because it's not really clear to me why they work that differently among each other. We can see that later, but essentially what you're seeing here are the results on MNIST of different sparsity regularizers, let's say, and activation functions, and their own proposal.
E
Here you see five different tasks on MNIST over time, where every task is to distinguish between two different digits of MNIST. You know the dataset, I'm sure you've seen it: handwritten digits, grayscale. In each task you want to distinguish two different digits, right, and you move from one to the next, I think.
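The split-MNIST protocol being described (10 digits turned into 5 sequential binary tasks) can be sketched like this; the function name and return format are illustrative, not from the paper:

```python
def make_split_tasks(labels, classes_per_task=2):
    """Group class labels into a sequence of tasks, e.g. the 10 MNIST
    digits -> 5 binary tasks (0 vs 1, 2 vs 3, ...). Returns, per task,
    the indices of the samples belonging to that task's classes."""
    all_classes = sorted(set(labels))
    tasks = []
    for start in range(0, len(all_classes), classes_per_task):
        task_classes = set(all_classes[start:start + classes_per_task])
        tasks.append([i for i, y in enumerate(labels) if y in task_classes])
    return tasks
```

The model is then trained on `tasks[0]`, then `tasks[1]`, and so on, never revisiting earlier data.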
F
What stands there is the accuracy for every strategy on every task at the end of training on the sequence of all these tasks, right. And here you can see... I don't know why they didn't put, let's say, a few extra bars here for the average; they put it in the legend instead. But essentially you can see in the legend the average accuracy over all the tasks.
E
Yeah, L1 is red. What's most interesting, at least to me, is to see the difference with respect to the basic ReLU activation. And then another thing the critics raised about the evaluation is that there's no comparison with static training on the dataset with their regularizers, so you can't really tell whether the improvement is just because your network is a better one in general.
E
It's okay that they tested their algorithms only in the continual learning setting, but what I am saying is that maybe these regularizers of theirs would bring advantages even in a standard, static training, so not in a continual learning setting. And then this is saying: okay, yeah, it's cool to use it because it's better, but it's not really helping me see the impact of these proposed algorithms specifically in the context of continual learning.
E
It's like, I don't know, let's say we didn't have convolutions; then I say: okay, you have this task...
E
...a task interesting for continual learning; then I say: okay, I now introduce convolutions for the first time, and they say: okay, it works much better on continual learning, it's a solution for continual learning. And you say: oh okay, that's cool, but if convolutions also work great on a static training set, it means convolutions by themselves are great for learning in general, not specifically for continual learning.
E
I mean, to see if maybe they are just better, so the accuracy in the end is better just because they are better in general. The reviewers pointed out some of this too; they said: yeah, well, the point is working on continual learning, okay, but still. One of the reviewers said that, and I mean...
F
The exact question to the field is, right: given that continual learning is an additional challenge to a system, like you don't know all of your data upfront, it's acquired along the way, right, is there an actual penalty for that? I mean, it's clear that many systems can't do that, but if you build a system that can, how much of a performance hit do you actually take for building such a system?
E
You can't always tell that. There's a later experiment in this paper in which they assess the performance with all the training sets trained together in a multitask fashion, joint multitask training, as they call it. But that's just one place where you can find this information, not for every possible experiment, and I think this is a...
E
It would have been very useful, at least for me. And yeah, another thing is that, frankly, I don't know: we're talking about a difference of three percentage points in accuracy, so it's not so easy to tell. It seems like on this scale that matters, but I don't see that they have done several runs; I don't think it's an average over different runs where you change the...
E
It would be interesting to see the standard deviation, to see if that three percent really has value or just doesn't mean that much. But yeah, of course. Many people discussed this at the NeurIPS 2018 continual learning workshop; people were saying: okay, just don't use MNIST for continual learning, because it's really deceptive, but even for deep learning in general, I mean, right.
E
Another problem here is in this particular setting of permuted MNIST, where you have this permutation. I don't know if anyone has shown it, it would be nice to show it in a paper, but we all know that essentially the network can change just the first layer to learn the permutation, and everything else is fine, right. So you don't have to learn a new task; you just learn the permutation.
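A permuted-MNIST task is built by applying one fixed random pixel permutation to every flattened image; this little sketch (names are illustrative) makes the preceding objection concrete, since undoing a fixed input permutation is exactly a first-layer weight reshuffle:

```python
import numpy as np

def make_permuted_task(images, seed):
    """Permuted-MNIST style task: apply one fixed random pixel
    permutation to every (flattened) image.

    images -- array of shape (n_samples, n_pixels)
    seed   -- identifies the task; same seed, same permutation
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(images.shape[1])
    return images[:, perm], perm
```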
E
Maybe, I don't know, for these plots... and then for the Tiny ImageNet split I don't really buy it. In any case, they say that essentially the improvement now is over no regularization at all. It goes from four to eight: four in the CIFAR experiment, I guess, and almost six or eight in the Tiny ImageNet benchmark. So I tried to look into the paper.
E
...that you test in between, yeah. You know, it's a matter of space. We do what we say in our papers: we like to see the trend of learning, right, and we have a fixed test set, so every time we learn a new thing we test the whole system on everything, so that you can actually see that trend going up. There are other people showing at least the accuracy based only on the things seen up to that particular point in time, so it's...
E
...different, let's say, continual learning strategies plus their own regularizer for sparsity, to see how it goes. There are a couple more pieces of information here, and then another, different task. People were complaining about the fact that they are actually using an oracle telling them, both during training and test, that this is a new task.
E
These images belong to this particular task; those belong to that new task. So you can essentially listen to the oracle during training, and they were using that. So the reviewers asked them to integrate a new experiment, what we call a new-instances scenario: the idea is that you have all the classes from the first batch, but then over time you encounter essentially new instances of the same classes.
E
On CIFAR-10 you essentially have a training set with ten classes, I think.
E
CIFAR-10 is not sequential; it gets turned into different sets learned sequentially, but they're totally independent, yeah. That was something I was complaining about the other day, in the sense that they're saying you need an oracle telling you: now you have to solve this task out of these ten. It's not a single task where you say: okay, whatever I give you, you have to classify among 100 classes; instead we say: now we are showing the second task, with these 10 classes, here are the images.
B
In this method, subsequent tasks are learned using the inactive neurons and features of this part of the network, without causing deterioration of the performance on previous tasks. The whole idea in those methods, elastic weight consolidation and the other regularization methods, is that you always have some overlap; in this case, you're looking for a method that has zero overlap, so you could learn a new task without causing deterioration.
B
Yeah, so this is interesting: he talks about forgetting; that's an interesting part, the concept of graceful forgetting. The idea is that it's preferable to forget in a controlled manner if it helps us regain network capacity and prevent an uncontrolled loss of performance. And he shows empirically that his method, continual learning via neural pruning, leads to significantly improved results over current weight-elasticity-based methods. It's a very good paper; it's a pity to do it in ten minutes.
B
So he explains the difference, I don't think in this figure, but he explains the difference between sparse weights and sparse activations. In this paper he's using sparse activations instead of sparse weights, yeah. And here's the method; actually this paragraph explains everything, so, alright...
B
He divides the weights into three parts. One part is the active weights: they connect active nodes to active nodes, and he's not going to touch those. Then he has free weights, which are weights that connect active nodes to inactive nodes, or inactive to inactive. These are free: if you change those, they're not going to impact the first task anyway.
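A sketch of the weight partition being described, in the spirit of Continual Learning via Neural Pruning; this is my reading of the three groups, not the paper's code, and the labels are illustrative:

```python
def partition_weights(in_active, out_active):
    """Classify each connection of a layer after training a task.

      'used'         -- active input -> active output: frozen, since
                        these weights encode the learned task
      'free'         -- output neuron inactive: retraining these cannot
                        change what the active part of the network computes
      'interference' -- inactive input -> active output: changing these
                        would alter the activations of used neurons

    in_active, out_active -- booleans per input / output neuron.
    Returns a label matrix of shape (n_out, n_in).
    """
    labels = []
    for out_a in out_active:
        row = []
        for in_a in in_active:
            if not out_a:
                row.append("free")
            elif in_a:
                row.append("used")
            else:
                row.append("interference")
        labels.append(row)
    return labels
```

Only the 'free' weights are then retrained on the next task, which is what gives the zero-overlap property discussed above.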
B
The active connections, I think he calls them... no, wait, let's see. These are the ones in blue; next you have weights which connect any node to inactive nodes, which he calls free, so any node to an inactive one goes in this group, and then he differentiates them here at some point, right.
E
I mean, I don't know if this is creating confusion, but I think "active" in this sense is not that they are firing all the time, just that their modulated activity can lead to the recognition of a particular pattern. And, let's say, the blue neurons don't have to fire together; they are just allocated to recognizing several patterns of the first task. Right now you don't want a new pattern to grab them; the idea there is to keep training the entire network.
E
The problem, as we'll see later when we reach that point in the experiments, is that you quickly saturate the network this way, because you essentially freeze it, and also because when we prune, compared with other sparsity techniques, you don't reach the same level of sparsity, unless you know...
E
I think it's nice: you get a very decent representation without forgetting. But, I mean, also consider that it is quite a small problem, and after the fourth batch I've seen that you reach the saturation of the network, so I don't know if this is really scalable, and I like more...
E
Yes, exactly; then they introduced this idea of graceful forgetting, and yeah, I don't know exactly what they change at that point, but essentially they start to change some connections, right, yeah.
B
So I'll move on to the text; there's a nice trick, kind of an aside, but we can go over this plot; having this plot makes it easier.
E
It's definitely interesting, but I think the way they use each neuron is not really efficient. This is what you do when you don't want to have interference: in this formulation you are actually freezing everything, and then you're not reusing the same neurons for...
D
...right, setting up the training for lower levels and basically borrowing from those only what it absolutely has to. So rather than, say, random numbers in general, you're saying: I'm training on something in order to train something else. By and large, after all these tasks, these things are really important, so we'll hold onto them; then you've got to dive down deeper, you know.
E
That's really cool; so it's exploring the weight space. Yes, imposing some priors, clearly, yeah. And there are other observations we have to keep in mind: when you do a multitask joint training of the same network on all these tasks, you can fit everything. So, theoretically, all this information can fit within these limits, and we shouldn't reach these saturation issues at all, well...
D
...showing how you kind of expand into these corners of this hypercube, you know, the space. Mostly the strategies are asking: how do we get to the point where these things are sufficiently disambiguated, so that we actually move smartly and efficiently in the space; these are search strategies, yeah. So if you have some principled way of rapidly expanding into the space in a way that preserves the ability to disambiguate things, yeah, yeah.
B
Thinking of it as an approach toward continual learning using extra capacity: in these cases the networks are small, but I think it's a useful way of learning. If the networks were a lot bigger, maybe you could extend the same thing a hundred or a thousand times, I thought. For me it was a good idea.
E
Maybe the interesting part is how they solve the problem that it's difficult to learn sparsely from the start. In this case they solve it by saying: you can use all of the network and then prune it later, correct. So you use the full network first, because it's really tough to learn already sparse.
B
It's a different approach: not using a regularizer. There used to be two approaches: either you use a regularizer, or you do some sort of architecture search, you add something to the architecture; some works even do both at the same time, really both architecture search and regularization. This case is different, because it's neither of the two: it's not that they add the regularization, and it doesn't work like architecture search in the sense of adding to the network. So it's like a third approach.
E
...allocated an internal network for each task, so it was very difficult and complex and not scalable, rather a collection over all the possible weights. It was exploding in terms of dimensions. But if you think about it, it's what they do: you could think of it as, every time you encounter a new task, you train the network, you do the routing again, and then that's a new column and you move on. So I see it as a similar...
D
I mean, there are certain games I play that, when I haven't played them in a while, it takes me a while to kind of get back to doing that, because presumably the capacity was harnessed for other things, but I haven't lost it: sufficient plasticity remains, like about 90% of what I was, yeah. So it's that kind of behavior I'm looking at, and I'd like to see it reflected at a higher order.
D
You know, but it should at least have this property, so that you don't get into a bind where you're literally at full capacity and nothing more; there's got to be some kind of trade-off in these things. Now, the schema where you say: hey, in reality we have this huge net where I can remember five... how do you want to represent that?
D
...like this thing, because so many of these approaches that use regularizers are using global priors that don't reflect the experience. There's an assumption that this thing somehow has a limited capability, but it is something systemic: what I've learned is what's wired into the network, and operating on that level means I'm truly affecting what was learned, you know, and using a fairly simple principle: what is useful, hold on to; yeah, yeah, repurpose the rest.
A
Oh, it's broken, whatever, I don't know. Anyway, thank you for watching. I think there will probably be another research meeting on Monday, which I'll live-stream; I think Marcus is going to do something. I think there will probably be a research meeting on Wednesday; I might even be talking about something then, and I'll plan that out; and then definitely something on Friday. So I hope you enjoy these meetings.
A
Please do me a favor: like I said, like and subscribe. That's the best way to support what we're doing right now. We're live-streaming all the research meetings, all of our journal club reviews, and the best way to help us is to like and subscribe, so please take a moment out of your time and do that right now, and I will leave you alone. Here come some chats, everybody all of a sudden, all saying the same thing; yeah, you're very welcome. I'm really happy to be doing this sort of thing; I love being transparent about scientific discoveries and research.
A
I think it's a wonderful way that we can just put all of this out there for anyone to consume and build upon and try to work with and understand. We want to understand how the brain works and share everything that we've learned so far; that's what we do here at Numenta. Take care. I'll see you guys on HTM Forum; otherwise, I'll see you live-streaming, probably Monday. I probably won't be streaming the rest of the day, so have a great weekend.