Description
Numenta Journal Club reviews: https://arxiv.org/abs/1712.01312
Discussion at https://discourse.numenta.org/t/learning-sparse-neural-networks-through-l0-regularization/6471/10
So I've been looking at this paper, "Learning Sparse Neural Networks through L0 Regularization" by Louizos et al. A couple of these authors are the ones who created variational autoencoders; that's sort of an aside, but it has some overlap with this. The title says L0 regularization, but I'm not going to talk about it in those mathematical terms.
I'm just going to say: they count how many synapses there are, or how many nonzero weights there are, and the network pays a penalty for having nonzero weights. Their overall goal is neural networks where most of the weights are exactly zero, which is sparsity by definition. The motivation is, I would say, primarily computational efficiency, and also preventing overfitting.
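To make "count how many nonzero weights there are" concrete, here is a minimal sketch of the L0 penalty; the weight values are made up for illustration:

```python
# L0 "norm": the count of nonzero entries, which is what the paper penalizes.
# Unlike L1 or L2, it ignores magnitude: 0.9 and 0.0001 cost the same,
# and only an exact zero is free.
def l0_norm(weights):
    return sum(1 for w in weights if w != 0.0)

weights = [0.0, -1.3, 0.0, 0.02, 0.0, 0.7]
print(l0_norm(weights))  # -> 3
```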
On the efficiency side: with sparse connectivity, when an input comes into the network and feeds forward through it, fewer floating-point operations happen. They're lowering that cost. So that's the general idea, and I'm going to jump around a little bit.
I'm going to talk about what their solution looks like, and then take a step back and explain how they got there. By the way, this presentation is focused on laying out and describing the method, so we can evaluate it on those terms.
C
I'm,
not
gonna,
put
it
into
context
of
other
methods
for
sparsa
fication
I
want
to
study
some
more
before
I
before
I
tell
how
well
this
method
performs
relative
to
others.
I'm
not
ready
to
talk
about
that,
but
but
their
model
itself
is
I,
think
it's
elegant
and
nice
and
we're
looking
at
so
but
I
plan
on
doing
a
little
more
with
this
understanding
how
it
fits
the
broader
context
after
this.
If I were to over-compress and simplify this: what they do is a little bit like having permanences on connections, except the permanences are probabilistic. You can imagine something like our permanence, where we evaluate how connected two cells are, and if it's above a threshold they're connected. This is a little different, though: it's not a hard threshold.
It's like: if the permanence is 0.5, half the time the synapse will be connected and half the time it won't be. And just to be clear, they never use the word permanence; that's just the word I'm overlaying on this. To say it more formally: each of the weights, at each time step, has a stochastic element to it. There is the actual learned weight, which they denote in their own notation.
The paper uses theta because they want to talk about parameters generally, but I've switched to W since I want to talk about weights. The learned weights are there, as they always are in neural networks. Sometimes we talk about getting rid of weights entirely; that's totally orthogonal to this, and it's not the goal here. So they have learned weights like any network, but they also have this extra factor between 0 and 1.
I've drawn this extra factor between 0 and 1, the z's, as sort of a gate, and I've drawn a picture of the type of probability distribution they use. With some probability, z is going to be exactly 0; with some probability it's going to be exactly 1; and there is also some possibility that it's somewhere in between.
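That distribution, with point masses at exactly 0 and exactly 1 plus a continuous part in between, is what the paper calls a hard concrete distribution. Here is a minimal sampling sketch, assuming the stretch-and-clamp construction with the constants I believe the paper suggests (gamma = -0.1, zeta = 1.1, beta = 2/3):

```python
import math
import random

def sample_hard_concrete(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """One draw of a "hard concrete" gate z in [0, 1].

    A smooth sigmoid sample is stretched to (gamma, zeta) and clamped,
    so z lands on exactly 0 or exactly 1 with finite probability."""
    u = random.random()
    s = 1 / (1 + math.exp(-(math.log(u) - math.log(1 - u) + log_alpha) / beta))
    return min(1.0, max(0.0, s * (zeta - gamma) + gamma))

random.seed(0)
draws = [sample_hard_concrete(log_alpha=0.0) for _ in range(10000)]
# With log_alpha = 0 the gate is symmetric: a chunk of mass sits at
# exactly 0, a matching chunk at exactly 1, and the rest in between.
print(sum(d == 0.0 for d in draws), sum(d == 1.0 for d in draws))
```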
If having this connection is not advantageous, the distribution will shift over time toward the zero end, so that it most often comes out 0. If it is advantageous, the distribution will shift the other way. So they're actually modeling the connection itself as a probability distribution.
Interesting, because I think this is closer to the biology than what we did. We said, okay, let's just make it binary with a threshold, but this is probably closer to what's actually happening. It's not clear what the advantage of doing it this way is, though. The advantage seems to be these in-between states, where a synapse is not important yet; those in-between states are, in my mind, part of where you're going to go with this paper.
Now, in principle, for inference they could just sample a random set of z's, sample the mask once, and use that for every inference. There's no inherent advantage to having inference consist of multiple stochastic passes, so they could just sample it once, it seems.
One of the constants they didn't train, though they could have, is the beta; they keep it constant, and it's what causes the curve to have this shape. So now I can move on to what caused them to land on this solution. I'm going to do it in two parts.
First, why probabilistic at all? Why make z stochastic? Here I drew a little vector sign above it; that means it's a mask. Where z before was the gate on an individual weight, the z vector is the whole mask. So why make the connection matrix, the connection vector, a random variable? The reason is this: stepping back to the original purpose, in training these networks they want a cost function that gives a big reward, a big negative cost, to having a weight that is truly zero, a connection that is zero.
Yeah, it's a negative cost, so the point is it wants to reward zeros, or equivalently punish nonzeros. I'll describe it that way: since we've set it up as a cost, we want to punish nonzeros. The key thing here is that they specifically don't want to punish a weight of 1.0 any more than a weight of 0.05. It's a discrete reward: if you're 0, you get the reward; if you're not 0, you don't. That is inherently a very non-smooth cost function.
It's a very discrete cost function, and this whole approach of using infinitesimal gradients to figure out the best network just doesn't work with those kinds of functions. The general way to solve that is to make the mask, the connectivity, a random variable. It would look something like this: take a Bernoulli random variable, which is discrete, either 0 or 1, and with some probability pi it's 1.
Then consider the average cost, the average error you could call it: what is the average error of this network under these probabilities? Now, if you change the probability on one specific synapse just slightly, this cost will vary smoothly, because you're just moving the probabilities a little bit. An infinitesimal change here causes an infinitesimal change there.
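To see why making the mask random smooths things out, here is a toy sketch; the three-connection "network" and its per-mask losses are invented for illustration. Each individual mask has a discrete loss, but the expectation is a polynomial in the pi's, so a tiny nudge to one pi produces a proportionally tiny change in expected loss:

```python
from itertools import product

# Toy "network": 3 gated connections; loss depends only on which gates are on.
# Any per-mask loss table works -- this one is made up.
def loss(mask):
    return {(0, 0, 0): 9.0, (1, 1, 1): 1.0}.get(mask, 4.0)

def expected_loss(pis):
    """Average loss over all 2^n masks, weighted by Bernoulli(pi) probabilities."""
    total = 0.0
    for mask in product([0, 1], repeat=len(pis)):
        p = 1.0
        for bit, pi in zip(mask, pis):
            p *= pi if bit else (1 - pi)
        total += p * loss(mask)
    return total

# The per-mask loss is discrete, but the expectation varies smoothly in pi:
base = expected_loss([0.5, 0.5, 0.5])
nudged = expected_loss([0.501, 0.5, 0.5])
print(round(nudged - base, 6))  # -> -0.002
```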
So by this pair of steps, choosing z randomly and using the expected value, where you weight each mask by its probability, you get a smooth cost function. The problem is that this is not tractable to compute: it's a summation over all possible masks. If you have a thousand synapses, that's a summation over 2^1000 terms.
Well, I'm going to move on. Rather than expanding that sum out, you can approximate it by just sampling a few of these z's: drawing a few likely masks from the distribution and averaging the cost across those. This is the notation that's often used when you're sampling something: take a few samples and average. And they go a step further.
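Sampling instead of summing can be sketched like this, with the same kind of made-up toy loss as before (nothing here is from the paper's code): draw masks from the Bernoulli probabilities and average, which approximates the full 2^n-term expectation with however many samples you can afford:

```python
import random

def loss(mask):
    # Made-up per-mask loss for a 3-connection toy network.
    return {(0, 0, 0): 9.0, (1, 1, 1): 1.0}.get(tuple(mask), 4.0)

def mc_expected_loss(pis, n_samples, rng):
    """Estimate E[loss] by sampling Bernoulli masks instead of summing 2^n terms."""
    total = 0.0
    for _ in range(n_samples):
        mask = [1 if rng.random() < pi else 0 for pi in pis]
        total += loss(mask)
    return total / n_samples

rng = random.Random(0)
est = mc_expected_loss([0.5, 0.5, 0.5], 20000, rng)
print(round(est, 2))  # exact expectation is (9 + 1 + 6*4) / 8 = 4.25
```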
So, okay, that is their approach: they sample a random mask at each time step. The problem now is that if your z's look like this, discrete Bernoulli draws, and you want to train these pi's, to figure out whether each pi should be larger or smaller, you're back to a discrete cost. Say you choose a random mask and then ask: how do I want to change my probabilities?
Yes, and that's where they make it continuous. As you vary pi, the parameter controlling what z is going to pop out, if there is this continuous in-between zone between the two discrete points, then changing pi has a smooth effect rather than no effect.
So the question is: what we really want is for it to gravitate toward exact zeros, disproportionately. That's the optimization technique, and they go through all of this. But at the end of the day, have we lost that property? Is it still really encouraging true sparsity, or is it going to get stuck in a lot of these in-between states?
I can answer that in two ways. One answer is that I should experiment with it directly and see what these z's end up looking like as I train it; that may give answers I don't have. A slightly more satisfying answer comes from looking at this world of probabilities.
What the network is punished for is how much of its probability mass is anywhere over here, anywhere nonzero. Any mass over here, partial or fully on, pays the cost, and mass that sits exactly at zero pays nothing. So there is a strong incentive for this to be exactly zero, because any mass sitting just off zero pays the same cost as mass sitting all the way over at one.
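That incentive can be made quantitative. For a hard concrete gate, the expected-L0 penalty per connection is just the probability that the gate is nonzero, which has a closed form; here is a sketch checking it against sampling, with the stretch constants assumed from the paper as before:

```python
import math
import random

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def gate_open_prob(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Closed-form P(z != 0) for a hard-concrete gate: the per-connection
    term in the expected-L0 penalty. Mass exactly at 0 is free; any mass
    above 0, partial or fully open, pays this probability as cost."""
    return sigmoid(log_alpha - beta * math.log(-gamma / zeta))

def sample_gate(log_alpha, rng, beta=2/3, gamma=-0.1, zeta=1.1):
    u = rng.random()
    s = sigmoid((math.log(u) - math.log(1 - u) + log_alpha) / beta)
    return min(1.0, max(0.0, s * (zeta - gamma) + gamma))

rng = random.Random(1)
draws = [sample_gate(-2.0, rng) for _ in range(20000)]
frac_nonzero = sum(d > 0 for d in draws) / len(draws)
# The closed form and the empirical fraction of nonzero draws should agree.
print(round(gate_open_prob(-2.0), 3), round(frac_nonzero, 3))
```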
So I would say the vast majority of the mass in this network ends up on zero. Now, is a lot of it lying in this in-between region, as opposed to on one? I don't know, so that I can't say. Did they give any numbers in the paper on this? Do they talk about how well they actually achieve sparsity?
Yeah, they plot it against other techniques. They show that the number of parameters, the number of effective weights, is shrinking steadily over time, and they compare it to other techniques. But that's the best I can say right now. I can't tell you something like "they achieved 5% sparsity"; I'm just making that number up. So that's more on the results; I'm still getting a good picture in my head.
Yeah, maybe the last thing I was pointing out: when I looked at this, I was a little worried that these in-between values would stick around, that the network would somehow store weights there, like a half-connected version. But I don't think it's going to do that, because an in-between gate is so unreliable; it's always hopping on and off.
So, okay, the final part I'll talk about. I said early on that it seemed kind of magical to me that you could train a probability distribution, and I've been phrasing it in ways that make it appear magical. There is a straightforward way to do this; once you know it, it's obvious. It's elegant, and it's used elsewhere; it's not just from this paper. Other papers call it the reparameterization trick.
This is so widespread that you'll even see it in the titles of papers talking about reparameterization, and it's clever. So here's the question, the way I've drawn this. Sometimes when people depict backpropagation, they draw the gradient passing back through the network you train, and you use it to figure out how to change the weights. It was like black magic to me: how would you take your gradient and update this curve, this distribution?
What does that even mean? The answer is: if you can sample this distribution by first generating a plain random number and then transforming it, you're in business. By the way, it's just a coincidence that I'm using 0 and 1 here and also 0 and 1 there; the noise could have been a random number between 47 and 49.
If you can start with that raw random number and then do a bunch of math to get your final result, it's still the case that backpropagation, when it arrives at the random part, can't do anything with it. But because you've set it up this way, the gradient comes back, reaches the deterministic math, and can update your parameters there.
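A minimal sketch of the trick (my own illustration, not the paper's code): push all the randomness into a plain uniform draw u, make the sample a deterministic function of the parameter given u, and then the derivative of the sample with respect to the parameter is well defined; a finite difference with the same u fixed shows it.

```python
import math
import random

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sample_z(log_alpha, u, beta=2/3):
    """Reparameterized draw: all randomness lives in u ~ Uniform(0,1); the
    sample is then a deterministic, differentiable function of log_alpha."""
    return sigmoid((math.log(u) - math.log(1 - u) + log_alpha) / beta)

rng = random.Random(0)
u = rng.random()  # sample the noise once...
eps = 1e-6        # ...then probe d z / d log_alpha with that noise held fixed
grad = (sample_z(0.0 + eps, u) - sample_z(0.0 - eps, u)) / (2 * eps)
print(grad > 0)  # -> True: nudging the parameter smoothly nudges the sample
```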
So anyway, that's how the model works, and in the end I think it's pretty simple. It's these, what I'm calling, probabilistic or stochastic synapses, where most of the time most of the synapses are going to have weight zero and sometimes they won't, and that's how it explores: should I form a connection here, should I get rid of this one, and so on. This strategy feels like a natural result.
C
If
you
think
about
the
problem
for
long
enough,
how
you're
going
to
train
sparse
networks,
it
feels
like
this
is
where
you're
gonna
arrive.
If
there's
like
one
of
like
a
logical
conclusion
of
almost
like,
they
discovered
this
rather
than
and
then
fit
it,
so
it
just
feels
next
I,
like
it
train
spars,
you.
One thing about this method that makes it especially nice: a lot of sparsification methods involve training a dense network and then pruning it into something else, whereas this one is, during learning, just continuously getting rid of weights, which is actually maybe more efficient. It also starts out in a nice way: kind of probabilistically sparse. A lot of these gates start out kind of off, maybe mostly off, and then the ones that are useful come on.
Sparsity, to the degree it's enforced, gives you a very different network; it's a totally different way of representing information. With all this talk about overlap problems, it just seems like we might end up with a different set of solutions if you start over with that as your assumption.
Yeah, it would really help to know, even in the ones they tried, how sparse they got. Just looking at their charts it's a little hard to tell. They're comparing FLOPs, and they never really give the factor; you can kind of guess, maybe it's a third, or a quarter, or ten to twenty percent.
I feel like they're trying to say: okay, this one beats a fully connected network of all these high-precision weights. But with what we build, I wonder how to break the symmetry of that comparison. If you start with a system where your connectivity matrix is inherently sparse to begin with, and everything is set up around that, then you would have broken the symmetry of the system in a way that's never going to happen here.