From YouTube: Backprop-Trained Permanences (NRM Feb 10, 2020)
Description
Numenta Research Meeting, with Marcus Lewis presenting a writeup on Backprop-Trained Permanences. See it at https://github.com/mrcslws/nupic.research/blob/backprop-structure/projects/backprop_structure/documents/backprop-permanences/backprop-permanences.pdf
Discussion at https://discourse.numenta.org/t/backprop-trained-permanences-nrm-feb-10-2020/7166.
Yeah, thumbs up, looks like we're going; it's working. This shouldn't be too long, unless it becomes long, because I'm covering something I've talked about before in these recorded meetings, just with a bit more in the way of results and more of a summary.
So I have this writeup here that goes into detail about all of this, which means I don't have to go deep into the details here; I'll just tell you the broad themes. This part is called achieving sparse connectivity.
It computes the effective weights by multiplying the stored weights by this gating variable, and to connect this directly to permanences, the whole thing works if you use the rule where, if the permanence is less than 0.5, the gate zeroes out the weight, and if the permanence is greater than 0.5, it lets the weight through: the connection exists.
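As a rough sketch of that deterministic rule in PyTorch (the naming here is mine, not taken from the writeup):

```python
import torch

def gated_weight(stored_weight: torch.Tensor,
                 permanence: torch.Tensor,
                 threshold: float = 0.5) -> torch.Tensor:
    """Effective weight = stored weight * binary gate.

    The gate is 1 where the permanence is at or above the threshold and 0
    otherwise, so a synapse with a low permanence simply does not exist.
    """
    gate = (permanence >= threshold).to(stored_weight.dtype)
    return stored_weight * gate
```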
For most of the results in this paper, I used an alternate stochastic version of this, where the permanence acts more like a probability: the synapse works with probability equal to theta. But the deterministic version can work too; you just have to train it a little differently, you have to throw in some other kind of dropout, but that also works.
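Roughly, the stochastic variant replaces the hard threshold with a Bernoulli sample. This is only a hedged sketch of the forward pass; the writeup also has to handle the gradient through this sampling, which isn't shown here:

```python
import torch

def stochastic_gated_weight(stored_weight: torch.Tensor,
                            theta: torch.Tensor) -> torch.Tensor:
    """Each synapse is enabled with probability theta (its permanence).

    During training the gate is sampled, so the permanence behaves like a
    per-synapse keep probability rather than a hard threshold.
    """
    gate = torch.bernoulli(theta.clamp(0.0, 1.0))
    return stored_weight * gate
```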
So these are like permanences in the sense that they're used in exactly the same way, during inference for example, but they're trained a little differently. Rather than being based purely on a rule like "this one fired before that one," they're trained on whether the synapse was useful, on whether the synapse helped successful classification. At the core, that's what I'm doing here, and I've written up how to do this, how to train these.
This kind of gating has also been used, for example, for binary weights, where a weight either exists or doesn't exist. I've combined it with weights so that I can use it as a sparse affine tool. Okay, so that's what I've done here. I've also experimented with the binary weights thing and it works, but that just wasn't the task here. So now I'm going to show you 2.1, and then I'll talk about it in terms of the cartoon version. I'm going to scroll down; I talk about the model and everything, and so on.
A
The
left
here,
Google
speech
commands
just
to
talk
about
these
charts.
I
have
accurately
thought
it
on
the
left,
as
you
can
see,
I'm
zoomed
in
way,
unlike
the
90,
some
percent,
accuracies
and
number
of
weights,
is
on
the
bottom
and
that's
on
a
log
scale.
So
moving
to
the
left
is
a
pretty
dramatic,
sparsa
fication,
a
pretty
dramatic
reduction
in
cost
I'm
plotting
here.
The
blue
thing
is
my
me:
are
my
results
basically.
That's still an open question. Okay, so I want to compare this to, for example, something called variational dropout, and also to the older techniques of plain weight pruning: if you just prune out the weights, how sparse can you get, and what have other people gotten? So really, everything I'm going to talk about revolves around these charts, or the cartoon version of them. So here, yeah, I'm just showing the same thing, where the dynamic version can do better than the static one.
Do we especially expect this line to be better? You could imagine that sparse-to-sparse might be somewhere in between these, and we'd be trying to get as far over as we can. That's at least what I would expect. Now, I do want to draw attention to the fact that, and I don't know whether I expected this, dense networks still do get better results, so this does involve giving up some accuracy.
Here's the same chart, just a little more conceptual. Right now, most of us have been working on this first part: reducing cost with a little bit of a reduction in accuracy. At least in my mental model, well, first of all, if this was all we were doing, just this first part, I'd be a little bit uninspired, because in my mental model I'm also looking forward to the next step: we can now essentially run larger networks, larger sparse networks, and they can actually outdo the original one. I'm seeing this as the first stage of then being able to reinvest the savings, yeah, and do better. Another way I could have tried it, you know, we could...
Question: you may have said this, but on the x-axis, is that cost, or number of connections, like log-scale weights or something? Because in your manuscript you make the point that the way this works is that there's a weight decay that pushes against the existence of a weight, and the permanence that pushes for the weight to exist, and then it becomes a balance between that decay and the push to exist; I could be misremembering what you're talking about. I think you make the point that this obviously works to reduce the number of weights that exist, but not necessarily the cost directly, in the sense of every weight that you have. If you wanted to actually optimize the overall compute cost, rather than the number of weights, which is an indirect way of optimizing compute, how would you do that? Well...
There's a table here that goes into detail about which layers achieve which sparsity, and so on, and one of the points I make in the discussion is that if you look at the number of weights, so this top table, the vast majority of the weights are up in the fully connected layers. But if you look at the number of multiplies, the vast majority of them are in the convolutional layers. So, the answer to the question you're asking, of how you optimize for multiplies:
One thing I didn't mention before is that the way I'm sparsifying is that the permanences are always decaying over time. I just didn't mention that explicitly, which is misleading; these permanences are always decaying. All we need to do is make the permanences in different layers decay at different rates. That would prioritize the convolutional layers, and that will really be the way to reduce the number of multiplies, right.
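A hedged sketch of what per-layer decay could look like; the layer names and decay values here are made up for illustration, and in practice the actual rates would come out of the hyperparameter search described next:

```python
import torch

# Hypothetical per-layer decay rates: decay the convolutional layers'
# permanences faster, since that's where most of the multiplies are.
decay_rates = {"conv1": 1e-3, "conv2": 1e-3, "fc1": 1e-4, "fc2": 1e-4}

def apply_permanence_decay(permanences: dict, decay_rates: dict) -> None:
    """Shrink every permanence toward zero at its layer's rate.

    Synapses whose permanences are not pushed back up by the training
    signal eventually fall below threshold and disappear.
    """
    with torch.no_grad():
        for name, perm in permanences.items():
            perm.sub_(decay_rates[name])
            perm.clamp_(0.0, 1.0)
```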
Obviously, that doesn't automatically mean a layer should have, say, ten times the decay, right? No. What it means is that I'm going to have a hyperparameter search, and I'm going to use multiplies as my optimization target, and then I'm just going to let it set whatever permanence decays it chooses, and it's going to optimize that for me. My hyperparameter search is going to do all that thinking for me. Yeah, yeah.
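As a sketch of that idea, with a made-up search loop and a hypothetical train_and_eval function standing in for a full training run, the search might score each trial by accuracy while penalizing multiplies:

```python
import random

def search_decay_rates(train_and_eval, layer_names,
                       num_trials=20, penalty=1e-9):
    """Random search over per-layer permanence decay rates.

    train_and_eval(decays) is assumed to train a network with the given
    decay rates and return (accuracy, number_of_multiplies).
    """
    best_score, best_decays = None, None
    for _ in range(num_trials):
        decays = {name: 10 ** random.uniform(-5, -2) for name in layer_names}
        accuracy, multiplies = train_and_eval(decays)
        score = accuracy - penalty * multiplies
        if best_score is None or score > best_score:
            best_score, best_decays = score, decays
    return best_decays
```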
I guess, but wouldn't it be cool to see, assuming it helps a little bit, to close that loop? The way I would think of that is, instead of optimizing to lower the cost, you immediately reinvest all the savings into additional layers, or wider networks, and then see what you get, you know, and then you can optimize back. You see what I mean?
Ultimately, in the world of machine learning, that's probably not how it goes, but you know, that's the way the machine learning world works today: 0.01% better looks like a win, though whether that's how it really works is another question. We also don't really know what the actual ultimate possible performance is. Maybe these networks are already there; maybe you can't improve them any further.
One thing I'll just point out is that this is a little bit of a different kind of experiment than we often do. Often, when we have an x-axis and a y-axis like this, these dots essentially float around, and I define different sets of parameters to find something that lands around this far along the axis, and then you find other ones around that vicinity. So it was a different kind of difficult hyperparameter search, where I was just choosing all these things to find this cloud of points and show you the best ones, the points that were optimal by some definition of optimal, which may not be typical of things we've done here.
But for inference, for instance, do you still do the sampling? Very important question, and yeah, that's pretty much only answered in the code, so yeah. It's always useful, when you have something like this, for your inference to still be deterministic, so you don't want to sample randomly. What I found worked best, and I found this months ago and I've just been sticking with it, I assume it still works for us:
If you sum all these thetas, you get the expected number of synapses on this unit, so what I do is sum all of these to get that expected number of synapses, and then I select the top k. Okay, so it's like saying k equals that sum; that sum is the expected number of enabled synapses, and then it's choosing the k synapses that are, kind of, the best.
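A minimal sketch of how I read that rule, applied per output unit (the function name is mine, not from the code):

```python
import torch

def deterministic_inference_gate(theta: torch.Tensor) -> torch.Tensor:
    """Pick a deterministic set of synapses for inference.

    k is the expected number of enabled synapses (the sum of this unit's
    thetas, rounded), and the k synapses with the largest thetas are kept.
    """
    k = min(int(theta.sum().round().item()), theta.numel())
    gate = torch.zeros_like(theta)
    if k > 0:
        topk = torch.topk(theta, k)
        gate[topk.indices] = 1.0
    return gate
```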
So hey, thanks for watching. Please hit the like button for me, all you guys, the 14 people watching right now or whatever; it's helpful to get likes on the videos. I just want to let you know that I'm not going to be streaming for a couple of weeks, starting on the 19th, because I've got a medical thing I have to deal with, but I will be back, hopefully soon, and back to streaming as soon as I can. Yes, I will put this on YouTube eventually. Sounds good. Thanks, guys, thanks everybody, thanks for watching. I'm going to cut the stream. Take care, see you on the forums; you should go to HTM Forum, join HTM Forum, if you want to chat about it.