Description
Ares Fisher discusses dendrites in machine learning, and reviews the paper “Improved Expressivity Through Dendritic Neural Networks” by Wu et al.
Paper: https://proceedings.neurips.cc/paper/2018/file/e32c51ad39723ee92b285b362c916ca7-Paper.pdf
B: So briefly, on the idea of using dendrites in machine learning: the earliest reference I could find that was also cited by a lot of subsequent papers was this paper by Poirazi and Mel. What they did, basically, as this little figure shows, is that they could approximate the firing rate of a CA1 pyramidal neuron by modeling it as a two-layer neural network with a sigmoidal output — with sigmoidal activation functions. That was their best-fit model, basically, and they're —
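For readers, here is a minimal sketch of the two-layer idea being described — each dendritic subunit squashes its own synaptic sum through a sigmoid, and the soma applies a final sigmoid to the weighted subunit outputs. All weights and sizes are made-up illustrative values, not the fitted parameters from Poirazi and Mel's model.

```python
# Toy two-layer "pyramidal neuron" sketch (illustrative values only).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_subunits, syn_per_subunit = 8, 20                    # hypothetical counts
w_syn = rng.normal(size=(n_subunits, syn_per_subunit)) # layer 1: synapse -> subunit
w_sub = rng.normal(size=n_subunits)                    # layer 2: subunit -> soma

def firing_rate(x):
    """x: synaptic drive, shape (n_subunits, syn_per_subunit)."""
    subunit_out = sigmoid((w_syn * x).sum(axis=1))  # each subunit sums, then squashes
    return sigmoid(w_sub @ subunit_out)             # somatic sigmoid of subunit sum

print(firing_rate(rng.normal(size=(n_subunits, syn_per_subunit))))
```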
B: Like, "oh look" — so basically, the neuron is complex enough that each little dendritic segment is like a little neuron, basically a little artificial neuron. I didn't show the HTM neuron here, because it has little binary things and I thought that would confuse people. But basically, what people have been looking at since is this sort of separation between, like, apical and — they don't usually call it basal, but mainly like feed-forward dendrites that strongly influence the soma.
C: I mean, because in general — well, maybe they meant to be talking about dendritic spikes. But generally you can't get a cell to fire from activating, you know, a section of a — yeah.
C: It's not even firing rate, I guess — okay. So anyway, they were trying to show how these distal dendrites would affect firing rate, not firing, because that little picture where they're showing the sum — you know, the sum in this two-layer network — would imply that.
D: Even on the dendrites — because that was the whole premise behind the HTM sequence memory theory. So all of this was done on models, not actual recordings. Bartlett Mel actually had a paper earlier than this, and then Yiota and Bartlett published this, and this modeling work was really what led to the experimental work that came after them. So these guys actually predicted some of this beforehand, but this is all completely modeling work, as I remember.
C: We were just talking — she was just asking about those alphas. They show them as little dots, bigger and smaller or something; there's some sort of learned weight there. Rob is pointing out that there isn't a real-life parameter like that, so I don't think there's any — well, I would say there's nowhere... because somewhere, someplace, somebody said something, but in general that's not considered; there's no strong evidence that there's a learned weight of a dendritic branch like that. But the alphas are there, so they're made up.
B: So basically, inspired by this, I presume — because they cite it a lot — a number of papers have recently looked at using these separate segments, leveraging them in their models. So this is from a review by DeepMind: they really liked the idea that there are feedback projections going to apical dendrites and feed-forward prediction going to basal dendrites, and —
B: Sorry — for the apical activity, what they focus on is this plateau potential. So the little green thing here is the dendritic — the apical dendrite — voltage. If you have, like, a spike coming in with feed-forward activation, you'll get an attenuated voltage increase in the calcium active zone, basically over here as you go to the apical side. But if you have the feedback and feed-forward combination — basically, if you have activation of the apical dendrites — you get this plateau, and this plateau might have really interesting properties.
B: I mean, it does have really interesting properties. So what Guerguiev and Lillicrap proposed is that this could lead to a much cleaner, much nicer way of separating your error in these more biologically plausible neural network models. So they compare this with previous — excuse me — with previous models, where you'd have, you know, backpropagated — you'd —
B: You'd have feedback connections going from, you know, the output neurons to the hidden-layer neurons, and you'd have, like, an error pathway that separately activated them, and this adds model complexity, because you're adding a lot more parameters to the model. And, I mean, this also adds a lot of parameters to the model, but they don't think that one is biologically plausible, and they think this one is. They think that, basically, having feedback projections from, let's say, your output layer — those can go into the apical dendrite side and then trigger —
C: All these things are drawn really tiny here. So they're saying the feedback — the gray-black one — goes just to the apical one. It just took me a while to realize that that's what's behind it: it's not two layers of neurons; they're showing two dots per neuron, basically, and they're gray dots, right?
B: So they have, like, two separate pathways: a feed-forward pathway and this feedback pathway, as he said, and the feedback pathway goes onto the apical component. And they think this is more biologically plausible than the other solutions, which are having a separate error pathway, or having feedback weights that are symmetric to the feed-forward weights. But —
B: So they recently proposed the apical dendrite as a way to integrate feed-forward and feedback information. The rationale starts with: how do you integrate feedback information to create, like, a local learning rule? And their solution is: you create an apical dendrite layer and an apical dendrite segment.
B: So the other work, by Dr. Senn and others, has, you know, also leveraged the idea of apical dendrites, and they use a form of predictive coding. Basically — every diagram I found of that model was a little bit complex, so sorry, this is the simplest one I found, and it's still not super obvious — but basically what they do is: there's a feed-forward input that comes into the soma — assuming this comes into the basal dendrites, you can ignore this.
B: So basically, the feed-forward weights nudge the soma towards this reversal potential. And then, in the absence of any — imagine you exclude the dendrite for now — in the absence of any dendritic stuff happening, this is all the soma would do: it would try to go to this reversal potential. But the dendrite, with its plateau — and they assume a tight, strong coupling between the dendrite and the soma —
B: The dendrite has, you know, information about the target, and basically, for the soma to now go towards this — for the soma to go towards the reversal potential, which is like at zero activation — it needs to update its weights such that they would do that effectively. So what this really is, is a form of predictive coding, where they compare — you know, they subtract the dendritic from the somatic potential, or vice versa.
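A toy sketch of the flavor of local rule being described — the dendrite carries a prediction of the somatic potential, and its weights are updated from the local mismatch between the two. This is an illustration of the predictive-coding idea only, not the actual equations of the paper; the soma's potential here is a made-up stand-in.

```python
# Illustrative local "dendrite predicts soma" rule (not the paper's model).
import numpy as np

rng = np.random.default_rng(1)
n_in = 10
w_dend = rng.normal(scale=0.1, size=n_in)  # dendritic weights (learned locally)
lr = 0.05

for step in range(500):
    x = rng.normal(size=n_in)              # input seen by the dendrite
    u_soma = np.tanh(x.sum())              # stand-in for the somatic potential
    v_dend = w_dend @ x                    # dendritic prediction of the soma
    err = u_soma - v_dend                  # local prediction error
    w_dend += lr * err * x                 # delta-rule update from the mismatch

print("remaining mismatch:", abs(err))
```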
C: Can I come in with a question? — Yes, when you're done — yeah — oh, okay. Well, I have a basic question, and maybe it's just a more general neural network question and not specific to this thing: I want to understand how the target signal is represented. You have a layer of neurons, you have some output representation you want, and there's a lot of — large — connectivity between the layer of neurons and the output representation, and you've got this, you know, feedback. How is that — I mean —
C: I mean, don't we want to differentially modify each of those neurons in the hidden layer? And how does that differentiation occur? How is the target differentially affecting different neurons in the hidden layer? It's a basic question I don't understand about neural networks — I understand it in backprop, but I don't understand it here. Like that green arrow: how does its meaning differ between each of the hidden units, or doesn't it? What does it look like — what does the signal look like, the green arrow?
C: And then there's an assumption that each — the purple dendrite — has a weight associated with it, so how it affects the neuron is different. Is that right? — Sorry, can you repeat the last part? — I just don't understand: if I'm a neuron in the hidden layer and I'm getting this feedback on my apical dendrite, is it just a scalar value I'm getting, and is every neuron getting that same scalar? Is every neuron in the hidden layer getting the same scalar value?
B: They also get different sensory inputs, so some fraction of the neurons will be able to converge towards minimizing this error with this learning algorithm. And I don't know what they do beyond that — I've only read their implementation of how to do this learning locally in one neuron; I don't know whether they make, like, a big network out of this. Maybe some of you have read deeper into this.
B: So this is equivalent to backpropagation, in that you can actually differentiate it — you know, using this learning objective you can differentiate it easily and leverage the strengths of backpropagation and credit assignment. So this has been pretty interesting work. And then some of the work that Guerguiev and Lillicrap proposed, which I think is later — I think this is 2017, yeah, two thousand seventeen — basically, they —
B: They created this little diagram on the left, with the feed-forward and feedback separation, and the reason they wanted to do that was because they wanted to solve credit assignment in biologically plausible neural networks. Credit assignment is the idea that, given that there's an error in a desired output — you know, I want to reach a desired output and I want to learn: where do I put blame? On which synaptic weights, in which neurons, do I put blame for being, like, "hey —"
B: These Hebbian rules don't provide credit assignment. So they're like: how do we do credit assignment? Let's leverage this dendritic — you know, this apical dendrite — component, and look at the plateaus. They actually use, like, a spiking real-time model — and I mean the plateaus don't spike; sorry, the apical dendrites do spike in the model, but basically they create these plateaus, right. So this is like, you know, at some time —
B: — t, you know, the potential in the apical dendrite evolves, and then they average this in each phase of training. So there are two phases of training here — sorry, I could have ordered this in a much clearer way — so basically you have two phases of training. One is the feed-forward component, which is basically: they put the data in — let's say the image, the MNIST image — they feed it through the network, and this leads to a calcium plateau, right. And then they have the bit where the feed-forward —
B: So there's still feed-forward input, but they also include the feedback input, and this leads to a different calcium plateau. And the feedback — basically, what the feedback does is it creates the instruction signal. The instruction signal is like: this is the plateau you should have if you were perfectly representing —
B: — you know, if you were perfectly contributing to minimizing the error on this task. And basically, because of their setup, they can differentiate this difference between the target plateau and the feed-forward plateau, and so they update the weights, using backprop, to match it. Is everyone clear?
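A compressed sketch of the two-phase scheme just described. This is illustrative toy code only: Guerguiev et al. use a spiking real-time model, and all shapes, nonlinearities, and the learning rate here are made-up.

```python
# Toy two-phase "plateau" training step (illustration, not the paper's model).
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid, n_out = 4, 5, 3
W = rng.normal(scale=0.5, size=(n_hid, n_in))   # input -> hidden (learned)
V = rng.normal(scale=0.5, size=(n_out, n_hid))  # hidden -> output (learned)
Y = rng.normal(scale=0.5, size=(n_hid, n_out))  # output -> apical feedback (fixed)

x = rng.normal(size=n_in)
target = np.eye(n_out)[0]                       # one-hot label

# Phase 1: feed-forward only; the apical plateau reflects the network's output.
h = np.tanh(W @ x)
out = np.tanh(V @ h)
plateau_fwd = np.tanh(Y @ out)

# Phase 2: feedback carries the target -- the "instruction signal" plateau.
plateau_tgt = np.tanh(Y @ target)

# Hidden weights are nudged by the plateau difference (local error signal).
lr = 0.1
W += lr * np.outer((plateau_tgt - plateau_fwd) * (1 - h**2), x)
```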
D: Maybe, as a bit of background — just to be clear, these have not been tried in machine learning. They start by saying backprop is the best you can do, and the goal here is to see: okay, do biological networks — does the neocortex — implement some form of backpropagation? Yeah.
D: Yeah — or maybe they're, you know, trying to understand the brain. This is a way for them to try to understand how the brain might do more complex learning, like backpropagation does. Backpropagation solves this credit assignment problem with deep — with hidden — layers, and they're saying, well, the brain has to solve the same problem. How could it possibly do that? Because pure Hebbian learning won't do it. So, these kinds of structures — they show that this basically approximates what backpropagation will do. Yeah.
D: Some of these papers are doing that so far, and the other direction is to say: okay, machine learning is wonderful, backpropagation is wonderful; the brain must be doing some form of backprop — solving a similar problem — so how could we have biologically plausible networks that solve the credit assignment problem with hidden layers? And that's what these papers are trying to show.
D: But they're not trying to get the machine learning community to use these; they're looking at techniques in machine learning and trying to get the neuroscience community to appreciate them more, saying: hey, this is a way you can understand the brain more. And, of course, they implement the model and they have it running in simulations, but the goal is to have biologically plausible models of backprop. Mm-hmm.
B: Yeah — and there is an implicit goal in this, which is saving things like computational power: if you have neuromorphic chips and analog, you know, low-power computation that you can do with spiking models — and spiking models have this problem with STDP, et cetera, where it's unclear... they don't perform as well. So if you can find some sort of architecture where you can do this with continuous models —
B: But I do find it — especially this model — pretty interesting, because it has some interesting parallels with biological data, even though I'm not sure if the neuron locally calculates a prediction error. But, you know, the fact that you have predictions — and I think Timothy Lillicrap is excited about this idea too — that the feedback connections are predictions, basically. Yeah.
D: No, that is the key thing: it's not directly available; you have to be able to do this in hidden layers. That is the key thing here — they're showing that these local circuits can do the same thing as backprop can do, which is to have good prediction-error signals in hidden layers. Yeah.
B: So basically, what it is, is just maxout with sparse weights. On the left is a diagram of a normal two-layer neural network: fully connected weights and some kind of nonlinearity here. Their network, instead, basically has these dendritic branches, and each one projects to one neuron, right — so each neuron can have any number of these branches. And the other thing they stress is that each branch in a given neuron will receive mutually exclusive input, so there will be no overlap in the input.
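A minimal NumPy sketch of the layer as described: each hidden neuron owns several "branches", each branch sees a mutually exclusive slice of the input, and the neuron's output is the max over its branches — i.e., maxout with structured sparse weights. Sizes are illustrative, and the contiguous input slices here are a simplification of the paper's random mutually exclusive assignment.

```python
# Toy "dendritic" layer: maxout over branches with non-overlapping inputs.
import numpy as np

rng = np.random.default_rng(3)
n_in, n_neurons, n_branches = 12, 4, 3
assert n_in % n_branches == 0
slice_len = n_in // n_branches                 # mutually exclusive input slices

# One weight vector per (neuron, branch), applied to that branch's slice only.
W = rng.normal(size=(n_neurons, n_branches, slice_len))
b = rng.normal(size=(n_neurons, n_branches))

def dendritic_layer(x):
    xs = x.reshape(n_branches, slice_len)       # partition the input
    pre = np.einsum('nbs,bs->nb', W, xs) + b    # each branch's activation
    return pre.max(axis=1)                      # max over branches, per neuron

print(dendritic_layer(rng.normal(size=n_in)))
```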
B: Yeah, it's a particular form of sparsity that they're also really excited about. It's not completely random sparsity — it's not like a random mask. They create an index for each branch, for each of these output weights here, and that's also how they propose to minimize the computational load of doing this: instead of multiplying by one big mask that is, like, number of input neurons times number of branches —
C: Not that it matters here, because this is just a neural network, but that's not the way real neurons work. In real life you wouldn't find a single cell making multiple synapses on a single dendritic segment, but you do find the same axon making multiple synapses on different segments of the same branch. So the exclusivity there isn't by branch; it's by segment of dendrite, and there are many more segments than there are branches. Mm-hmm. And so — I'm just pointing this out; it's not detracting from this model at all, because this model is trying to be a neural network — I'm just pointing out, for those who are interested, that you wouldn't see that kind of segregation. An axon — one of these colored lines from the input neurons — would actually make, often (not always, but often), multiple synapses on different segments of the same branch. You just don't see them making multiple synapses next to each other on the same segment. It just —
C: Same branch — the term "branch" refers to the major segments that attach to the soma, so a pyramidal neuron may have three or four branches, each of which has many segments. It's just a different level of parsing, and again, it's not important for their model — it's just a biological detail. That's all. Oh yeah —
C: And, you know, in our general model we said: oh, we have a soma, we have a proximal input, and then we have multiple segments, and each segment is like an independent little coincidence detector. Our model is not complete either, because the HTM neuron model doesn't try to make any distinction about the dynamics of an individual dendritic branch in terms of all the different places the branch splits.
C: Because the integration zone on a dendrite is not the branch; it's the segment. In the HTM neuron — in the neuron paper — the neuroscience says that the inputs to a dendrite are combined over a fairly short distance: typically they say 40 microns, roughly a thousandth and a half of an inch, and you can have up to, let's say, forty synapses in that distance. So there may be dozens of branches and thousands of synapses, but the integration zone — what would be the equivalent of a little neuron — is only a small section of a branch: a segment. So if you imagine having this one dendritic branch, and it forks out like a tree, and there are hundreds or thousands of synapses all over it: if you just activate several of those synapses scattered across it, they don't sum at all — they only sum —
C: I don't think so — well, I guess you could; maybe they call it that. Maybe that's for me to look at: they're saying, okay, these neurons have two segments, which is very limited. I sort of interpreted it as if they really meant a branch, but I guess you could look at it that way. You could say, for example — I don't have any actual neuron data here, but a real segment on a real neuron sort of maxes out at around forty-two synapses.
G: I have one question. You mentioned that if you look at where the branches diverge — what you're calling branches — the model of how those interact is complicated. Is there a simplifying way of looking at that? I understand the segments represent individual compartments, but when they converge — where the edges actually converge — do we know what happens?
C: Well — the junction is a different impedance match, if you will; a different impedance function at the junction. So if you try to push a voltage gradient through the junction, it gets lost; it doesn't work like it does up to the junction. You can model the dendrite as this leaky pipe, but at the junction all kinds of crap happens, so you see a lot of models about that.

However, our belief — well, our model, and there's evidence for this — is that once you have a dendritic spike, like an NMDA spike, it is able to travel through the junction; it doesn't get highly transformed. So if you're imagining inputs from different branches, or integrations getting combined downstream after going through these junctions — it doesn't look like that's easy, or that it happens; it's just weird.
C: Yeah, well, that's the way we model it, and my point is that some neuroscientists would disagree with that, because they study the complex dynamics of those Y-junctions and they say: well, it's really complex, because we did all these studies. But the idea is that it's not nearly as complex if you assume a dendritic spike. So we model the junction complexities as if they're non-existent. There's other evidence that suggests there are some things going on there.

So there's some evidence that if you start at the very end and activate a dendritic spike, and then activate another dendritic spike partway along, and then another one, the whole signal propagates better than if you do it backwards — so there's an order of preference. We don't model that; we haven't tried to accommodate that at all. So —
B: Right — thanks for that. So, realistic or not, this is their model, and like I said, basically the operation the dendrites are doing is maxout. Maxout is basically this: every neuron, instead of having some fixed deterministic nonlinear activation function such as a sigmoid or a ReLU or whatever, has a number of branches that come in, each with, you know, different input, and it selects the one that has the highest activation. So —
B: This is a little like a learned activation function — and there are people here who are a lot more expert at maxout and other learned activation functions, so please correct me if I say something wrong. What it can do is learn piecewise linear functions, and depending on the number of, quote-unquote, branches you get per output unit, it can become an increasingly complex piecewise linear function. So this is a generalization of the rectified linear unit to other piecewise linear functions. The rectifier is just the max of x or 0, right: if it's smaller than 0, you return 0. But it can end up doing something like the absolute value — for negative x you can get this guy here — so this would be one branch and this could be another branch, and having multiple branches could give you something like a quadratic function. Is this more or less correct?
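A quick check of that claim: maxout takes the max over several linear pieces, so with the right weights (hand-picked here, not learned) it reproduces ReLU and the absolute value, and more pieces give more complex piecewise-linear shapes.

```python
# Maxout as "max over linear pieces"; ReLU and |x| are special cases.
import numpy as np

def maxout(x, weights, biases):
    # One linear piece per (w, b); output is the pointwise max across pieces.
    return np.max([w * x + b for w, b in zip(weights, biases)], axis=0)

x = np.linspace(-2, 2, 9)
print(maxout(x, weights=[1, 0], biases=[0, 0]))           # ReLU: max(x, 0)
print(maxout(x, weights=[1, -1], biases=[0, 0]))          # |x|: max(x, -x)
print(maxout(x, weights=[-2, 0, 2], biases=[-1, 0, -1]))  # rough "bowl" shape
```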
B: Right, that's true, yeah. The difference here is that with k-winners, like in the max-pooling layer we have, there are sparse weights from whatever these are to the neurons, and then the top k of these get through (with k equal to 1). Here they have full connectivity from the branches to the neurons, but there are sparse weights into the branches. So that's a bit of a structural difference that would influence what the —
B: So they partially motivate the idea of having this sparsity — which they stress is different from other types of sparsity that have been tried — by the idea of model ensembling, where, if you combine random features or random components of a model, or something like dropout, you basically get better performance and, hopefully, better generalization.

But though this is similar to dropout, in their experiments they do not compare it to a maxout activation with dropout — and maxout was basically designed to be compatible with dropout; that was the motivation. So, performance-wise: in the paper itself they only show training loss, not test loss — you know, that's only in the supplementary part — but they basically make these graphs where they show the number of branches per neuron, for a given network size, a given layer size, and they compare their model —
B: — the dendritic neural network — with, this is like, layer normalization with a ReLU, or batch normalization with a ReLU, and this is the training loss, basically — so, effectively, how quickly the training loss drops — and they find there's a sweet spot, for this task at least. This is on Fashion-MNIST, and they did a few different experiments: CIFAR-10, CIFAR-100, and then the UCI data sets at the end — and they find, like, a sweet spot in the number of branches, right?
A: So when you plot accuracy, you're kind of thresholding — you're asking what the network's one guess at the class of the digit was; that's one thing. Training loss also captures confidence: if it was really confident in the right thing, it gets a lower loss, and if it got something wrong but assigned 50% probability to the correct answer, then it doesn't get penalized as much.
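A small numeric illustration of that point (illustrative values only): accuracy only checks the argmax, while cross-entropy loss also rewards confidence. Both predictions below are "correct" by accuracy, but the confident one gets a much lower loss.

```python
# Accuracy vs. cross-entropy loss: same argmax, very different losses.
import numpy as np

def cross_entropy(p, true_idx):
    return -np.log(p[true_idx])

confident = np.array([0.95, 0.03, 0.02])   # right answer, high confidence
hesitant  = np.array([0.50, 0.30, 0.20])   # right answer, low confidence
print(cross_entropy(confident, 0))  # ~0.05
print(cross_entropy(hesitant, 0))   # ~0.69 -- same accuracy, higher loss
```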
B: These are, like, layer normalization or batch normalization networks, and one thing you'll find interesting is that these are on the same tasks — these different networks are trained on the same tasks — but the dots are different in both plots. So I'm not sure what the comparison is supposed to be here, because if I were making this plot, I would hold these constant and just show the different activations alongside them for reference. Wait —
F: Why do you think they decided to plot the whole curves for this, rather than holding the number of branches fixed for those ones? Oh —
B: But their ordering is also different — it's not just the relative spacing, right? So it's a little bit strange, but yeah. These models — the diamond and triangle models — don't have a concept of branches per neuron; they're normal fully connected models with one hidden layer, and the number —
C: So earlier you said it was — was it 16 and 64 dendritic branches? But here they're starting at 64, so it's drawing out a higher number. Is that correct — like, the black line is the n = 64 one? ... No, that's not branches per neuron; the branches are plotted down below. Never mind, I've got it — this is the hidden... this is also the hidden layer size. Yeah, I've got it, right: the branch number is down below, and 16 is, like, a sweet spot. Yeah.
B: Instead of, like, max pooling, where you take the highest one or whatever — yeah, yeah. So they find this trend that performance increases with the number of branches per neuron — for maxout, yeah, and for ReLU plus average pooling, it increases with the number of branches per neuron; basically, I guess, you're adding more parameters. And there's a sweet spot for the branches per neuron in their model, and in maxout with these sparse incoming weights.
C: Anyway, according to this — if I interpret it correctly — this model only really does better than the existing models under the scenario of, like, 64 segments: those two lines on the bottom, the dotted blue and the dotted green. Those are the only two that sort of perform comparably. Is that the correct interpretation?
B: The biggest hidden layer, yeah, yeah. And they find that, depending on the branches per neuron, it approximates the performance of these other networks that don't have them. So they don't get better test accuracy or anything in any of their results — but the title is "Improved Expressivity", so that was something I needed to spend a little time looking up: expressivity, or expressiveness. So basically, what a neural network does is this, right?

Let's say you had a non-separable region, like the XOR problem, or these two overlapping curves here — you know, this thing is not linearly separable. What a hidden layer does is twist the space of the problem such that these become linearly separable, right. This post from Chris Olah's blog is really cool — and there was a paper in 2017 that sort of tried to quantify —
B: Oh — how many different stretches you can make, I think, is the idea; basically, how many of these lines you can create. So if on the input layer you have, let's say, one, two, three, four — four units — when you train it, these are the lines through this space (if it's a 2D space, like in this image); these are the lines it can separate along, right. So this one separates —

So for every unit that's active in the previous layer, it will drive a different permutation of these lines in the next layer. So this multiplies the, quote-unquote, expressiveness of the network, because you have a much higher space of separations, right. So for this guy's activation, let's say, one neuron can do this piecewise linear separation, and for a different activation, let's say, the same neuron does this one.
C: Yeah, you know, it's funny — I get that, but I've never understood it. With support vector machines, I said: okay, you've got all these divisions, right. But I always thought it'd be more like: the black lines go all the way across the figure, then the green lines go all the way across the figure, and then the next lines go all the way across. But they're not — each successive one is only drawing a line within one of the previous segments. Well —
C: And I guess I didn't understand there was a progression like this. With support vector machines it was introduced as: oh, we just divided up the space into a lot of, you know, planes, essentially, and then you get a bunch of components. But this is sort of saying — you're saying — there's a progression, a hierarchy of divisions going on here, in some sense.
G: So basically, what they're saying is: the reason why we're winning with the higher — I mean — well, that's one of the things we found with sparsity: with greater width we have, if you wish, more expressive power. So the notion of — when you had that representation where, you know, you can model various functions and you can get, you know, piecewise linear —

Basically, that's saying that with greater width you can have more and more complex, if you wish, separation planes, in some sense; and that has more ability to — in the next slide, where we're showing all those stretches — basically it means you can kind of thread the thing in between those distributions better. Is that kind of fair?
B: Yeah — but the measure of complexity here — this paper has a number of different measures, but one of them is basically this. Let's say you train the network and you present an input, so you go through a trajectory, and one of the measures is the number of piecewise — the number of linear — how do you call it — transitions.

The number of linear transitions per trajectory is a measure of expressiveness. So every time you switch, you know, in the input space, you're going to affect these activations here, and that will lead to different activations here, and then in the purple ones, so you're going to have more linear transitions. So, for example, if your trajectory — let's say you start here and you just go around, right — then at every t you present a different part of this input —
B: — and yeah, so that's what they call expressiveness. And they find that the width isn't actually a very big contributor to this; after some point — there might be a little bit of a trend at the beginning — what seems to be much more predictive of it is network depth. So, assuming nonlinear units in each layer, network depth very strongly predicts the number of transitions — so, quote-unquote, model expressiveness.
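A sketch of how one might count "linear transitions" along a trajectory — an illustration of the measure being described, not the paper's exact protocol. For a ReLU network, the on/off pattern of all units identifies the current linear region; each time that pattern changes as the input moves along a circle, we count one transition.

```python
# Count linear-region transitions of a ReLU net along a circular input path.
import numpy as np

rng = np.random.default_rng(4)
W1 = rng.normal(size=(32, 2)); b1 = rng.normal(size=32)   # first hidden layer
W2 = rng.normal(size=(32, 32)); b2 = rng.normal(size=32)  # second hidden layer

def activation_pattern(x):
    h1 = W1 @ x + b1
    h2 = W2 @ np.maximum(h1, 0) + b2
    return np.concatenate([h1 > 0, h2 > 0])   # which units are "on"

thetas = np.linspace(0, 2 * np.pi, 2000)
points = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)  # circular trajectory
patterns = [activation_pattern(p) for p in points]
transitions = sum(not np.array_equal(a, b) for a, b in zip(patterns, patterns[1:]))
print("linear transitions along the circle:", transitions)
```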
B: So in this regard, this paper outperforms an equivalent maxout or standard ReLU network, by counting these transition counts over a given trajectory. Now, I don't exactly know what the trajectory is — they showed how they did it, as in, you know, you have an input plus a delta — and they point out a theorem, and they prove — well, they demonstrate — that their architecture is also a universal approximator, like a standard one-hidden-layer network; but the transition counts for the same trajectory are higher in their network than in a maxout network with the same number of units in the hidden layer. Now, the thing they didn't do, as I mentioned before, is compare this with maxout that also had dropout. In the other networks they did batch normalization and other tricks that would improve the accuracy of the networks, but they did not do dropout.
B: For the ReLU one — I think FNN stands for fully connected neural network — and for the maxout network, yeah: so for the DNN they have a different branch number, and for the maxout network they increase the number of kernels in the maxout unit. Okay, yeah, that's the difference — the sparsity. So it's like a fully connected layer with maxout, and then for the ReLU — yeah, actually I'm not clear about this; they don't really mention what happened there, so I was curious about this too. I don't think it would be straight — unless this is an illusion caused by the fact that there are curves going outwards, right? Actually, I think it is actually straight, yeah; so I think this is supposed to be a set value, and it's just a visual aid for comparison. Okay — so does it look — does this green line —
B: I think — yeah, I think it might be an illusion; that's kind of cool, right. So that's basically the main driving point of the paper: oh look, we have this higher number of linear transitions per trajectory, so this model is more expressive than standard maxout. But, as I said, they don't compare it with dropout, and they don't compare it with other types of sparsity either — by that I mean sparse k-winners, like how we implement it.

So, in conclusion: I presented a couple of different ways of using, quote-unquote, dendrites in machine learning models. Some ideas are to separate the feedback/error pathways — that's one thing you can leverage dendrites for — or, you know, to increase capacity and expressiveness; and there have been some papers — by, you know, Poirazi and Mel, and from Poirazi's lab — where having a number of nonlinear dendrite segments increases the memory capacity of each unit. So, yeah, with this —
B: Like, structurally? Yeah, yeah — so one of the models I'm working on has similar concepts: it's feed-forward, has a number of dendritic segments per unit, and there are also sparse weights coming into it. The motivation for this was increased capacity: I wanted different branches leading into the neuron, so the neuron is able to respond differently to completely different inputs. So this would be, for example, for continual learning, where you'd have different tasks.

So let's say, here, every branch would correspond roughly to a task, right. And then one thing I wanted to do on top of that is use a type of feedback. This would be, you know, feedback from the output here — but in reality it's just another feed-forward layer, with the input size being, like, the number of classes — or, you know, you can design it however you like. Where's the annotation thing — there we go.

You know, et cetera, you can imagine. The idea being that there'd be some kind of gating or some type of coincidence detection, such that when some feed-forward component or feature co-occurs with the relevant categorical input — let's say, if this isn't clear to people: let's say you're training on task one, and this is the task-one categorical neuron, so task one projects to here and here, right —

— so when this input comes into these dendritic segments and it co-occurs with this categorical input, only then do you drive weight updates, or learning. And the idea is that, because of the sparsity, for each one of these tasks you would get minimal overlap between the dendritic segments they project to, so that you would hopefully have a different population of dendrite segments per task.
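A rough sketch of the gating idea just described — my own toy rendering of it, not a finished model: each unit has several dendritic segments, a categorical task/context vector drives the segments, and the best-matching segment gates the unit's response (and, by extension, which weights would learn). All names, sizes, and the thresholding rule here are made-up.

```python
# Toy context-gated dendritic segments for a continual-learning setup.
import numpy as np

rng = np.random.default_rng(5)
n_in, n_units, n_segments, n_tasks = 8, 4, 3, 3
W = rng.normal(scale=0.1, size=(n_units, n_in))                 # feed-forward weights
D = rng.normal(scale=0.1, size=(n_units, n_segments, n_tasks))  # segment weights

def forward(x, context):
    ff = W @ x                                  # feed-forward drive
    seg = np.einsum('ust,t->us', D, context)    # each segment's context match
    gate = seg.max(axis=1)                      # winning segment per unit
    return ff * (gate > 0)                      # only gated units respond

x = rng.normal(size=n_in)
context = np.eye(n_tasks)[1]                    # one-hot "task 2" signal
print(forward(x, context))
# Learning (not shown) would update W only where the gate is open, so
# different tasks recruit mostly non-overlapping segments and units.
```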
D: There's one big difference — you kind of mentioned this — which is that the way the sparsity is done is very different: here, each dendrite gets inputs that are mutually exclusive with every other dendrite on that same neuron, whereas in our case we want a distributed encoding, so we don't have that restriction, exactly.
D: Definitely in the input — in the categorical input, if it's a one-hot encoding, then it wouldn't matter; but if it's a distributed encoding — yeah — then we'd want a distributed sparse encoding in both cases. But here, even for the input, the inputs are mutually exclusive per branch, which is kind of weird; they're probably only doing it because of computational efficiency — there's no algorithmic reason to do it that way, I think. Yeah.
E: I think there is a biological reasoning in the paper: that, you know, an axon — each neuron is only going to connect to another neuron once, so it couldn't go to two branches. I think — yeah, I think what they say is that one axon from one neuron is not going to connect to two branches of the same neuron — the same output neuron — and they —
B: Because it has this, you know, cool property — and in their introduction they mention how a lot of the recent successes, a lot of the improvement in deep learning over the past few years, has been due to learnable activation functions such as maxout, so they wanted to use one of those. Yeah.
D: Yeah, maybe we're splitting hairs. It's definitely learning the weights, which has an impact, and in maxout the activation function itself is fixed, right — like when we have a sigmoid with a bias, no one says we're learning an activation function, even though, for the sigmoid, the bias is actually moving the zero point back and forth, right.
A: So I think, a point here: part of this is just language and how we're choosing to talk about it. But the point is, these neurons here — these little maxout neurons — can learn complicated shapes in the input space, whereas a normal neuron is always just a hyperplane; it's always just a direction and a threshold. But here the threshold is kind of curvy, because they have these multiple branches, and that's — so this —
E: To go back to the earlier question — I also believe — no, I'm not sure, I read this paper like two weeks ago — but I also think they have a biological motivation for the maxout: you only need one branch to fire for the neuron to fire; you don't need more than one. So I think the maxout there is also biologically motivated.
B: Yeah — so I didn't have time to read through a lot of this paper, but they basically use a morphological — a one-hundred-percent morphological — learning rule, which is similar in spirit to HTM: you know, given a certain condition, you increment or decrement — or, in this case, they create or remove — binary synapses.