From YouTube: SDR Classifier
Description
Yuwei at Numenta describes the SDR Classifier and lots of discussion ensues. This was recorded at the Numenta office at an internal meeting in the summer of 2016.
So this is a technical presentation on the SDR classifier. I will first describe the problem of classification and prediction with HTM, then talk about the SDR classifier, which is a single-layer feedforward classification network, and how we learn the weights in the network with online learning algorithms. There will be some algorithmic description of the classifier, and some comparisons with the older CLA classifier.

So for a typical use of HTM, we have some streaming data feeding into encoders, and that goes into the HTM model, and then we can do a bunch of useful things with it: anomaly detection, prediction, or classification. In this talk I will focus on the classifier part, and on how we use the high-dimensional sparse representations in HTM for prediction and classification tasks.
So currently there are three classifiers in NuPIC. The first one is a k-nearest-neighbor classifier, which is typically used for categorical classification. It maintains a set of templates stored in memory, but it does not evaluate the full predictive distribution; it just gives you the best match. It's very simple, but it may not work very well on online prediction tasks. The CLA classifier is the one we have been using.
So here is the setup of our classification problem. The goal is to map a sequence of high-dimensional SDRs, labeled as x here and changing over time, to a distribution over a set of K classes. This is the output of the classifier, which is also changing over time. This predictive distribution should sum to one at any time point, and the goal is to have a high prediction probability for the true class label, which comes from the training data; for a typical prediction task that is maybe five steps ahead in time.
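Roughly formalized (my notation, not from the slides), the setup is:

```latex
% x_t: the high-dimensional binary SDR at time t; K: number of classes;
% k: prediction horizon (e.g. 5 steps); z_{t+k}: the true class label k steps ahead.
\[
x_t \in \{0,1\}^{n} \;\longmapsto\; \hat{p}_t = \bigl(\hat{p}_t(1), \ldots, \hat{p}_t(K)\bigr),
\qquad \sum_{j=1}^{K} \hat{p}_t(j) = 1,
\]
\[
\text{with the goal that } \hat{p}_t\bigl(z_{t+k}\bigr) \text{ is as high as possible.}
\]
```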
You're not actually classifying this sequence, you're really classifying the SDR, the single SDR at that point, not the sequence? Yeah. Okay, so the idea is you take whatever the current state is, and you're saying that state was derived from a series like this, yeah.
Maybe I can clarify with a sketch. So this is the current state of the HTM; imagine it is a high-dimensional SDR, a big vector here. I want to map it to a set of classes. Say I have three classes, and I want to know the probability that the current input lies in each class. So here the probability distribution sums to one, and along the target class we want a high prediction probability. So the input basically encodes the current state, and this is the output of my classifier.
So a single-layer feedforward classification network is just like that. It's linear: each unit here first takes a weighted summation of all the inputs, so the weight matrix W is the only parameter, that is, those are the parameters of the model. And because the output is a probability distribution, there is an additional non-linearity, called the softmax, to make sure that the prediction probabilities sum to one. It basically takes the exponential of each unit's input and then divides by the sum over all units, so it's a normalization.
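As a minimal sketch of that inference step (variable names and shapes are mine, not NuPIC's actual SDRClassifier API):

```python
import numpy as np

def softmax_inference(weights, active_bits):
    """Class probabilities for one binary SDR.

    weights     -- (n_inputs, n_classes) connection weight matrix
    active_bits -- indices of the bits that are 1 in the input SDR
    """
    # Weighted sum: with a binary, sparse input the dot product reduces to
    # summing the weight rows of the active bits only.
    activation = weights[active_bits].sum(axis=0)
    # Softmax non-linearity: exponentiate, then normalize so the result sums to 1.
    exp_act = np.exp(activation - activation.max())  # subtract max for numerical stability
    return exp_act / exp_act.sum()
```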
The softmax is a very well-known thing. And the question is: how should we learn those connection weights such that the prediction probabilities match the data? We use maximum likelihood estimation. The likelihood is basically a metric of how well you are modeling the data. Here we are trying to predict the data, that is, the true label, and the likelihood is simply the probability of observing the true data under the predicted distribution. So here is my model's predicted distribution, and I want to make sure that the true data is likely to occur under it.
Typically, people use what is known as the negative log-likelihood loss. It is simply the negative logarithm of the likelihood, and we use this trick because the logarithm is a monotonic function: maximizing the likelihood is equivalent to minimizing the negative log-likelihood. The way we do that is gradient descent on this loss function. Basically, you calculate the gradient of the loss function with respect to all the parameters in the model,
that is, the connection weight matrix. The full derivation is available in the document, but after the derivation it turns out to be very simple: the gradient is basically the difference between your model's actual output and the target output, times the input. So this is the connection weight from the i-th input to the j-th class, and you adjust it in proportion to this gradient. It's somewhat intuitive to see how this comes out; basically it is derived using the chain rule.
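In symbols (my notation; this is the standard softmax-regression gradient the talk refers to):

```latex
% y_j: predicted probability of class j; t_j: target (1 for the true class, else 0);
% x_i: the i-th input bit (0 or 1); \alpha: learning rate.
\[
a_j = \sum_i w_{ij}\, x_i, \qquad
y_j = \frac{e^{a_j}}{\sum_k e^{a_k}}, \qquad
L = -\log y_c \;\; (c = \text{true class}),
\]
\[
\frac{\partial L}{\partial w_{ij}} = (y_j - t_j)\, x_i
\qquad\Longrightarrow\qquad
w_{ij} \leftarrow w_{ij} - \alpha\, (y_j - t_j)\, x_i .
\]
```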
We consider a binary sparse vector as the input, and the target output is also zero or one. At any time point we have one target output, the one-hot encoding of the true class label. So here is the algorithmic description of the SDR classifier. There are basically three phases: initialization, inference, and learning. Initialization is simply to initialize the connection weight matrix W_ij to be zero everywhere. That implies that all classes occur with equal probability before learning.
This is obvious if you look at the activation function in the classification network: with all-zero weights before learning, you get the same probability for all classes. Inference is to calculate the model's predicted class probability for each input pattern x, using the same softmax equation. And learning involves adjusting the connection weights W_ij in proportion to the gradient; since we consider binary inputs here, x_i is either zero or one.
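Putting the three phases together, a minimal sketch could look like this (a simplification for illustration, not the actual nupic SDRClassifier, which also handles multiple prediction steps and bucket bookkeeping):

```python
import numpy as np

class TinySDRClassifier:
    def __init__(self, n_inputs, n_classes, lr=0.1):
        # Initialization: all-zero weights, so every class starts with equal probability.
        self.w = np.zeros((n_inputs, n_classes))
        self.lr = lr

    def infer(self, active_bits):
        # Inference: softmax over the weighted sum of the active input bits.
        a = self.w[active_bits].sum(axis=0)
        e = np.exp(a - a.max())
        return e / e.sum()

    def learn(self, active_bits, true_class):
        # Learning: gradient step (y - t) * x; x is binary and sparse, so only
        # the weight rows belonging to the active bits are ever updated.
        y = self.infer(active_bits)
        t = np.zeros_like(y)
        t[true_class] = 1.0
        self.w[active_bits] -= self.lr * (y - t)
```

In a streaming setting you would call `learn(active_bits, label)` at each time step, and `infer(active_bits)` gives the current predicted distribution.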
So basically, we only adjust the weights for the active inputs; we don't need to adjust all the other connections. At any time, only a very small fraction of the weight matrix is updated, because the input is sparse. Additionally, for scalar value prediction we keep a running average of the actual values that correspond to each class, the same as the old CLA classifier does, just to make the prediction a little bit more accurate.
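For the scalar-value side, the idea is roughly this (my sketch of the bookkeeping, not the exact NuPIC code):

```python
class BucketAverages:
    """Running average of the actual scalar values seen for each class/bucket,
    so a predicted class can be mapped back to a scalar prediction."""
    def __init__(self, n_classes):
        self.sums = [0.0] * n_classes
        self.counts = [0] * n_classes

    def update(self, bucket, actual_value):
        self.sums[bucket] += actual_value
        self.counts[bucket] += 1

    def value(self, bucket):
        return self.sums[bucket] / self.counts[bucket] if self.counts[bucket] else 0.0
```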
The time complexity of this algorithm is proportional to the number of active bits at any time point times the number of classes. This is very easy to see, because the summation here only involves the active bits. Time complexity here means how the algorithm scales with respect to the size of the input and the number of classes.
So compared to the CLA classifier, this is somewhat more expensive: the CLA classifier's time complexity only scales with the number of active bits, whereas here we additionally multiply by the number of classes, because we want to evaluate the full distribution at any time point.
The sparsity is very important, because it means we only need to update a very small fraction of the weights. A side benefit is that, because a lot of the weights are not tuned at any given time, in practice the classifier seems to be less prone to overfitting compared to networks that use dense input vectors. That's just an observation.
Finally, because HTM represents a union of multiple predictions, the SDR classifier also evaluates the full predicted distribution, so it reinforces correct predictions and also penalizes incorrect predictions. That second part is not in the CLA classifier, and that's the reason why the old CLA classifier occasionally gives you outliers: with its voting scheme, only the correct predictions get reinforced, and the incorrect ones are never penalized. And here is a simple experiment of classifying random SDRs.
Sorry, I don't understand. You say you have twenty labeled SDRs, yeah? What you're doing in a streaming scenario, I mean, is labeling the state of the HTM at any point in time, yeah? Are you streaming in a sequence and then classifying it, or are you classifying at each step?
Okay, it's not really that. The data itself is not sequence data at this point; it's just random SDRs. Yeah, and you're just updating at every data point, right? Yeah. If you had said you predict the label in an online learning fashion, that would have been the same thing, yes, and clearer for me. Okay.
I could do that, yeah. So this is training with noisy data. Here I'm showing you performance as a function of noise level. The SDRs have forty active bits out of 2,000, typical SDRs, so a noise level of forty would be completely random. As you can see, the new SDR classifier is still perfect, while the CLA classifier starts getting worse immediately as you add noise; that's the effect shown right here as the noise increases.
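To make the noise level concrete: with 40 active bits out of 2,000, a noise level of n means n of the active bits are moved to random other positions, so at n = 40 none of the original pattern is left. A rough illustration of that idea (my reconstruction, not the actual experiment code):

```python
import numpy as np

def add_noise(active_bits, noise_level, n_total=2000, rng=np.random):
    """Replace `noise_level` of the active bits with randomly chosen inactive bits."""
    active = list(active_bits)
    replace_idx = rng.choice(len(active), size=noise_level, replace=False)
    inactive = np.setdiff1d(np.arange(n_total), active)
    new_bits = rng.choice(inactive, size=noise_level, replace=False)
    for i, b in zip(replace_idx, new_bits):
        active[i] = int(b)
    return active
```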
The third experiment, shown in a different chart, is continual learning: I train until the performance is stable, then I switch to a different data set and see how quickly it adapts to the new data set. The SDR classifier takes about the same amount of time to get back to perfect performance,
whereas the CLA classifier somehow never recovers to its previous baseline. I think that's because a lot of the false predictions are still there and are never penalized, so you learn the new ones, but the old ones are still there. So basically your distribution gets less precise.
We also use the SDR classifier for the taxi passenger count prediction, which is in the Neural Computation paper. Again, the task is to predict future taxi demand. Here I'm using an encoder, sequence memory, and SDR classifier network, so it is actually classifying the states of the HTM; the last layer is the SDR classifier. If you use the traditional root mean square error metric, the SDR classifier does much better, and then you can also use a metric beyond the traditional ones, the negative log-likelihood.
Also, the prediction looks much cleaner. As you can see here, with the old classifier there are a lot of false predictions, and occasionally you even get a dramatic outlier here, which has a very big impact on the error. Can you explain what those red things are? Okay, so the black is the data, the true data we are trying to predict. The blue is the best prediction according to the classifier, and the red is the underlying predicted distribution of the data according to the classifier.
Well, there is something slightly different because of the sparsity of the patterns you are classifying. I'm just curious how it would react: in a typical HTM sequence memory you often have all these unions of states at once, and so if you try to classify that, how does this behave in that situation?