From YouTube: Sparsity in Neural Networks (Brains@Bay Meetup)
Description
Some presentation slides posted on the meetup page: https://www.meetup.com/BraIns-Bay/events/263945823/
0:00 Subutai Ahmad - Sparsity in the Neocortex
24:05 Lucas Souza - Literature Review
43:46 Hattie Zhou - Deconstructing Lottery Tickets
1:19:00 Gordon Wilson - Sparsity in Hardware
Okay, we'll just get started. Lucas asked me to talk about sparsity. This meetup is all about the intersection between neuroscience and machine learning, and how we can learn from neuroscience to impact machine learning, and so today's topic is sparsity. What I thought I'd do is give a review of sparsity in the neocortex, and I'll focus on the experimental data; the rest of the talks, I think, are more machine learning oriented.
So that's kind of what I was showing there: if you look at a collection of neurons, how many are active right now? Generally it's a small percentage. And then there's this other one called lifetime sparsity, which is: if you look at a single neuron over its lifetime, across all the stimuli it's getting, how often is it actually active?
How do you learn? Well, you learn by changing the network structure. The cortex learns by adding and dropping connections all the time, so the connectivity is very sparse, but the connectivity is dynamic; it's constantly changing. This is a result that shows how automatic this is. Here it's looking at one of these dendritic segments over many days, and the red triangles show synapses, or connections, that were either added or dropped over that time.
This movie shows this kind of thing happening. This is a neuron and its dendrites, and what you'll see is that eventually these axons are actually growing and making connections; there's an axon coming in, so it's an output from another neuron that's coming in and forming a connection. This is what our brains are doing every day: these neurons are moving around, the outputs are changing, axons are forming and dropping connections, things like that. So it's pretty remarkable.
So one thing I was curious about when I was doing these slides is: in how many ways is the neocortex sparse? I started off with three different types of sparsity. Population sparsity means a small percentage of the neurons are active right now; lifetime sparsity means specific cells don't fire that often; and then there's the dynamic sparsity of the connectivity itself.
Okay, so I'll leave you with this, since we're going to switch to machine learning. I've tried to convince you that the neocortex has extremely sparse connectivity, extremely sparse activations, sparse learning, sparse weight values, and very sparse energy usage. I would say the neocortex is an existence proof that an extremely sparse dynamic system can operate, and operate more intelligently than any dense machine learning system in existence today.
Sparsity works: it's possible for such a really sparse system to do very well; it's an existence proof. The question, though, is: is this really required? This is an interesting topic. I think this type of dynamic sparsity, like in the neocortex, is required for building intelligent systems. If you really want to have anything at reasonable scale, you're going to have to have very efficient energy usage and be able to continuously learn.
So next we'll do a brief literature review. I'm not an expert: I was not an expert in pruning and also not an expert in sparsity, but I do research in machine learning. I went through the papers, and this review contains a lot of my opinions, so it's really great if you stop me as we go.
There's a lot of work happening in this space, and there's a newer wave of work on pruning, which most of the new papers reference: it's by Han et al., 2015. He was able to reduce storage and computation by an order of magnitude without affecting accuracy, just by learning which connections are important, and the criterion is mainly the weight magnitude. The approach Han follows is a three-step approach: first he trains the network, then he prunes it, and then he retrains the remaining network, i.e., fine-tunes it.
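As a concrete illustration, here is a minimal sketch of that three-step recipe in Python; the `train` function is a stand-in for a real training loop, and the layer names and sizes are invented for the example.

```python
# A minimal sketch of the train -> prune -> fine-tune recipe (in the spirit
# of Han et al., 2015). The model here is just a dict of NumPy matrices.
import numpy as np

def train(weights, masks, steps=1000):
    """Placeholder for a real training loop; here we only re-apply the
    masks so pruned connections stay clamped to zero."""
    for name in weights:
        weights[name] *= masks[name]
    return weights

def magnitude_prune(weights, masks, fraction=0.2):
    """Zero out the `fraction` of currently surviving weights with the
    smallest absolute values (the magnitude criterion)."""
    for name, w in weights.items():
        alive = w[masks[name] == 1]
        threshold = np.quantile(np.abs(alive), fraction)
        masks[name] = np.where(np.abs(w) < threshold, 0.0, masks[name])
        w *= masks[name]
    return weights, masks

rng = np.random.default_rng(0)
weights = {"fc1": rng.normal(size=(784, 300)), "fc2": rng.normal(size=(300, 10))}
masks = {k: np.ones_like(v) for k, v in weights.items()}

weights = train(weights, masks)                   # 1. train the dense network
weights, masks = magnitude_prune(weights, masks)  # 2. prune small-magnitude weights
weights = train(weights, masks)                   # 3. fine-tune the survivors
```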
And in Frankle and Carbin's work, he resets the weights to their initial values, while in a different comparison he just randomly reinitializes the weights, and only with the reset can you get the same kind of performance. So both of these works are showing that pruning is not as brittle as we used to think.
The last question asked is to what extent neural networks can do without learning parameters at all: are there solutions for a given task? The way he approaches this problem is that he has one single shared weight: all the weights have the exact same value, and he is not training the network. He's evaluating this network against a set of reinforcement learning tasks, and he's using the reward as a signal for a genetic algorithm that learns how to evolve the network.
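To make the shared-weight idea concrete, here is a toy sketch in the spirit of the weight-agnostic networks work; the topology, rollout, and candidate weight values below are all invented for illustration.

```python
# Toy illustration of the shared-weight evaluation: every connection in a
# candidate topology carries one common weight value, and the topology is
# scored across several such values; that score is the fitness signal a
# genetic algorithm would use to evolve the topology.
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, n_out = 4, 8, 2
# A candidate topology: a random sparse connectivity pattern (0/1 entries).
topology = (rng.random((n_in + n_hidden, n_hidden + n_out)) < 0.2).astype(float)

def forward(obs, shared_w):
    """One feedforward pass where every edge uses the same weight value."""
    h = np.tanh(obs @ (shared_w * topology[:n_in, :n_hidden]))
    return h @ (shared_w * topology[n_in:, n_hidden:])

def episode_reward(shared_w):
    """Placeholder rollout: in the real setup this runs an RL episode."""
    obs = rng.normal(size=n_in)
    return -np.sum(forward(obs, shared_w) ** 2)  # dummy reward

# Fitness of the topology = performance averaged over several shared weights.
fitness = np.mean([episode_reward(w) for w in (-2.0, -1.0, 0.5, 1.0, 2.0)])
```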
Hey guys, I'm Hattie, glad to be here. Thanks, Lucas. So today I will be talking about the paper that was mentioned earlier, called Deconstructing Lottery Tickets; it's joint work with other folks at Uber AI Labs, including Jason, who's sitting there. So this work is building on the idea of neural network pruning and the lottery ticket hypothesis. To summarize: neural network pruning is a popular way of reducing the size of neural networks, and it typically follows a standard procedure.
As Lucas mentioned, at high levels of pruning you can actually get networks that are ten times smaller with no drop in accuracy. So if pruning works so well, why don't we just try to train a pruned network from the start? The answer, of course, is that it doesn't work: if you randomly reinitialize the weights and you train a pruned network, it does not reach the same accuracy.
So recently, this paper by Frankle and Carbin called The Lottery Ticket Hypothesis showed that you can actually train a pruned network from scratch, but only if you maintain the same original initialization for the different weights. So they proposed a variant of the pruning algorithm, which I'll call the lottery ticket algorithm.
So the steps are: (1) you randomly initialize the network; (2) you train it to convergence; (3) you prune the weights that have the smallest final magnitudes. Up to this point it's the same as what we talked about before, but then they do this special step, (4), which rewinds the remaining weights back to their original initialization values, and then (5) they train the network from this point.
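A compact sketch of that procedure, with a placeholder `train` function standing in for real training to convergence:

```python
# Minimal sketch of the lottery ticket algorithm: initialize, train, prune
# by smallest final magnitude, rewind survivors to their initial values,
# and retrain. All sizes are invented for the example.
import numpy as np

rng = np.random.default_rng(0)
w_init = rng.normal(scale=0.1, size=(300, 100))   # 1. random initialization

def train(w, mask):
    """Placeholder for training to convergence; the noise stands in for
    the weight updates a real training loop would apply."""
    return (w + rng.normal(scale=0.05, size=w.shape)) * mask

mask = np.ones_like(w_init)
w_final = train(w_init, mask)                     # 2. train to convergence

prune_frac = 0.8                                  # 3. prune smallest |final|
threshold = np.quantile(np.abs(w_final), prune_frac)
mask = (np.abs(w_final) >= threshold).astype(float)

w_rewound = w_init * mask                         # 4. rewind survivors to init
w_ticket = train(w_rewound, mask)                 # 5. retrain from the rewind
```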
So we do this in an iterative fashion, repeating steps 2 to 4 and removing an additional percentage of the weights of the network each time. Using this procedure, they show some interesting results. Here the y axis shows the test accuracy of the networks and the x axis shows the level of pruning, which is basically the percentage of weights remaining, and here we'll look at a convolutional network trained on CIFAR-10. The black line represents the original accuracy of that network without any pruning.
So, as we move from the left side to the right side of the x axis, we prune the network more and more, and we see that the pruned networks actually perform even better than the original network, and if we continue to prune, the performance drops back down to match the original full network. But at this point we've aggressively pruned the network, so we only have about five percent of the weights.
So, based on these results, they proposed the lottery ticket hypothesis: that randomly initialized dense neural networks contain subnetworks that are initialized such that, when trained in isolation, they can match the performance of the original network. They call these subnetworks winning tickets, and suggested that it's a combination of their initialization and structure that makes their training particularly effective.
One caveat: the original experiments only looked at CIFAR and MNIST, and there is newer work showing you don't really see this on ImageNet. What you actually need on ImageNet is, instead of rewinding to the original initialization, to rewind back to some initialization value at, say, epoch one.
It's because normally in your classification layer you're going from however many neurons in the layer before that to maybe just ten, right, if it's CIFAR, since you're just going to 10 labels. So if you're using 95% sparsity, you only have, I don't know the math offhand, but a very small number of parameters. So you're left with too few parameters in that final classification layer.
Here's something we observed while training these networks, and I promise it will become relevant to that question later. So imagine you initialize a network randomly and you apply it on the MNIST dataset without training the network. How well do you think it would do? Well, if you don't train the network, you would expect no better than chance performance, which is 10 percent.
That's why in our title we call it a supermask. Well, it turns out that in answering our first question, we can also provide an explanation for supermasks. The pruning procedure performs two actions: it sets the pruned weights to zero, and it keeps them there during retraining. We can decouple the effects of these two actions by running a simple experiment.
So, instead of setting the pruned weights to zero, we can freeze them at their randomly initialized values. If the values of the pruned weights don't matter, then this should perform similarly well. However, we see that that's not the case: if we freeze the pruned weights at their initial values, the performance is significantly worse.
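A minimal sketch of that decoupling experiment, with invented shapes and a stand-in gradient: variant A zeroes the pruned weights, variant B freezes them at their initial values.

```python
# Decoupling the two actions of pruning: zeroing a weight vs. freezing it.
# Per the talk, variant A (freeze at zero) trains well, variant B (freeze
# at the random initial value) is significantly worse.
import numpy as np

rng = np.random.default_rng(0)
w_init = rng.normal(scale=0.1, size=(300, 100))
mask = (rng.random(w_init.shape) < 0.2).astype(float)  # 1 = kept, 0 = pruned

def masked_step(w, grad, frozen_values, mask, lr=0.1):
    """Update only the kept weights; hold pruned ones at `frozen_values`."""
    w = w - lr * grad
    return mask * w + (1 - mask) * frozen_values

grad = rng.normal(size=w_init.shape)  # stand-in for a real gradient
w_zeroed = masked_step(w_init, grad, np.zeros_like(w_init), mask)  # variant A
w_frozen = masked_step(w_init, grad, w_init, mask)                 # variant B
```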
So this seems to suggest that the values of the pruned weights do contribute to the overall performance of the network, and it seems that zero is a particularly good value for them. To see why that might be, we need to take a closer look at the mask criterion we're using. We can think of different mask criteria as regions on a 2D plane, with the x axis being the value of the initial weight and the y axis being the value of the final weight. This plot represents a distribution of weights from a given layer.
That's partly because the initial values are correlated with the final values. So the mask criterion used by the lottery ticket algorithm keeps the weights with the largest final magnitudes, regardless of what their initial values are; we refer to this as large final. It sets the weights with the smallest final magnitudes to zero, so what it's actually doing is setting to zero the weights that end up closest to zero at the end of the training process.
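In code, the large final criterion is just a threshold on the final magnitudes; a minimal sketch (the function name and keep fraction are our own):

```python
# The "large final" criterion: keep the weights whose final magnitude is
# largest, ignoring initial values entirely.
import numpy as np

def large_final_mask(w_init, w_final, keep_frac=0.2):
    """Return a 0/1 mask keeping the top `keep_frac` of |w_final|."""
    threshold = np.quantile(np.abs(w_final), 1.0 - keep_frac)
    return (np.abs(w_final) >= threshold).astype(float)
```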
It's the final values that matter. To test this hypothesis, we can run a second experiment: for any weight to be pruned, we set it to zero only if it moved toward zero over the course of training, and we freeze it at its initial value otherwise. Using this treatment, we get networks that perform just as well as the lottery ticket networks, even though we did not set all of the pruned weights to zero.
Given our view on masking as training, an obvious thing we can try is, instead of keeping the weights with the largest final magnitudes, to keep the weights that increased in magnitude the most over training. Let's illustrate it here and call it magnitude increase. This mask criterion basically explicitly sets to zero the weights that moved most toward zero during training.
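The magnitude increase criterion differs only in what gets thresholded; again a minimal sketch with our own naming:

```python
# The "magnitude increase" criterion: keep the weights whose absolute value
# grew the most during training, i.e. prune those that moved toward zero.
import numpy as np

def magnitude_increase_mask(w_init, w_final, keep_frac=0.2):
    """Return a 0/1 mask keeping the largest values of |final| - |init|."""
    growth = np.abs(w_final) - np.abs(w_init)
    threshold = np.quantile(growth, 1.0 - keep_frac)
    return (growth >= threshold).astype(float)
```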
Does it work? Well, luckily, yes, as we might expect. That's the green line compared to the original large final criterion, so this works, and sometimes works significantly better than large final. In the paper we have also tried a bunch of different mask criteria, not all of which we expected to work, but we wanted to check our understanding. I won't really go through all of them, but we see some criteria that can produce lottery tickets, in that they can match the performance of the original network, and a bunch that don't.
A weight's initial value has two components: the magnitude and the sign. Is it the combination of the two that we must keep? To see, let's go back to this ablation; remember, this is where we reinitialize the weights randomly. To see which component is important, we can try a variant of this where we reinitialize the weights, but then force them to have the same sign as their original initialization. That's shown by the solid yellow line here, and the black line is the baseline, which is random reinitialization.
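A two-line sketch of that sign-preserving variant, assuming NumPy arrays of weights:

```python
# Reinitialize randomly, then force each weight to carry the sign of its
# original initialization. Shapes and scales are invented for the example.
import numpy as np

rng = np.random.default_rng(0)
w_init = rng.normal(scale=0.1, size=(300, 100))

w_reinit = rng.normal(scale=0.1, size=w_init.shape)  # random reinit (baseline)
w_signed = np.abs(w_reinit) * np.sign(w_init)        # same signs as original
```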
So the other thing you might think is that, since the initial values correlate with the final values, and we're keeping the weights with large final values, the initial values of the kept weights may not have the same distribution as the overall network, and that's illustrated here: the blue represents the kept weights' initial values. So perhaps if we follow this distribution to reinitialize the weights, that could work better, and the way we do that is basically by shuffling the kept values within each layer.
So that also does not work, and that's shown by the dashed line here; sometimes it gets pretty unstable as well. However, if we maintain the sign, we see that the network works much better, and it's pretty close to the original. So this seems like it's the sign: yes, there is a pattern here, and the sign seems to be the key. Thanks.
Interestingly, if you convert all the weights to constant values, similar to the mask-one experiments, we can actually get the network to work even better, up to 86 percent on MNIST. So in this network, basically all the values are either zero or plus or minus a single constant. We also wanted to see if we can push the performance of this by learning the mask directly, so we're training the mask itself.
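For illustration, here is a minimal sketch of what applying a supermask to a completely untrained network looks like; the mask here is random, whereas the paper's masks are chosen by criterion or learned, and the architecture is invented.

```python
# A "supermask" is a binary mask laid over an untrained, randomly
# initialized network; per the talk, a good mask alone can push MNIST
# accuracy far above the 10% chance level.
import numpy as np

rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=(784, 300)), rng.normal(size=(300, 10))
m1 = (rng.random(w1.shape) < 0.5).astype(float)   # a candidate supermask
m2 = (rng.random(w2.shape) < 0.5).astype(float)

def predict(x):
    """Forward pass of the *untrained* network with the mask applied."""
    h = np.maximum(x @ (w1 * m1), 0.0)            # ReLU hidden layer
    return np.argmax(h @ (w2 * m2), axis=-1)
```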
We'll start at a pretty high level, though in the interest of time I will not spend too much time on all the very high-level motivation stuff. I typically give this presentation to a broad range of audiences, and you all probably want something more technical, so hopefully the beginning will just be a little bit of a refresher.
It's a fun presentation I like to give, but we'll definitely dive into the meat of it too, and since everybody's ready, let's just get started. All right: I'm the CEO and founder of Rain Neuromorphics. We are just down the street, a five-minute walk, and we have a few of our folks here, including Jack, our CTO, and our first employee, who all walked over to join for this meeting today. So we build processors for artificial intelligence that are inspired by the brain.
One of the books that Jack read, more than five years ago now, was On Intelligence, which was one of the core inspirations to invent this technology. We'll get right to it. So our mission is to build the first hardware that can power brain-scale intelligence. We'll talk a lot about scaling neural compute: basically, how the existing paradigms scale, what the limitations are, and how we are seeking to improve upon that.
We started in Gainesville, Florida, and we moved back here about a year ago. We have on the team the guy who co-designed GPS chips, which is cool. Some of our investors: our biggest investor is the CEO of OpenAI, Sam Altman, who came in last summer, as well as some of the folks behind the block-sparse paper, and our key partners we work with include TSMC.
There are six parts, but we'll certainly get through the first five: giving a history and the motivations behind why you need a fundamental paradigm shift in scaling neural compute, and then talking about our hardware and how sparsity is really the core concept that underlies the motivation for our hardware and how it works. The first part we'll go through quickly, but hardware and AI have an intertwined history: the first perceptron implemented in hardware was called the Mark I.
Spoiler alert: over the course of the last sixty years, we've seen these ebbs and flows of AI summers and winters, but one thing that we see across the board is that they were really fundamentally defined by the computing power that people had available at the time. People always had really powerful imaginations and were creative in thinking about the algorithms we could create, but they were limited by the hardware they had to run them. So the first summer was from 1956 to 1974.
It was marked by this kind of untethered, undirected funding from the governments in the US and UK, and that ended with our first winter around 1980. Notably, an NLP model from that first summer had a vocabulary of just 20 words, because that was the best the computers' memory could support at the time. And Hans Moravec, during that winter, had said that we need about 1 million times better computers; funnily enough, if you track Moore's law scaling, we're really quite close to 1 million times right now.
Really, Moore's law kept scaling up, and the benefit of Moore's law meant that people could build rules-based AI systems that really started to capture people's imaginations of what we could actually accomplish. This was when the DARPA Grand Challenge was won for the first time, with the Stanford robot going out; IBM Watson was on Jeopardy, those types of things. And of course we get to 2012 and the GPU revolution. That's where we are today, and that's what modern deep learning is all defined around.
So in part two we'll talk about AI silicon today: what are we using, and why is this the de facto hardware for training and inference? I don't need to tell you about Nvidia; it positions itself as the deep learning company, and if you buy something from Nvidia today, it looks like this: the top-of-the-line GPU for training networks in a datacenter costs about $10,000, and you're using it to run neural networks.
The GPU just hasn't really changed very much: it's being used to perform matrix multiplication. This is the core operation that underlies everything. Whether you're looking at graphics, in which case the matrix corresponds to pixels or polygons that you modify by some type of transformation, or a neural network, where the matrix corresponds to the weights and the vector is the activations moving from one layer to the next, it's the same operation.
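That core operation is tiny to write down; a minimal sketch of one layer's forward pass, with invented sizes:

```python
# A neural network layer is just a matrix-vector multiply: weights times
# incoming activations, followed by a nonlinearity.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512))   # weights connecting layer k to layer k+1
a = rng.normal(size=512)          # activations leaving layer k
z = W @ a                         # the matrix multiply a GPU accelerates
a_next = np.maximum(z, 0.0)       # ReLU, then on to the next layer
```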
It wasn't the first time someone used a GPU to train a neural network, but it was the first time someone used multiple GPUs to divide the training, and used that parallelism to speed up the training of their networks and to have a larger model. This was AlexNet, and it broke the ImageNet benchmark by a wide margin, so this really kicked off the renaissance of deep learning that we're in today, and since then the amount of compute we're throwing at these models has just been increasing like crazy.
And while variations in architecture can incrementally improve model performance, it seems that if you just throw more compute at a model and make the model bigger, it performs better across the board. That's essentially the quote from Rich Sutton, the father of reinforcement learning.
This is a bit of review, particularly since all you folks are in the machine learning world, but bear with me. We're seeing every month these incredible advancements in the three areas that we're most excited about at Rain. Generative models: these definitely captured people's imagination in the last few months with generated faces, and there are all types of debates and things there that are interesting.
There are other interesting uses of this too, like companies that generate new drugs, new types of protein structures that are effective at fighting disease, which is incredibly promising to me. And of course reinforcement learning, which captured the world's imagination with AlphaGo; I just spoke with someone from the Google Brain team, and I need to find the paper to show you, but they were describing improvements of ninety percent from last year.
So these reinforcement learning models are getting good. And of course natural language processing: the transformer models, BERT, GPT-2. It's really amazing to see what these models are capable of doing right now, both in terms of generative measures as well as question answering, but these are some of the biggest models that we have seen, and the issue is that these models can be enormous.
So this comes back to the question of how AI is trained today: whether it's GPUs or TPUs, at the core are these same operations. And of course we look to the brain for inspiration, and the brain, we see, is fundamentally different; as we'll come to, you'll see that all of these points have a common theme.
It's worth emphasizing the value of small-world connectivity and why this is a very special type of sparsity. You have two opposing paradigms: fully connected versus locally connected. If you fully connect a system, you'll have a very short path length, a path length of one from any point to any other point, but you're going to spend a lot on all the wires, all the connections. If you're just locally connected, you will save on your wiring cost.
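A quick way to see the trade-off numerically is the classic Watts-Strogatz construction (this uses the networkx library; the sizes are arbitrary):

```python
# Start from a locally connected ring and randomly rewire a small fraction
# of edges into long-range shortcuts: same wiring cost, far shorter paths.
import networkx as nx

n, k = 1000, 10                    # 1000 nodes, each wired to 10 neighbors
local = nx.watts_strogatz_graph(n, k, p=0.0)         # purely local wiring
small_world = nx.watts_strogatz_graph(n, k, p=0.1)   # a few random shortcuts

# Same number of edges in both graphs, drastically shorter average paths.
print(nx.average_shortest_path_length(local))        # roughly n/(2k) = 50
print(nx.average_shortest_path_length(small_world))  # only a few hops
```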
This was a paper that we published last summer. So this means that we can build a new type of AI processor: we're using analog computation, using voltages and resistances to perform this matrix multiplication, but instead of the crossbar, which had the neurons only on the edges, we fill the entire chip with these neurons, packed edge to edge, close together, so we can have a huge density of neurons, and on top of that we overlay a random mesh of metal wires connecting them as a small-world network.
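To see why analog helps, note what the circuit computes: with weights stored as conductances and inputs applied as voltages, Ohm's law and Kirchhoff's current law perform the multiply-accumulate physically. A numerical stand-in with invented values:

```python
# What an analog array computes in one shot: the current collected on each
# output line is the conductance-weighted sum of the input voltages,
# i.e. I = G^T V, which is exactly a matrix-vector multiply.
import numpy as np

rng = np.random.default_rng(0)
G = rng.uniform(0.0, 1e-3, size=(128, 64))  # conductances (siemens) = weights
V = rng.uniform(0.0, 0.5, size=128)         # input voltages = activations
I = G.T @ V                                  # column currents = weighted sums
```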
There were definitely some eyebrows raised, but what's so exciting is that you can actually go back, and this was the intuition: this notion of randomness is something that a lot of people have been comfortable with for a long time. We randomly initialize weights, we randomly select neurons to drop out, and they randomly connected the neurons in the Mark I Perceptron.
This is a sparse matrix multiplication with about 99% sparsity. And again, that would be at 65 nanometer, while the GP100 was at 12 nanometer, but we're operating at two orders of magnitude better in speed and power. So we start at two orders of magnitude, and we have a roadmap to go beyond that.
To emphasize the scaling comparison again: when you're working with digital logic, you have order n-squared scaling in time; when you're working with analog physics, you have order n-squared scaling in space. But because we are removing what we believe are the fundamentally redundant connections in these networks, we can achieve order n scaling in both space and time, and we're the only chip architecture that we know of that can do that.
There's one operation we want to do really well, and because we'll just be doing that one operation, we're actually pretty agnostic to the compiler ecosystem. That means we don't have to build these massive CUDA-like software layers for this first product, and we just want to get it out into people's hands so we can start exploring: what can we do when this is so fast and so efficient? And we're starting out with a 100x improvement on both speed and energy.
It supports a few ranges of models, and ideally even more, and this is obviously looking further into the future beyond just our first product. But what we'll initially be able to demonstrate in the next six months is called reservoir computing. Basically, you have a giant space of neurons that are randomly connected; your input gets projected into there, and that puts the input into a higher-dimensional space, so that it's more easily separable by a linear classifier, but basically you don't have to train any of the recurrent weights.
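Here is a minimal echo-state-style sketch of that idea; the reservoir is fixed and random, only a linear readout is fit, and all sizes and data below are made up:

```python
# Reservoir computing in miniature: a big random recurrent pool is never
# trained; we only fit a linear readout on top of the reservoir states.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, T = 3, 200, 500
W_in = rng.normal(scale=0.5, size=(n_res, n_in))   # fixed random input map
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))    # scale for stable dynamics

u = rng.normal(size=(T, n_in))                     # input time series
y = rng.normal(size=T)                             # made-up targets

states = np.zeros((T, n_res))
x = np.zeros(n_res)
for t in range(T):                                 # run the fixed reservoir
    x = np.tanh(W_in @ u[t] + W @ x)
    states[t] = x

# Train only the linear readout (ordinary least squares).
w_out, *_ = np.linalg.lstsq(states, y, rcond=None)
```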
You don't have to have fine control over the memristors, but it's still a really powerful way to do time-series classification. Do we want to support backpropagation? We do support backpropagation: we've tailored a type of backpropagation that works with just an ARM chip, but the goal here is that we want to really just make it easy.
But it's also about the algorithms, and there's this: we just filed a provisional patent last week on a reduction to practice of an algorithm, and we're very excited about this. It's energy-based models. What energy-based models do is: you define an energy function and you tie the minimization of that energy to the minimization of the loss in your network.
And because we have a physical network of resistors, we have a physical energy: it's the dissipation of energy on this chip. So by measuring the dissipation of energy on this chip, we can actually understand how to train it, and we get the gradient for free, so to speak, by observing the dissipation of energy. It's an incredibly powerful thing.
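As a rough illustration of "tie energy minimization to loss minimization", here is a toy sketch in the spirit of equilibrium propagation (Scellier & Bengio, 2017); this is one algorithm in that family, not necessarily Rain's method, and all sizes and constants are invented:

```python
# Toy two-phase training on an energy function. Free phase: the state
# settles to a minimum of E(s) = 0.5||s||^2 - 0.5 s.W.s - x.s. Nudged
# phase: the energy is weakly biased toward the target. The contrast
# between the two settled states estimates the loss gradient.
import numpy as np

rng = np.random.default_rng(0)
n, beta, eta = 8, 0.5, 0.1
W = rng.normal(scale=0.1, size=(n, n))
W = (W + W.T) / 2                      # symmetric couplings
np.fill_diagonal(W, 0.0)

def settle(s, x, y=None, steps=200, lr=0.05):
    """Relax the state s down the energy, optionally nudged toward y
    by an extra term beta * 0.5 * ||s - y||^2."""
    for _ in range(steps):
        grad = s - W @ s - x               # dE/ds of the free energy
        if y is not None:
            grad = grad + beta * (s - y)   # nudge toward the target
        s = s - lr * grad
    return s

x = rng.normal(size=n)                 # input drive
y = np.tanh(rng.normal(size=n))        # made-up target

s_free = settle(np.zeros(n), x)        # free phase: settle to a minimum
s_nudged = settle(s_free, x, y)        # nudged phase: weakly clamp toward y
# Contrastive, purely local weight update from the two settled states.
W += (eta / beta) * (np.outer(s_nudged, s_nudged) - np.outer(s_free, s_free))
np.fill_diagonal(W, 0.0)
```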
The immediate consequence of this is that we could avoid the analog-to-digital conversion from layer to layer, which current analog approaches to neural networks need to do and which is incredibly energy hungry, so we can have very, very low energy consumption.
Like, imagine a billion parameters on a wearable. But probably the most exciting thing here is that, by using energy-based models supported by a physical system and using physics, we now have the tools of physics to rigorously investigate these networks.
We don't have rigorous analytical toolsets to really break down our deep neural networks today, but imagine if we had something like Maxwell's equations to actually understand them.
We want to become the next platform on which all artificial neural networks are built; we're certainly not wanting for ambition. And yeah, we envision a world where massive models are everywhere. Well, that's a world we already live in, because those are our brains, right? We have billions and billions of neurons and synapses, and we do it, so why can't we? So that's all I have. I do have these market slides, and I feel a little itchy announcing all this, but...
Each random mesh ends up being a little bit different, but provided you have a high enough density and an even enough distribution of the wires across the chip, what matters is the effective resistances between any two electrodes, and you can just nudge the weights: as we do, we pulse voltages to raise the resistances of the memristors.
You can get away, to a large degree, with fixed sparsity if you exploit the topology of the data. So if you look at convolutions: convolutions are close, but what they do is reflect the topology of the input; there's a 2D structure to your images, and convolutions are just breaking that up in this sparse way to naturally operate on things that are most likely related. So I think that if you map, if you make your... we have this 2D structure, and if you create, if you like...