Description
GPU Kernels for Block-Sparse Weights
https://openai.com/blog/block-sparse-gpu-kernels/
https://d4mucfpksywv.cloudfront.net/blocksparse/blocksparsepaper.pdf
Discussion at https://discourse.numenta.org/t/openai-paper-review-gpu-kernels-for-block-sparse-weights/6440
A
Yeah, so today we'll go over this paper. Briefly, it's "GPU Kernels for Block-Sparse Weights," written by Scott Gray, Alec Radford, and Diederik Kingma, and it came out of OpenAI. They also have an associated blog post, but here I'm just going to focus on the paper; I think the blog post covers the same material as the paper.
A
But the basic idea here is very related to what we're interested in: just how do you get sparse networks running fast? And their whole take is, I think, a really good discussion of block sparsity, which I think is sort of the limit of what you can do efficiently with sparsity on today's GPUs. So I put down a bunch of high-level questions that we can go over.
A
Basically, the idea is this: think about two layers of neurons, so there's a layer of neurons here and a layer of neurons here. The typical scenario is that you have completely dense connectivity: every neuron here gets input from all of the neurons there, with various weight values, and you can display that as a matrix. So with N input units and N units in this layer you have an N by N matrix, and the problem is that the weights scale quadratically with N. Block-sparse weights are shown here, and basically you have blocks of the matrix that are either entirely zero or kept fully dense.
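To make the block-sparse layout concrete, here is a minimal NumPy sketch (not the paper's CUDA kernels, and the sizes and block density are made up for illustration): only the nonzero blocks of the N by N weight matrix are stored, and the matmul loops over just those blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 256, 32                                # layer width and block size (illustrative)
nb = N // B
layout = rng.random((nb, nb)) < 0.1           # boolean block mask, ~10% of blocks kept

# Only the "on" blocks are materialized, so memory scales with the number of
# active blocks rather than with N^2.
blocks = {(int(i), int(j)): rng.standard_normal((B, B)).astype(np.float32)
          for i, j in np.argwhere(layout)}

def block_sparse_matmul(x, blocks):
    """y = x @ W for a block-sparse W, looping only over the nonzero blocks."""
    y = np.zeros((x.shape[0], N), dtype=np.float32)
    for (i, j), w in blocks.items():
        y[:, j * B:(j + 1) * B] += x[:, i * B:(i + 1) * B] @ w
    return y

x = rng.standard_normal((4, N)).astype(np.float32)
y = block_sparse_matmul(x, blocks)            # shape (4, 256)
```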
C
My assumption is that once you pick these blocks, you've got to stay with them, right? Or can you reconfigure them as well? Okay, yeah. So let me ask this: on our side, on the left side there, we have this sort of bit-by-bit sparsity. Can you translate that into 8x8 or 32x32 blocks?
C
Yeah, yeah. My question, going back, is about doing this sort of transformation. We have some randomly sparse connectivity here that we're talking about; it looks kind of random, it's not block-structured. Can't you just take that and remap it, map those connections or bits into a block, do all your processing, and then map them back again?
B
For some problems you can do row and column permutations to kind of get the nonzeros all clustered, but only some sparse matrices lend themselves to that. When you do the row permutation here, it upsets the order over here. The simple example is: if you have one, zero, zero, one, like an identity pattern, it doesn't matter how you permute it; it's going to be the same thing. Yeah.
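A small illustration of that point, with assumed sizes: a pattern with one nonzero per row and per column (like the identity) can never be packed into dense blocks by permutations, since any B by B block can hold at most B of its nonzeros.

```python
import numpy as np

n, B = 8, 4
eye = np.eye(n, dtype=bool)                          # one nonzero per row and per column
permuted = eye[np.random.permutation(n)][:, np.random.permutation(n)]

def block_occupancy(mask, B):
    """Fraction of nonzeros inside each B x B block."""
    nb = mask.shape[0] // B
    return mask.reshape(nb, B, nb, B).sum(axis=(1, 3)) / (B * B)

print(block_occupancy(eye, B))        # diagonal blocks are only B/B^2 = 25% full
print(block_occupancy(permuted, B))   # still at most 25% per block, however you permute
```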
C
Part of it is you're just kind of pushing the problem around. So I guess maybe the question is: how often are you doing this? If these transformations of the network that give you the blocks, and the block mappings and block sizes, don't change very rapidly, then is it possible to set this up so that the system is very efficient? You know, once you get it going, inference could be fast. Something like that; my reading may be wrong, I think.
A
That's really where the power of this is. You don't typically see too many fully connected layers in networks today, so you have to ask: when is this actually helpful in a real network? One case is a fully connected feed-forward network. But the most common architecture is CNNs, and there you've already kind of lost that: each neuron doesn't connect to everything else, it connects to a small subset of the neurons, and then you kind of copy that pattern across positions. So that's that case.
A
Because they're recurrent, all of these units project back onto themselves, yeah, and with N units here you have N squared weights, and for gated recurrent networks it's actually even more than that. So recurrent networks are really limited in how large you can make them. And I don't know whether people have looked at convolutional structures for these recurrent weights, but it makes less sense there, and so block sparsity lets you add many more units.
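Rough back-of-the-envelope arithmetic for that point, with made-up numbers rather than anything from the paper: at a fixed recurrent-weight budget, block sparsity lets the hidden state be several times wider.

```python
def dense_params(n):
    # dense recurrent weight matrix: N x N
    return n * n

def block_sparse_params(n, block=32, block_density=0.05):
    # only a fraction of the B x B blocks are kept
    active_blocks = int((n // block) ** 2 * block_density)
    return active_blocks * block * block

print(dense_params(4096))                    # 16,777,216 recurrent weights
print(block_sparse_params(18000, 32, 0.05))  # ~16.2M weights at roughly 4x the hidden width
```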
A
Essentially you're stuck with a fairly low-dimensional hidden state, yeah, so this has been a real limiting factor for recurrent nets like LSTMs. We also have a recurrent structure like this, but our connections are extremely sparse. Okay, so in our default network there are 65,000 cells here, but each active dendrite might only have 20 connections, so.
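For reference, the sparsity level implied by those numbers (assuming every one of the 65,000 cells is a potential presynaptic partner):

```python
cells = 65_000
connections_per_dendrite = 20
density = connections_per_dendrite / cells
print(f"{density:.4%} dense, i.e. ~{1 - density:.2%} sparse")   # ~0.03% dense
```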
A
The number of times you have to pass information through to go from any random unit to any other, so the path length, increases. So there's a trade-off between how many steps information has to take to get across versus the sparsity, and in a small-world network you can balance those two. I think the path length goes like log n.
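As a hedged illustration of that trade-off, a Watts-Strogatz small-world graph keeps the average path length near the classic ln(n)/ln(k) estimate even though each node keeps only about k links (n, k, and p below are arbitrary choices, not values from the discussion):

```python
import math
import networkx as nx

n, k, p = 1024, 20, 0.1                         # nodes, neighbors per node, rewiring prob.
g = nx.connected_watts_strogatz_graph(n, k, p)
print(nx.average_shortest_path_length(g))       # a few hops, despite only ~k links per node
print(math.log(n, k))                           # ln(n)/ln(k) ~ 2.3, the small-world estimate
```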
E
The way I'm visualizing this is: you have essentially a window over the input units and the output units, and then you have a dense or sparse contribution within it. A block is sort of like a chunk of units that all talk to each other through weights, so they're all connected to each other, which would be like a clique in a small-world network. So is that actually what this is, a small-world network?
A
So here they're using a pretty large matrix, twelve thousand by twelve thousand, and a block size of 32, which I think is their sweet spot, and they show that as you go to higher sparsity you get a speed-up. This kind of curve is what you would expect for a perfect speed-up, but it's a little hard to read.
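The "perfect" speed-up referred to here would be runtime scaling exactly with the fraction of nonzero blocks, i.e. 1/(1 - sparsity); the numbers below are just that ideal curve, not measurements from the paper.

```python
# Ideal speed-up if runtime is proportional to the fraction of nonzero blocks.
for sparsity in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"{sparsity:.0%} sparse -> ideal {1 / (1 - sparsity):.0f}x speed-up")
```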
A
I understand, but I think it still has an impact. If you have multiple layers and you want any unit to be able to communicate information to any other node, then it's not one small-world network; you need multiple small-world connections, I think, to increase the probability of that. There could be a cascade effect: if it's really sparse, there might not be any connection from here up to here, but if it's small-world, where the path length is, let's say, three, then you could still get the information across, I think.
F
For cuSPARSE I don't have an exact number, but for us cuSPARSE has been valuable because we're modeling with matrices that are just larger than what can fit in GPU memory in dense form, maybe 50,000 by 50,000 with like 98 or 99 percent sparsity, and that just can't be done with the standard dense format. It's just a memory issue.
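Rough memory arithmetic for the case described, with the shapes treated as assumptions: a dense 50,000 x 50,000 float32 matrix versus storing only the ~2% nonzero entries in a CSR-like format (value plus column index per nonzero, plus row pointers).

```python
n = 50_000
dense_bytes = n * n * 4                            # float32
nnz = int(n * n * 0.02)                            # 98% sparse
csr_bytes = nnz * (4 + 4) + (n + 1) * 8            # 4B value + 4B column index, 8B row pointers
print(f"dense: {dense_bytes / 1e9:.1f} GB")        # ~10 GB
print(f"CSR:   {csr_bytes / 1e9:.1f} GB")          # ~0.4 GB
```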