From YouTube: Quantization in Neural Networks - May 27, 2020
Description
Subutai gives a basic overview of Quantization in Neural Networks, and then reviews the paper “And the Bit Goes Down: Revisiting the Quantization of Neural Networks” by Stock et al., 2020.
http://arxiv.org/abs/1907.05686
I have a couple of really basic things on the background of quantization and why it's important, a little bit on how you even think about quantizing a floating-point number, and then a couple of neural-net-specific things: how do we think about quantizing a neural network, and what are some of the approaches? That'll be a very high-level overview, and then I'll go into one example paper, a state-of-the-art paper that literally just came out very recently. It's a nice paper because it incorporates a lot of different techniques. The idea here is not to give a comprehensive review of lots of different papers; I'll just use that paper as an example. The presentation shouldn't last too long.
Okay, so what is quantization? The dictionary definition is that it's the division of some quantity into a discrete number of small parts. You can think about this example of a sine wave, taken from Wikipedia. It's got lots of real-valued numbers, and you could imagine quantizing it into eight different values, as shown here. The red line is the full-precision real-valued representation, and the blue staircase is what you get if you quantize it into eight different values. Since there are eight different values, you can index into them using three bits, so now you can quantize every point in the sine wave using a three-bit quantity. Okay, so that's an example of quantization.
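To make the sine-wave example concrete, here is a minimal sketch in Python/NumPy of quantizing a signal into eight levels (three bits); the level placement and variable names are my own, not the slide's.

```python
# Quantize a sine wave into 8 levels (3 bits), as in the Wikipedia figure.
import numpy as np

t = np.linspace(0, 2 * np.pi, 1000)
x = np.sin(t)                      # full-precision signal in [-1, 1]

n_bits = 3
n_levels = 2 ** n_bits             # 8 representable values
delta = 2.0 / n_levels             # bucket width over the range [-1, 1]

# Map each sample to an integer index 0..7, then back to a real value.
indices = np.clip(np.floor((x + 1.0) / delta), 0, n_levels - 1).astype(int)
x_quantized = -1.0 + (indices + 0.5) * delta   # center of each bucket

print(indices[:5])      # the 3-bit codes
print(x_quantized[:5])  # the blue "staircase" values
```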
As should be somewhat obvious, accuracy increases exponentially with the number of bits you have. Typically with neural network training we use 32-bit floating point, which has something like four billion possible values, and many hardware implementations actually focus on eight bits for inference, so you only have 256 values in an int8 representation. As you chop off bits, the problem of really representing the full precision of the values gets exponentially harder.
Okay, so why is quantization important? Floating-point operations are expensive and slow on many chips, and integer and binary operations tend to be much faster. Another reason has to do with the overall size of the system and memory usage: quantizing from FP32 to int8 improves the size and speed by a factor of four, and sometimes more, depending on how you quantize. So with quantized weights you can get to much smaller and faster networks. This table is from a nice review paper by Guo.
A
It
shows
kind
of
the
number
of
parameters
and
kind
of
modeled
modern,
neural
networks,
and
these
things
are
increasing
really
really
rapidly
and
Reza
has
sixty
point.
Two
million
weights,
remember
parameters
and
the
number
of
floating-point
operations
that
you
might
use
for
doing
a
single
inference
pass
here
would
be
you
know
about
eleven
billion
in
there,
and
you
can
see
you
know
if
you
look
at
image
net
on
the
right
as
you've
added
more
and
more
parameters.
Okay, some other things: energy usage is also lower for int8 than for floating point, which goes along with being smaller and faster and using fewer resources. Quantized networks can also be a little bit more robust to noise, because you've bucketed the values, so small perturbations are less likely to change the output. That's another reason people have sometimes quantized, even for adversarial robustness: some people have used quantization as a technique to defeat adversarial methods as well.
The issue is that deep networks really rely on high precision for their training, and often for their inference, and so the question is: how do you best quantize all these millions of numbers in a deep network to 8 bits, or sometimes lower, with minimal impact on the error rate?
The most obvious approach is uniform quantization: you take the number line and chop it up into a discrete set of buckets, and with each bucket you associate a real-valued number. Here delta is the size of the bucket; our scalar encoder used to do this kind of quantization. You can go back to the real number from any bucket: you take the index associated with the bucket, multiply it by some scaling factor, and add a bias, where the bias here would be half the bucket width, and that gives you the number associated with the bucket. Of course, you can't tell the difference between numbers within a single bucket. Rounding is one example of a way to do this, but there are many different ways.
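Here is a minimal sketch of uniform quantization as just described, with an explicit scale factor (the bucket width delta) and a half-bucket bias; the fixed range and the function names are illustrative assumptions.

```python
# Uniform quantization over a fixed range [x_min, x_max].
import numpy as np

def quantize_uniform(x, x_min, x_max, n_bits=8):
    n_levels = 2 ** n_bits
    delta = (x_max - x_min) / n_levels            # bucket width
    q = np.clip(np.floor((x - x_min) / delta), 0, n_levels - 1)
    return q.astype(np.int32), delta

def dequantize_uniform(q, x_min, delta):
    # Index times the scale factor (delta), plus a bias of half a bucket
    # width, recovers the representative value of each bucket.
    return x_min + (q + 0.5) * delta

q, delta = quantize_uniform(np.array([0.03, -0.41, 0.99]), -1.0, 1.0)
x_hat = dequantize_uniform(q, -1.0, delta)
```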
A more powerful method is non-uniform quantization. The problem with uniform quantization is that the numbers you care about may not be uniform on the number line: there may be many more numbers near zero, say, and many fewer away from zero. Non-uniform quantization lets you have buckets of different sizes, and with each bucket you associate a canonical number, so it's like a clustering technique. To do this you usually have a codebook that maps some index k_i to the actual real value v_i associated with it. Clustering is one approach, chosen so that each region has roughly the same number of values: if you had a lot of values near zero, you would have more buckets near zero and fewer outside. So that's one approach to this.
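A minimal sketch of that clustering idea, assuming scikit-learn's k-means as the clustering algorithm and a Laplacian-like distribution as stand-in data; the codebook then holds the canonical value for each bucket.

```python
# Non-uniform (clustered) quantization of scalar values with k-means.
import numpy as np
from sklearn.cluster import KMeans

values = np.random.laplace(loc=0.0, scale=0.1, size=10_000)  # weight-like data

k = 8                                             # a 3-bit codebook
km = KMeans(n_clusters=k, n_init=10).fit(values.reshape(-1, 1))

codebook = km.cluster_centers_.ravel()            # canonical value per bucket
indices = km.predict(values.reshape(-1, 1))       # 3-bit index per value
values_hat = codebook[indices]                    # dequantized values
```

Because k-means places more centers where the data is dense, values near zero get more, smaller buckets, exactly the behavior described above.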
A pretty powerful version of this is to not think about one number in isolation, but to look at a vector of numbers. Weight matrices are not single numbers; they're multi-dimensional. If you look at a vector of weights, you can do vector quantization, which is clustering in a high-dimensional space.
You still have an integer number of codebook entries, but each integer now points to a vector of values in some high-dimensional space, and this is the much more common technique used in deep learning. In these approaches, the overall size of the codebook determines the number of bits you need: if your codebook has 256 entries, you need 8 bits for each code, each quantized number, and you can choose the size of the codebook depending on how many bits you have.
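As a sketch of vector quantization on weights, assuming scikit-learn's k-means and random stand-in data: with a 256-entry codebook, each d-dimensional weight vector is stored as a single 8-bit index.

```python
# Vector quantization: cluster d-dimensional weight vectors, not scalars.
import numpy as np
from sklearn.cluster import KMeans

d = 9                                  # e.g. a flattened 3x3 kernel
weights = np.random.randn(5000, d)     # stand-in for a layer's weight vectors

k = 256                                # 2^8 entries -> 8 bits per vector
km = KMeans(n_clusters=k, n_init=4).fit(weights)

codebook = km.cluster_centers_         # (256, d) table of vectors
codes = km.predict(weights)            # one 8-bit index per weight vector
weights_hat = codebook[codes]          # reconstructed weights
```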
Okay, so how do you quantize a neural network? There are a bunch of things you could imagine quantizing. There's the input data coming in; that has to be quantized. There are all the weights and parameters in the network; those have to be quantized. There are the activation values, the actual dynamic values flowing through the network.
Those have to be quantized, and if you're going to do training, then the backpropagation error gradients also have to be quantized. In the literature, weight quantization is by far the most common, and most papers actually ignore the other pieces. I think they're just concerned with compression, with how you compress the network into the smallest network, and they don't care about some of these other things. Like I said, if you're interested in training, then the gradients have to be quantized as well.
One minor detail: we use batch norm a lot, and for inference, batch norm is usually folded back into the weights. It's just a linear operation on the layer's output, so you can fold it back into the weights, and you don't have to worry about quantizing the batch norm for inference.
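For reference, here is a minimal sketch of that folding step, using the standard batch-norm statistics; the function name and the linear-layer shapes are my own assumptions (the conv case folds per output channel the same way).

```python
# Fold batch norm into the preceding layer's weights for inference.
import numpy as np

def fold_batchnorm(W, b, gamma, beta, running_mean, running_var, eps=1e-5):
    # BN(y) = gamma * (y - mean) / sqrt(var + eps) + beta, with y = W x + b.
    # Since BN is linear, it can be absorbed into W and b directly.
    scale = gamma / np.sqrt(running_var + eps)     # one factor per output unit
    W_folded = W * scale[:, None]
    b_folded = (b - running_mean) * scale + beta
    return W_folded, b_folded
```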
Okay, so there are many, many ways to quantize a neural network.
Architectures now support mixed-precision quantization, so different layers can have completely different precisions, and this can be exploited. A lot of these techniques do pruning first: if you can set some weights to zero, that's the easiest form of quantization, and it reduces the size of the network. This is something we've done quite a bit of as well. There's clustering weights using vector quantization, and a lot of techniques interleave training and fine-tuning with quantization; it's not a one-step thing.
You treat it as part of the core loop. There are a couple of papers that have done things similar to what Marcus has discussed before, adding or simulating noise during training, and there are variational techniques, although I don't think this is a very well-explored space. There are some really sophisticated ones; I saw one that uses reinforcement learning.
Basically, it proposes some quantization, runs it on an actual hardware platform, looks at the result, and tries to build a predictive model of how a particular quantization technique will actually work on a given hardware architecture. Different hardware architectures have very different limitations and characteristics, so this was a pretty powerful technique. You can see there are tons of different things you can do, and this table gives a nice breakdown.
This is from the Guo paper. It splits things up into deterministic versus stochastic quantization, and in the deterministic category, rounding is the most basic way; it's one form of uniform quantization. There's vector quantization, and an interesting one is quantization as optimization, where you treat quantization as part of the overall optimization problem.
Okay, this slide I thought was kind of fun; this is a nice picture from work on something called deep compression. It puts into a single picture a lot of the different techniques that have been tried, and most papers seem to follow some aspect of this. The idea would be the following.
You either start with an untrained network or a fully trained network, and then you go through some sort of pruning step. This could be an iterative step that involves training as part of it, and you prune the connections down to some much smaller set. Then you do the hardcore quantization step, which is the middle piece here. The basic idea, and this is very common, is that you cluster the weights using some clustering algorithm, generate a codebook, and then go through an iterative process.
It's a k-means process where you constantly revise: once you've generated a codebook, you re-quantize the weights with the codebook, then you update the codebook itself, and you do this as a loop, so you're updating the codebook and the weights simultaneously. What they don't show here is that often there's retraining of the entire network done as part of this too.
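A minimal sketch of one iteration of that loop for a scalar codebook; this is the generic k-means alternation, not the deep-compression code itself.

```python
# One alternation of the codebook/weights loop described above.
import numpy as np

def codebook_iteration(weights, codebook):
    # Assignment step: re-quantize each weight to its nearest entry.
    dists = np.abs(weights[:, None] - codebook[None, :])
    assign = dists.argmin(axis=1)
    # Update step: move each entry to the mean of its assigned weights.
    for j in range(len(codebook)):
        members = weights[assign == j]
        if len(members) > 0:
            codebook[j] = members.mean()
    return assign, codebook

# In deep compression this loop is interleaved with retraining the
# shared weights, which the slide does not show.
```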
So the middle piece is the core quantization piece, and then, if you're interested in compression, there's some sort of encoding process at the end. The idea throughout is to retain as much of the original accuracy as possible. Don't worry about the reduction factors they show here; this is for a different network scheme.
A lot of the literature is just on compression, which is a little surprising to me, because that's only one of the tasks. They almost all make the point that running on hardware architectures is also super important, of course, but very few do anything more than quantize the weights, which is kind of interesting.
Yeah, so the technique I'll talk about doesn't have that, but a lot of the techniques actually go to binary or even ternary values, maybe minus one, zero, and one. That's the extreme end of it, and in our HTM work in the past we've always used binary synapses, and binary activations as well.
Yeah, that's right. Michele Covell talked about that in her paper, and they spent quite a bit of effort figuring out the right quantization for the activation function. If you're dealing with ReLU it doesn't matter too much, but if you're dealing with tanh or some other activation functions, it's important.
I just want to quickly mention a very quick anecdote: three bits is not terrible. It used to be that the CDC 172 supercomputer at Oregon State University had a console with a three-bit DAC, and so the test program to boot the machine included a three-bit recording of Merle Haggard, and it sounded like Merle Haggard.
Yeah, it's interesting. So much of deep learning treats the system as a black box: you just use PyTorch or TensorFlow and run on these big GPUs, and this whole aspect is quite ignored and forgotten. But for getting things running in practice, it's pretty important.
Okay, these are some figures from Michele Covell and Baluja's paper. This shows the histogram of weight values; I think this case is MNIST, but this applies to a lot of different networks. The x-axis is different buckets of weight values and the y-axis is the frequency, and you can see it's pretty non-uniform, and the distribution changes as a function of training. In these pictures it looks Gaussian, but I think they said it's closer to a Laplacian distribution. So this is a typical distribution.
Unfortunately, non-uniform quantization methods, as best as I can tell, are not well supported in hardware, and some of you may know better than I do. You need to store this codebook, which can clearly be different for different layers and different parts, and hardware architectures don't always have good support for that.
Okay, so I just want to focus on one example paper, and then I'll be done. This came out pretty recently and I thought it was a nice paper; there are some nice ideas in there. If you read the paper, there are also a lot of different techniques they use along the way to get their results, so I think it's a nice paper that encompasses a lot of different techniques, and they have the PyTorch code available as well.
This came out of Facebook. The core idea is the following: they're going to do non-uniform clustering of weights, but the clustering method is going to emphasize classification error, not closeness to the original weights. The clustering technique I mentioned earlier with vector quantization just looks to have a roughly uniform distribution of weight values assigned to the clusters, and it's trying to minimize the reconstruction error of the weights.
So if you were to go through the clustering, figure out what the quantized weight values are, and compare them against the original weight values, you'd be trying to minimize that difference. They say: well, that's not really what we care about. We care about the end accuracy of the network, the performance of the network. This figure walks through the idea nicely.
What it shows is in-domain inputs and out-of-domain inputs. Say you're trying to classify dogs versus cats: this big gray region here contains all the possible dogs and cats. Of course, it's a very high-dimensional space and it won't actually look like this, but this is conceptual. Outside of the dotted region are all the other possible images, most of which will be completely random noise.
But all we really care about is the subspace of cats versus dogs. So look at the original network, the side here with the gray boundary. The gray boundary classifies cats versus dogs; it's a complex nonlinear thing, and below it are all the dogs and above it are all the cats, but it also takes values outside of this dotted region.
Standard quantization tries to preserve the accuracy of the weights themselves, so most of what it's trying to do is match what's happening outside of this region, because most of the volume is outside of the region, and it effectively underweights what's happening inside. It might end up that if you just try to match the shape of this curve, you match it very closely outside, but inside the region you don't match it well, and because of that you'll misclassify.
Say you misclassify this husky dog or this cat, even though, if you step back and look at the entire space, it's modeling the decision region pretty well; the in-domain region is actually a very small part of it. So instead, what they're trying to do is what this green boundary shows: they don't care what's happening outside of the region; they're just trying to model the decision boundary within the in-domain region, only within the subset that you care about.
So they're going to specifically look at the reconstruction error for in-domain inputs, and that's the essence of their idea. Then they're going to fine-tune the weights and the codebook after quantization, and it turns out they also use knowledge distillation, which Lucas has been looking into, and this ends up being pretty critical in the fine-tuning step. So that's the basic idea of their technique.
You could, and it would help this problem, but it's really hard to properly cover the out-of-domain stuff, because remember, these are million-dimensional spaces, so the volume of the space is huge. It's really hard to exhaustively characterize the stuff that's not in the domain. Okay, so here's their clustering method. They have a couple of tricks that they use.
If you look at the convolutional filters here, you have your input features, and each input feature has a K-by-K kernel, say 3 by 3, and you have C_out of these filters. What they do is split each weight vector into smaller sub-vectors to make it easier to cluster. Typical vector quantization might take this entire volume as a single vector and cluster that, but they split it into smaller K-by-K-sized sub-vectors.
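A minimal sketch of that splitting step, with made-up layer sizes; a conv weight of shape (C_out, C_in, K, K) becomes C_out * C_in sub-vectors of length K*K, which are then clustered.

```python
# Split a conv weight volume into K*K sub-vectors for clustering.
import numpy as np

C_out, C_in, K = 64, 32, 3
W = np.random.randn(C_out, C_in, K, K)        # stand-in for a conv layer

subvectors = W.reshape(C_out * C_in, K * K)   # one 9-dim vector per kernel
print(subvectors.shape)                       # (2048, 9)
```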
I should say they're going to do this layer by layer. For a given layer, they take the weight volume and create these little sub-vectors, and then they do this clustering technique. The first step is to assign each of these weight sub-vectors to a cluster, and the clusters might be randomly initialized at the beginning, or initialized using some random subset of the weights themselves.
Then what they do is pick some subset of the training set, X. They look at the difference between the true weight sub-vector and each of the codebook entries, they project that difference using the entries of that training set, and they try to minimize that error. So they try to pick the codebook entries that minimize that projected error, and I should mention what this X here is.
Yeah, they start with the lowest layer and move to successive layers. So imagine this is some layer in the middle of the network: you've already quantized the stuff below you, and now you're looking at the activations coming in from some subset of the training set, and you're trying to reproduce those input samples as faithfully as possible in this projection. Okay, so that's how they assign a codebook entry to each weight sub-vector through this projection. And then they do this other step.
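A minimal sketch of that assignment step as I read it: each codebook entry is scored by how well it reproduces the sub-vector's output on the sampled input activations, rather than by its plain distance to the sub-vector. The shapes and names here are illustrative assumptions.

```python
# Assign a weight sub-vector to the codebook entry that minimizes the
# in-domain reconstruction error ||x (w - c)||^2, not ||w - c||^2.
import numpy as np

def assign_entry(w, codebook, x):
    # w: (d,) weight sub-vector; codebook: (k, d) entries; x: (n, d)
    # activations sampled from the (already quantized) layers below.
    errors = ((x @ (w[None, :] - codebook).T) ** 2).sum(axis=0)
    return errors.argmin()        # index of the best entry in-domain
```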
They do it in the following way: each layer is basically retrained after quantization. Here they keep the assignment of the weight sub-vectors to the clusters fixed, and they fine-tune the codebook entries. They run the network through the training set all the way to the top, compute the error, backpropagate it, compute the average gradient coming into each cluster, and update the clusters using SGD.
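A minimal sketch of that update, assuming the per-weight gradients have already been computed by backpropagation; names and shapes are illustrative, not the paper's code.

```python
# Update each codebook entry with the average gradient of the
# sub-vectors assigned to it, keeping the assignments fixed.
import numpy as np

def update_codebook(codebook, weight_grads, assign, lr=1e-3):
    # weight_grads: (n, d) gradients of the quantized sub-vectors;
    # assign: (n,) codebook index of each sub-vector.
    for j in range(len(codebook)):
        mask = assign == j
        if mask.any():
            codebook[j] -= lr * weight_grads[mask].mean(axis=0)
    return codebook
```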
I think they do this for an epoch or two after the previous quantization step, and they found that if they did this using knowledge distillation, with the uncompressed network as the teacher, it gave them significantly better results than if they just used the training set to do the fine-tuning. So they do this for each layer.
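A minimal sketch of a standard distillation loss of the kind described, in PyTorch, with the uncompressed network as teacher; the temperature and the exact form the paper uses are assumptions on my part.

```python
# Hinton-style distillation loss: the quantized (student) network is
# trained to match the softened outputs of the uncompressed teacher.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=1.0):
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
```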
The way they show their results is based on compression factor, how much compression they get, and they have different block sizes for the size of the codebook entries. They show a bunch of different things, and basically their technique, at least in this compression range, seems to do better than any other published technique by a pretty large margin. As an example, you can see here that K is the number of codebook entries.
If you look at K = 1024, you would need ten bits to represent each codebook index, plus the 1024 codebook entries themselves. Using that, they can compute the overall size of the network, which gives them a compression factor, and at that compression factor they are several percentage points above some of the best existing work. I think when they get to smaller compression factors they seem to be close to the state of the art.
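As a back-of-the-envelope version of that size computation, under assumptions I'm making explicit (fp32 original weights, fp16 codebook storage, log2(K) bits per index):

```python
# Rough compression factor for a vector-quantized layer.
import math

def compression_factor(n_subvectors, d, K, codebook_bits=16):
    original = n_subvectors * d * 32          # fp32 weights, in bits
    indices = n_subvectors * math.log2(K)     # e.g. 10 bits each if K = 1024
    codebook = K * d * codebook_bits          # the codebook must be stored too
    return original / (indices + codebook)

print(compression_factor(n_subvectors=2048 * 32, d=9, K=1024))
```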
Maybe they're a little bit worse; it's hard to tell from this graph. But at higher compression factors they hold up the accuracy quite a bit better. This particular plot is without knowledge distillation; here's another one with knowledge distillation, and there they're showing that for ResNet-50 they can maintain accuracies up to about 76 or 77 percent (77.8 percent) with pretty decent compression.
Here are two other papers that do a nice review of quantization methods. The Guo one is more sweeping, and I haven't really looked at the Krishnamoorthi one in detail yet, but it has a lot of very specific techniques. It's a little more concrete, so that might be a nice one to look at as well.