Description
I just present some observations that will be helpful if you want to dive in deeper someday. Most networks / objective functions can be translated into the language of variational inference, and doing so often provides useful insights. I'll show an example: how Gaussian dropout can be described in this language, and how this tells us something interesting about quantization. (This observation comes from the variational dropout paper: http://papers.nips.cc/paper/5666-variational-dropout-and-the-local-reparameterization-trick)
Oct 18, 2019
What I'm talking about here isn't new to researchers; plenty of people know what I'm going to point out, but it's something that a lot of people just know quietly, with an asterisk. It's a neat fact about variational inference, and it's linked to a lot of the ways we train deep networks. The first half of this isn't my own material: the first part comes from the paper Practical Variational Inference for Neural Networks, by Alex Graves (2011).
Variational inference as a taxonomy for training deep networks. I don't use the word taxonomy very often, but it's useful here. When somebody talks about a taxonomy, the example they'll often use is that the periodic table is a taxonomy for the elements, and the discovery of the periodic table arranged everything in a way where we were able to start looking for the gaps in between.
So you're given a set of inputs and a set of labels, and you're learning from that. You can think of training a neural network as being given a set of inputs and labels, where X is the set of inputs and Y is the labels for them, and inferring a probability distribution over what the weights for that model might be.
If your model outputs probabilities, if what it does is take inputs and say this is a coffee cup with probability a, it's a marker with probability b, and so on, then you can invert this probability distribution to figure out what weights would have been best at giving those classifications. In doing that, you either implicitly or explicitly have some prior over the weights: you have some initial estimate of them, or some notion of which weights are more likely to be correct than others.

So this is where the word regularization comes in, this notion of having the idea that, say, large weights are less probable than small weights, or that networks are sparse; all sorts of priors can be encoded into this term. But the point is, if you can explicitly say what your prior on the weights is, and if your model outputs a probability distribution like this, then you can combine the two into a distribution over the weights given the data. That's the idea.
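As a rough sketch of that relationship, in my own notation rather than anything taken from the slides: the model's output probabilities play the role of the likelihood, the prior over the weights is the regularizer, and the thing being inferred is a posterior over the weights.

```latex
% Posterior over weights w given inputs X and labels Y (notation assumed, not from the talk):
p(w \mid X, Y) \;=\; \frac{p(Y \mid X, w)\, p(w)}{p(Y \mid X)}
% p(Y | X, w) is what the classifier outputs, p(w) is the prior over the weights,
% and the denominator is intractable, which is what motivates variational inference.
```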
That joint distribution is this big thing: it assigns a certain probability to one weight matrix as a whole, and then it assigns a different probability to a different matrix as a whole. It's a big, complicated object that you can't visualize. In variational inference you take some much simpler probability distribution, one that you can visualize, and you tune it. That's the core thing you're doing here, and I just wanted to make this seem simple.
With a neural network, the log likelihood is your error function, so this is very much like training a neural network. And if you also want to minimize the divergence from your prior over the weights, that's a regularization term: how different are the weights I arrived at from my prior distribution over the weights?
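Written out, the objective being described looks like the standard variational free energy (again my notation, not a formula quoted from the talk): an expected error term plus a divergence-from-the-prior term.

```latex
% Variational objective over an approximate posterior q(w) (assumed notation):
\mathcal{L}(q)
  \;=\; \underbrace{-\,\mathbb{E}_{q(w)}\!\big[\log p(Y \mid X, w)\big]}_{\text{expected error (negative log likelihood)}}
  \;+\; \underbrace{D_{\mathrm{KL}}\!\big(q(w)\,\|\,p(w)\big)}_{\text{divergence from the prior (regularizer)}}
```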
It's not like you have actually solved how to classify these objects, because your prior might not be right, or some other part of how you've set up the problem might not be right. So that brings me to the next point: you have two decisions to make when you perform variational inference, two high-level decisions.
I'll define this, since I'm sure a lot of people have seen it: a delta distribution is sort of a cheeky way of saying, hey, you want a probability distribution, I'm going to give you a delta. It's a way of providing a non-distribution; a way of saying, okay, you want a probability distribution, so I'm going to put all of the probability mass on one specific set of weights. It's not a proper distribution.
That's like performing gradient descent on your entire data set; it is how we normally train networks, the basic way of doing it. If you're using negative log likelihood as your loss function, which we usually do, if you're using softmax, that's typically what you do. That's basically saying, well, there's no uncertainty about the weights at all.
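Informally (the KL term to an exact delta isn't finite, so treat this as the usual hand-wavy limit, in my own notation): plugging a delta posterior into the objective above collapses the expected log likelihood to the ordinary loss, and what remains of the prior term is a plain regularizer, which is exactly standard (MAP-style) training.

```latex
% With q(w) = \delta(w - \hat{w}), the objective reduces (up to constants) to
\mathcal{L}(\hat{w}) \;\approx\; -\log p(Y \mid X, \hat{w}) \;-\; \log p(\hat{w}),
% i.e. negative log likelihood plus a weight penalty
% (for a Gaussian prior, the familiar L2 / weight-decay term).
```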
But even in those cases, people tend to initialize the weights with a uniform distribution, so you're initializing your weights with something that doesn't match your prior on the weights, which is kind of bizarre. People still do this by default: whatever the standard initialization scheme is, it's uniform, and that seems kind of bizarre.
Now, briefly, I'll point out that there's a whole other set of methods available to us: you can now put a more complicated distribution in here, and a whole set of other possibilities opens up. The Graves paper about variational inference is kind of where I got all of these ideas; it boiled all of this down onto things we've already been doing.
One of the ways I had thought of dropout was that, by removing certain inputs, it basically made the whole network more robust against failures, in a kind of hard, discrete fashion. I'm not sure I can relate that to this. Yes, and here you achieve that by making the noise distribution Gaussian.
That paper came from one of those authors who does a lot of these papers on variational inference. The second thing it pointed out is that, if you do something like this, where every weight has a mean and a variance, one trick for computing this is to take those variances and, rather than having the weights be noisy, have the units themselves be noisy.
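That is the local reparameterization trick from the variational dropout paper. Here is a minimal sketch of it in NumPy (my own code and naming, not anything from the talk): for independent Gaussian weights, each pre-activation is itself Gaussian, so you can sample the noise on the activations instead of on the weight matrix.

```python
import numpy as np

def noisy_layer(x, w_mean, w_var, rng):
    """Local reparameterization trick (sketch).

    Instead of sampling a noisy weight matrix and multiplying, sample noise
    directly on the pre-activations: with independent Gaussian weights, each
    pre-activation has mean x @ w_mean and variance (x**2) @ w_var.
    """
    act_mean = x @ w_mean            # mean of each pre-activation
    act_var = (x ** 2) @ w_var       # variance of each pre-activation
    eps = rng.standard_normal(act_mean.shape)
    return act_mean + np.sqrt(act_var) * eps

# Tiny usage example with made-up shapes.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32))         # a batch of 8 inputs
w_mean = rng.standard_normal((32, 16))   # per-weight means
w_var = 0.1 * np.ones((32, 16))          # per-weight variances
h = noisy_layer(x, w_mean, w_var, rng)   # (8, 16) noisy pre-activations
```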
Your standard way of training a neural network with dropout added is, in a sense, performing variational inference where every weight has a mean, but it also has a variance, and the variance is held fixed. It's fixed in a special way, though: the weight's variance increases as the weight itself increases.
If the weights are small, they have a lower variance; if they're larger, they can vary over a larger range. I'll just go in order here. So the implication (I'll talk about the prior in about twenty seconds) is this: networks with dropout are roughly equivalent to networks with Gaussian weights, where the standard deviation, or the variance, is proportional to the weight itself, to the mean of the weight.
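To make that concrete, here is a small sketch (my own code, not the speaker's) showing the two views side by side: multiplicative Gaussian dropout on a weight, versus a Gaussian weight whose standard deviation is proportional to its mean. For the same alpha they are the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -2.0, 3.0])   # weight means (made-up values)
alpha = 0.25                          # noise level; for Bernoulli drop rate p, alpha = p / (1 - p)
n = 100_000

# View 1: Gaussian dropout -- multiply each weight by 1 + sqrt(alpha) * eps.
eps = rng.standard_normal((n, theta.size))
w_dropout = theta * (1.0 + np.sqrt(alpha) * eps)

# View 2: Gaussian posterior -- mean theta, standard deviation sqrt(alpha) * |theta|.
w_gauss = theta + np.abs(theta) * np.sqrt(alpha) * rng.standard_normal((n, theta.size))

# Both have mean ~theta and standard deviation ~sqrt(alpha) * |theta|.
print(w_dropout.std(axis=0))
print(w_gauss.std(axis=0))
```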
You're holding fixed one noise constant that is shared across everything. If you choose to do an optimization where you hold this alpha completely constant, then this term here, the divergence term, never changes while you're optimizing; you're only optimizing the other term. And your alpha is essentially choosing how precisely you're encoding your weights.
Because this is set up in this way, where the variance just automatically scales with the mean, that is how you get this property: changing the means, changing the weights, does impact this term, but it doesn't impact that one. That property holds when you have the log-uniform prior, and that's the only distribution for which the property holds.
The conclusion, or I guess what this kind of points at (I don't know the right way to say this), is that if this method that works empirically implicitly assumes this prior, that implies an optimal encoding scheme for the weights. Here I'm talking about how you encode weights, using integers or floating point or whatever: such a scheme is going to distribute its binary codes uniformly across this line, which is a log scale. And, conveniently, floating point does exactly that. That's what floating point does.
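One quick way to check that (my example, not the speaker's): the gap between adjacent floating-point values grows in proportion to the magnitude, so the representable values are spaced roughly evenly on a log scale.

```python
import numpy as np

# Spacing between adjacent float32 values near different magnitudes.
for x in [1e-3, 1.0, 1e3, 1e6]:
    gap = np.spacing(np.float32(x))   # distance to the next representable float32
    print(f"x = {x:>9}: spacing = {gap:.3e}, relative spacing = {gap / x:.3e}")

# The relative spacing stays between about 2**-24 and 2**-23 at every magnitude:
# roughly constant relative precision, i.e. codes spread roughly uniformly in log space.
```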
What floating point is capturing is that you can train networks with different alphas to get different precisions. And, just very briefly, I'll point out that this paper then goes on to full variational dropout, where you're allowed to train all of these parameters: you can learn them all, and you can go even further. You can have a different alpha for every synapse, for every weight, and that's all possible. And the second, later paper, I forgot to...
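As a tiny extension of the earlier sketch (again my own code and naming, not anything from the paper or talk): giving each weight its own learnable noise level just means carrying a per-weight alpha alongside the per-weight mean, with the variance tied to the mean as alpha * theta^2.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal((32, 16))        # per-weight means (learnable)
log_alpha = np.full((32, 16), np.log(0.1))   # per-weight noise levels (also learnable)

def sample_weights(theta, log_alpha, rng):
    """One posterior sample in the variational-dropout style: w ~ N(theta, alpha * theta**2)."""
    alpha = np.exp(log_alpha)
    return theta + np.abs(theta) * np.sqrt(alpha) * rng.standard_normal(theta.shape)

w = sample_weights(theta, log_alpha, rng)    # a sampled weight matrix
```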
I'm just wondering how you would transform this, because what this is arguing is that floating point is essential in training, at least in these classification settings and these applications of quantization. I'm just kind of curious; I'm not claiming anything advanced here, it was just an observation that there might be some transformation of the representation where you change them.
So it does decay to zero at a finite distance, and there are bumps: if you look at these distributions, strong synapses acquire a specific weight and cannot grow beyond that because of homeostatic effects in exactly how that is defined. Inhibitory weights, though, tend to include a lot more of these very strong weights that we can throw away, so I'm thinking there is actually probably a gap between that and the distribution here.
It seems to me that if you were to pick the bits in a floating-point representation randomly, you would get this property. Yeah, but that's not what we do in typical floating-point math, and typically, the way it's used in a deep learning system, it's not exploiting this prior. Oh, so it's not like, just by magically using floating point, we're going to be in this regime; you still have to explicitly take that into account.
Don't you, though? Because we're drawing uniformly between zero and one or whatever it is, we're not actually picking bits in the floating-point representation randomly, and that's what you would need to get this property, right? And when we do the math, we're fighting against this property. Right, so the fact that we're using a floating-point representation does not, on its own, make this any easier. It's...