From YouTube: 12 - Generative Models - Emily Denton
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
Yeah, so generative modeling is a really massive field. I'm going to try to fit a lot of different content into this hour and a half. My strategy is to give you some good intuitions, give you a big-picture look at the different frameworks people have come up with for doing generative modeling, and then also provide a lot of links to papers and other tutorials and references. So, just to start out.
The basic thing that maybe you've learned so far is supervised learning, where you have a data set that contains some inputs X and some labels Y, and you want to learn a function mapping X to Y. Another way of framing this is learning a probability distribution over the labels conditioned on some input X. So this is the traditional setup. Unsupervised learning, in contrast, really just looks at the data, so we typically don't have any labels.
The goal is just to uncover some kind of hidden structure in the data. This goal is super vague, and that's on purpose: the end goal of unsupervised learning is different depending on what you're trying to do, but the basic idea is to try to understand some kind of structure in your data. Generative models fit under this overall paradigm of unsupervised learning.
So today we're going to be focusing mostly on parametric generative models. There are a lot of different types of nonparametric generative models; for example, in the image domain, you have models that might copy patches from training images and synthesize these in different ways. We're not going to be looking at those at all.
Instead, we're going to be looking at classes of models that are parameterized by some function, where we want to estimate the parameters from some data set. The basic setup is: we have some data set, in this case a data set of faces, and we have some kind of prior knowledge that we're going to inject. This could be as little as saying what our function class is.
We could also add additional structure to our network, whatever we want, and then we're going to learn in some way, which I'll go through in a bit. Basically, the goal of learning can be framed in a couple of different ways. One thing we might want is for samples from our data set to have high likelihood under the model that we've learned; this is kind of a density estimation framework. We also might want samples synthesized from our model to reflect the structure of the data distribution. Obviously these two things are intimately connected, but sometimes we'll actually end up optimizing more for one or the other.
So just to motivate this a little bit: why do we care about generative models? There are a lot of different reasons. First, generative models are really good at helping to uncover hidden structure in data sets. This is an example from a paper that came out a couple of years ago called InfoGAN, where they learned a generative model.
In this case, they trained the generative model on these little chair images, but in a purely unsupervised way the model was able to uncover really high-level factors like the chair pose, the chair structure, and the chair width. This image here just shows synthesized images that vary along these different factors of variation. There are also a lot of really cool image editing applications that have emerged from generative modeling in recent years. This is an example of super-resolution.
On the right here we have — can you see my cursor? There we go — here we have the original image, and then this is the super-res version produced by the network. Generative models can also be used to synthesize predictions of the future given some data about the past. Here at the top I'm showing some predicted future frames of a video of a robot arm that's pushing around different objects on a table. Models that can perform this kind of future prediction are really useful when building agents that need to reason about the effects of their actions in the world.
People have also used these types of models to help with exploration: you can imagine asking, do I have a good idea of what's going to happen when I perform this action? If not, maybe I should explore this part of the space a little better. Density modeling is also another way of performing outlier detection. Here you can imagine that if you have a generative model of this data set of street images and you get some new image, you can look at the likelihood of that image under the model you've learned and try to understand whether this is a really likely event or a really unlikely event, and then use that in some other downstream application.
Generative models are also really useful tools for artists. This is some really cool stuff that came out of a Google team called Magenta: it's basically a music synthesis model, and they've turned it into a whole bunch of different tools that artists can use. It's really fun and cool, so play with that if you want. Okay, so now I'm just going to go into some background material that will come up throughout the course of this talk.
So, KL divergence is a measure of how far apart two distributions are. It's not a proper distance metric — it's not symmetric — but it does measure the difference between two distributions, and there is also the reverse KL. You can see at the top here we have KL of P and Q, and on the bottom we have KL of Q and P. These are two different ways of measuring distances between distributions, and I bring it up because the different distances emphasize different things.
With the reverse KL we might end up with higher-quality samples if we sample from the model, because we wouldn't end up sampling from regions where there is no data, but we may drop a mode. I'm just throwing this out here because it will come up later, and it's a good intuition to have. Then the Jensen-Shannon divergence is a third divergence that will come up; this one is a symmetric distance metric. It also tends to emphasize not putting any density where there is no data, at the expense of dropping modes.
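Just to make the forward/reverse distinction concrete, here's a tiny sketch (not from the talk) with two discrete distributions — a bimodal p and a single-mode q — where the forward KL punishes q for dropping one of p's modes much more than the reverse KL does:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions; only terms where p > 0 contribute."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# A two-mode p, and a q that only covers one of the modes.
p = np.array([0.49, 0.49, 0.02])
q = np.array([0.02, 0.95, 0.03])

print(kl(p, q))  # forward KL: large, penalizes q for missing one of p's modes
print(kl(q, p))  # reverse KL: smaller, q is content to sit on a single mode of p
m = 0.5 * (p + q)
print(0.5 * kl(p, m) + 0.5 * kl(q, m))  # Jensen-Shannon: symmetric in p and q
```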
Okay, so another thing that's going to come up is this difference between explicit and implicit models. Likelihood-based methods, also known as prescribed probabilistic models, typically provide an explicit parameterization of a log-likelihood function, and parameter estimation here proceeds in a very standard way that I'm sure you've learned earlier in the summer school.
We just do maximum likelihood estimation: we estimate the parameters that maximize the likelihood of the data we have access to through our training set, under the model that we've defined. In contrast, implicit probabilistic models don't need to define an explicit likelihood function. Instead, they just define a sampling procedure, and the intuition is that we're going to learn this generative distribution by comparing samples from our generated distribution with the training distribution we have access to.
Another concept I want to introduce is the idea of latent variables. At an intuitive level, latent variables can be thought of as explaining the structure in a given data instance by some latent variable Z. Throughout all of this I'm going to use X to refer to the data; this is referred to as observed because we have access to it at training time, and Z is going to refer to these latent variables.
These are the unobserved factors that are causally related to the things in the world. The idea is that they describe the underlying factors of variation in your data set. So if you have a data set of faces, as described here, the factors of variation might be things like the lighting conditions, the pose of the face, the hairstyle, the identity of the person, things like that. The idea is that these latent variables are going to concisely represent those different factors of variation in your data set.
We have some prior distribution over the latent variable Z. Typically this prior distribution is going to be defined as something tractable; a really common choice is just a diagonal Gaussian. There are lots of more sophisticated choices you could make, but in most of the examples I work through today we're just going to assume a simple Gaussian. And then there's this P_theta of x given Z.
This is typically referred to as an observation model, and again it's typically taken to be something tractable, easy to compute and easy to sample from. In all of the cases we're going to look at here, it's going to be parameterized by a deep neural network. This sampling procedure, where you first sample a latent code from the prior distribution and then sample a data instance from the observation model conditioned on that latent variable, is called ancestral sampling.
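As a minimal sketch of ancestral sampling under these assumptions (standard Gaussian prior, a small hypothetical decoder network parameterizing the mean of a Gaussian observation model with a fixed variance):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 784

# Hypothetical observation model: a decoder mapping z to the mean of p(x | z).
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))

def ancestral_sample(n, sigma_x=0.1):
    z = torch.randn(n, latent_dim)               # z ~ p(z) = N(0, I), the simple prior
    mu_x = decoder(z)                            # parameters of the observation model p(x | z)
    x = mu_x + sigma_x * torch.randn_like(mu_x)  # x ~ N(mu_x, sigma_x^2 I)
    return x

samples = ancestral_sample(8)
```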
Don't worry if this is all going really fast; we're going to come back to it throughout, I just want to plant the seeds. Another thing we might think about doing with generative models is actually trying to infer, given some data instance, what the latent variables are that caused that data instance. This distribution here, P of Z given X, is typically referred to as the posterior distribution over the latent variables.
It's often really hard to do this exactly, as we'll see throughout our models, but it can be useful — for example, if you learn a generative model and then you want to actually use the latent space you learned for some downstream task, maybe a discriminative task, maybe some kind of clustering task, and so forth. Cool. So that's all the background stuff we need to know. Now, the basic idea: we have a data set.
We have a parametric model P_theta, and there are a couple of questions we might ask: how are we actually going to represent this probability — what function class are we going to use — and how are we going to actually learn the parameters? And again, as I mentioned earlier, there are a lot of different goals we might have moving forward. We may want samples from the data set to have high likelihood under the model that we've learned.
We might want samples from our model to reflect the structure of the data distribution — and this notion of reflecting the structure and being good-quality samples is often hard to quantify. We also might care about representation learning: often generative models are learned with the goal of actually learning a nice, clean latent space that can be useful for something else, so that's something we might want to think about when we're training the model. Okay.
So this is my attempt to summarize the different types of models. I've broken it up into explicit density models and implicit density models: the ones with an explicit density we're going to learn through maximum likelihood estimation, and the implicit density models we're going to learn by basically comparing samples from the data set with samples from our generated distribution. Within the explicit density models there are two classes: ones that define a tractable likelihood.
This means that the log-likelihood of the data under our model is something we can optimize exactly, and there are a couple of different examples here. Non-tractable models are ones where we have a likelihood function but we can't optimize it exactly, so we're going to rely on some kind of approximation. Of these models, the ones I'm going to cover the most are autoregressive models, flow-based models, variational autoencoders, generative adversarial networks, and moment matching networks.
I chose these because they represent, for the most part, the state of the art in generative modeling, and you'll see them a lot. Cool, so I'm going to start with variational autoencoders. High-level summary of variational autoencoders: these are directed latent variable models; they rely on likelihood-based learning; the exact likelihood is intractable, but we're going to derive a lower bound on the likelihood; and there's an efficient ancestral sampling procedure.
Again, this means we sample Z from our prior and then X given Z, and there's an approximate inference scheme. Pictorially, what this looks like: this is the generative process, the standard ancestral sampling, and the likelihood is intractable, so we're going to optimize a bound instead.
What this means is we're going to have some network mu, which is going to take in a latent code and produce a mean. This mean is going to have the same dimensionality as our data instances, and it specifies the parameters of our observation model, which in this case is a conditional Gaussian. You could also imagine learning the covariance matrix, or just its diagonal.
In this context, when people are modeling images they typically just take the covariance as fixed, but you could also learn it. A key thing that variational autoencoders do is introduce what's known as an approximate posterior. This Q_phi is going to be an estimate of the true posterior over the latent variables given some image, or given some input X, and again this is going to be represented as a conditional Gaussian.
So what this looks like is: we have some data instance, we pass it through this network, and that's going to produce this mu and sigma, which are the mean and diagonal covariance of our conditional Gaussian, and that specifies the approximate posterior.
In the next couple of slides I'm going to derive a bunch of math and then go back and give some good intuition, because when I first learned variational autoencoders it felt like too much math and I didn't know what it was doing. But when you actually understand how it all fits together, it's a very, very simple framework that is applicable super widely, so bear with me as I go through this.
Okay, so we have our log-likelihood. We have these latent variables Z, so we can express the likelihood as an integral over the product of the prior and the conditional distribution of the data given the latent variable. Now, all I've done here is multiply in something that equals one — again, this is our approximate posterior. Next I'm just replacing the integral over the latent variable Z with an expectation, which is just the definition of what an expectation is. Jensen's inequality lets us pull the log inside the expectation, so now we have an inequality, and then I'm really just rearranging terms. This term on the right here is just the KL divergence between the approximate posterior and the prior distribution that we have, and this bound is what we refer to as the evidence lower bound, or the variational lower bound. Okay.
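For reference, a cleaned-up version of the derivation just described, using the notation from the talk (prior p(z), observation model p_theta(x | z), approximate posterior q_phi(z | x)):

```latex
\begin{aligned}
\log p_\theta(x)
 &= \log \int p(z)\, p_\theta(x \mid z)\, dz
  = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p(z)\, p_\theta(x \mid z)}{q_\phi(z \mid x)}\right] \\
 &\ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
   - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
  \;=\; \mathcal{L}(\theta, \phi; x) \quad \text{(the ELBO)}
\end{aligned}
```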
This is just an expectation of a constant, so it turns into log p of x, and then this here is actually the KL divergence between Q and P. Just rewriting that, we now have the data likelihood and the KL divergence between our approximation to the posterior and the true posterior, and if we move the likelihood term to the other side, we see this bound again. So this is the thing we're going to be optimizing.
This is the thing we want to optimize, and the two are equal when our approximate posterior equals the true posterior. Cool. So now I'm going to flip it around and look at how this is implemented with neural networks, which I personally find a little more intuitive. Variational autoencoders are named that way because of their resemblance to traditional autoencoders.
Just to remind you — I hope autoencoders were covered already — the basic idea with an autoencoder is that you have some data instance, you encode it to some typically lower-dimensional space, and then you reconstruct it, and you typically have some kind of reconstruction error, for example mean squared error in your input space. Variational autoencoders look very, very similar to this, and we can also think of them as a stochastic and regularized version of an autoencoder.
So here we have the observation model. Again, we have this prior distribution, which we sample from; we're going to have some network, which will be our decoder network, and it's going to produce the mean of a Gaussian distribution, and then we can sample from that distribution. Then there's our recognition model, which we're also going to refer to as the encoder model: here we have some data sample, we have an encoder which produces the mean and diagonal variance of a conditional Gaussian distribution, and then we can sample from that. So, just to write this out in a way that maps it back to autoencoders:
We have our input, we get a mean and variance, and this defines our approximate posterior distribution. We can take a latent code — this could either be sampled from our approximate posterior or sampled from the prior — we decode it, and then we get our distribution over data instances. So again, recall this likelihood term: we had this reconstruction term and this prior term, and I'm just going to go through what each of those looks like.
[Audience question] Yeah, so in this example here it's fixed. You could learn it the same way that you learn this one — and I'll get to how you actually learn this one — but in this case I'm just going to treat it as fixed. I'm doing this because I mostly work with images, and when people work with image datasets they just kind of don't bother with this. Yeah, exactly.
Also, I think the intuition here is nice and easy if this is fixed, because then the likelihood of X under this model is a Gaussian, so everything else falls away and you just have this mean squared error. It's a nice intuitive mapping onto autoencoders, but you could totally learn it, and I'll describe how you would learn it — that carries over. So yeah, here I'm saying this reduces to mean squared error.
So, optimizing this. Okay. This term here is really easy to compute when everything is Gaussian, but actually optimizing this expectation is a little tricky, so a cool trick was proposed a couple of years ago, referred to as the reparameterization trick. The idea is to rewrite the random variable Z as a deterministic function of another random variable, which in this case I'll call epsilon.
This is a generic example of what it would look like. In the Gaussian case we can rewrite our random variable as the mean — the mean output by our network — plus this diagonal covariance term multiplied by a zero-mean, unit-variance Gaussian variable. This means that the expectation we want to optimize can actually be rewritten in terms of this epsilon variable. So now this reconstruction term is really easy to optimize, and for the expectation here we just take Monte Carlo samples. In practice people often just use one sample, but there are lots of extensions that look at optimizing this better than that; for now let's just assume one sample.
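A minimal sketch of the reparameterization just described, assuming the encoder outputs a mean and a log-variance (a common convention, not something specified in the talk):

```python
import torch

def reparameterized_sample(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I), so gradients flow through mu and sigma."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```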
Okay, so now going back to this autoencoder framework, the two different pieces of the loss term look like this. Here we have our KL loss: we have our encoder, we output this mean and variance, and this KL term, if our prior is Gaussian, can just be computed analytically — we can get the gradients directly.
This is also easy for a larger family than just the Gaussian distribution, so that part is very easy and simple. The second part of our loss function is the reconstruction term, and again this essentially boils down to an L2 loss here. So a nice intuition is that you have this basic autoencoding framework, plus an additional regularization term.
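Putting the two pieces together, here's a rough sketch of the per-batch VAE loss under the assumptions used in the talk (standard Gaussian prior, diagonal-Gaussian approximate posterior, fixed observation variance so the reconstruction term reduces to a mean squared error); the encoder and decoder modules are placeholders:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder):
    mu, log_var = encoder(x)                                    # parameters of q_phi(z | x)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)    # one reparameterized sample
    x_recon = decoder(z)                                        # mean of p_theta(x | z)

    recon = F.mse_loss(x_recon, x, reduction="sum")             # reconstruction term (fixed variance)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())  # KL(q || N(0, I)), analytic
    return recon + kl
```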
There are a lot of different extensions of variational autoencoder models; I'm just going to stick a couple of references here. We have some sequential models: these all do basically the same thing, and they differ mostly in the way they represent the prior. In all the examples I was working through so far we had a fixed Gaussian prior. If you have time-series data, you can imagine, instead of having a fixed Gaussian prior, actually learning the prior at each time step and having it depend on either previous data instances or previous states of your RNN, things like that. There are a lot of different applications here: modeling speech and handwriting and natural language, music generation — this is another cool thing that came out of Magenta.
If you want to do nice, controlled generation of images, or of whatever data instances you're working with, that's a cool framework for it. The VQ-VAE is a vector-quantized VAE; it extends the basic VAE framework to discrete latent codes. This is a really nice generative model. It gets pretty good image synthesis results — I'll say more about what an autoregressive decoder is later, but they use a very powerful decoder while also learning a latent space — and these are just some examples of images synthesized by this model.
A lot of this talk is going to be me showing you images, because most generative models are applied in the image domain. This model has also been used to generate video frames, which is a nice example of using a generative model for a cool downstream task: they learned this model, and then once they had this nice latent space, they trained a sequential model in that latent space, as opposed to training the generative model in pixel space. Then they can synthesize these latent codes and use the decoder that maps down to images.
That's a cool thing. Okay, so I'm going to go into a different type of generative model now: autoregressive models. High-level summary: autoregressive models are fully observed models, in contrast to latent variable models, and I'll get a little more into what that means in a second. It's a likelihood-based learning method, but in contrast to VAEs, where we had a likelihood function we couldn't optimize directly and so derived a lower bound, autoregressive models define a tractable density.
They do this basically by specifying an ordering on the variables and then modeling a product of conditional distributions. Sampling can be slow because it's an iterative process; there's no latent representation, which comes from the fact that it's a fully observed model; and they can sometimes be slow to train, although there are lots of efficient implementations. Cool. So, as I said, it's a likelihood-based method, and we'll see that we can specify the model so that the likelihood can be optimized exactly. The basic idea:
We have our data instance X, which we can break up into its different dimensions, x1 to xn, and we're going to define an ordering on the components of X. Basically, just using the rules of probability, we can rewrite P of X as a product of conditionals: P of x1, times P of x2 given x1, times P of x3 given x1 and x2, and so on and so forth. What this looks like in graphical-model form is that each of these variables depends on the previous ones, and there's a sequential sampling procedure where we first sample x1 from some distribution over x1 and then subsequently sample each component of our data given the previous components.
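Here's a tiny sketch of that sequential sampling procedure; the `model` is a stand-in for whatever network returns the parameters of the i-th conditional (for concreteness I assume binary components, which is an illustration rather than anything from the talk):

```python
import torch

def autoregressive_sample(model, n_dims):
    """Sample x one component at a time; model(x[:i]) is assumed to return the
    parameter of p(x_i | x_1, ..., x_{i-1}) given the components generated so far."""
    x = torch.zeros(n_dims)
    for i in range(n_dims):
        probs = model(x[:i])            # conditional distribution for component i
        x[i] = torch.bernoulli(probs)   # e.g. binary pixels; any conditional works here
    return x
```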
Cool. So I'm just going to quickly go through a history of different models here, just so you have lots of references if you want to go look things up on your own. This is an early autoregressive model; it showed pretty promising results at low resolutions, but it has a feed-forward architecture — as we'll see, we'll get into convolutional and recurrent architectures. This one is a very similar architecture, so it had limited expressive capability.
Okay, so now we're getting into slightly more modern and powerful autoregressive models. PixelRNN is a deep generative model of images. The pixels are ordered in a raster-scan manner — so if this is an image, the ordering goes left to right, top to bottom — and each pixel is generated conditioned on the previous pixels. These are some examples of image completions, and for 2016 this was pretty powerful.
Video pixel networks are basically an extension of PixelCNNs to a recurrent video predictor, and these produced pretty good results — again, this is from 2016. WaveNet is a really cool extension: it's very similar to PixelCNN, but applied to 1-D audio signals. I think it's applied mostly to speech; you could also apply it to music and things like that. It's a fully convolutional neural network, and the convolutional layers have a dilation factor, which allows the receptive field to grow basically exponentially and makes it much more efficient.
Actually, here's a picture of the dilation — there we go. Basically, you can think of a dilated convolution as a convolution with a very, very wide receptive field but lots of holes within it, so in this case it captures long-term temporal dependencies in a very efficient manner. This is another cool model; I'd look it up, it's fun, and it can be applied to a lot of different 1-D signals.
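To make the exponentially growing receptive field concrete, here's a small sketch of my own (not code from the talk) of a stack of dilated, causal 1-D convolutions in the WaveNet style; with dilations 1, 2, 4, 8 the receptive field roughly doubles with each layer:

```python
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.dilations = dilations
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d) for d in dilations
        )

    def forward(self, x):  # x: (batch, channels, time)
        for d, conv in zip(self.dilations, self.layers):
            # Left-padding by the dilation keeps the convolution causal (no future leakage).
            x = torch.relu(conv(nn.functional.pad(x, (d, 0))))
        return x
```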
Okay, so normalizing flows. This is the next class of generative models we're going to look at. It also fits within the likelihood-based methods, where we have a tractable likelihood that we're going to optimize. High-level summary: it's a directed latent variable model, so similar to the VAE; it's likelihood-based learning, and we'll see that we can define the likelihood in a way that allows for exact optimization of the log-likelihood.
The one downside is that these can be slow to train. Cool, so the basic tool we're going to use here is called a normalizing flow. Normalizing flows are basically a tool for constructing complex distributions by transforming a probability density through a series of invertible mappings. I found this nice little graphic here which I like, and basically it shows that we apply a sequence of invertible transformations, f1 through fK.
These are all going to be a sequence of invertible transformations, so if we think in our generative-model space, we can have some prior distribution over our latent codes — again, something nice and simple — and then apply a sequence of transformations in order to get a distribution over our data instances.
There are two key concepts we need to understand for normalizing flows: the determinant of the Jacobian, and the change of variables theorem. Just as a really quick linear algebra refresher, the Jacobian matrix is the matrix of first-order partial derivatives, and we need its determinant.
I'm just stating it here, and then we'll use it. The change of variables theorem basically tells us how to infer the unknown probability density function of a new variable — in this case p of x — given that we know P of Z. So we know P of Z, and G theta is some deterministic function of Z; we'll see later that we can actually write G theta as a series of transformations, and the change of variables theorem lets us write this likelihood term in terms of the density over Z — sorry, the slide is wrong here,
it should say Z, because otherwise it doesn't make sense. Cool, so just working through this in our generative model: for the generative process, we have this prior — again, the exact same thing as the variational autoencoder. It differs from the variational autoencoder in that with the VAE we actually had an observation model here; now G theta is just going to be a deterministic function of Z.
If you want, you could imagine our observation model as being a Dirac delta function on one point. Again, it's a likelihood-based method, so we're optimizing the log-likelihood with respect to theta. This is just repeating what I had, except now we have the correct Z here. Okay, so I'm just going to walk through the different components of this slide, because it explains everything you need to know about flow-based generative models.
So we begin with an initial distribution P of Z, and we're going to apply a series of invertible transformations. We have this function f, and we can basically write it as a series of functions, so the relationship between X and h1, between h1 and h2, and so on down to Z, is given by each of these individual functions. This is exactly what I wrote before, except now I'm separating out each of the different layers.
It's useful to think of it in terms of these compositions of transformations, because we're going to build this with a big neural network, and it means we just need a certain property to hold for each of the individual components of our network. This is exactly what we were doing before with the change of variables; here I'm just expanding it out to this sequence. What we need in order to do this is for each of these functions f_i to be easily invertible, and for the determinant of its Jacobian to be easy to compute.
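Written out, the change of variables formula for a composed flow x = f_K ∘ … ∘ f_1(z) gives the exact log-likelihood these models optimize (with h_0 = z and h_K = x in the notation of the slide):

```latex
\log p_\theta(x) \;=\; \log p(z) \;-\; \sum_{i=1}^{K} \log \left| \det \frac{\partial f_i(h_{i-1})}{\partial h_{i-1}} \right|
```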
A lot of different flow-based generative models have been proposed; I'm just going to point to a couple of them. This is an early one, NICE (nonlinear independent component estimation), which basically stacked a sequence of invertible transformations called additive coupling layers.
The RealNVP model built upon this; the main distinction is that they changed the additive coupling layers to affine coupling layers by adding a scale parameter, and they also introduced a multi-scale architecture which allowed for more efficient models of large images. These are some examples of images generated from this model, which at the time were quite impressive. Then, more recently, there's a cool model called Glow.
Glow basically builds on the RealNVP model, but it introduces invertible one-by-one convolutions and ends up with a pretty efficient architecture; it also pulls in the multi-scale ideas from RealNVP, and now we end up with some really nice images. With this model we're getting into the current generation of generative models, where we're having really, really good image synthesis results.
So this model — I don't know, these faces maybe aren't quite where you'd want them to be. I've played around with this model a little bit, and one of the downsides is that in order to get really nice quality results, you end up having to sacrifice diversity a little bit. Some extensions of this stuff: this is a flow-based generative model for video.
All of these video models use these robot arms that push around objects on a table, so any video examples they show are basically going to be this. Okay, so generative adversarial networks. Now we're going to get into a different class of generative models that don't rely on maximum likelihood estimation. I'm going to expand this out a little bit — there's a nice paper here which I'm not going to cover too much of.
Instead, you say: I don't even care about the likelihood function. I don't need it — I could have it, but I'm not going to use it. In all of these cases — generative adversarial networks and moment matching networks — we just aren't going to have it. Instead, what we have access to is this generative process, and the idea is that training is going to proceed by comparing sets of images, essentially, between the data distribution that you have access to through your training set
and images sampled from your generative model. There are different sorts of approaches here: moment matching networks, generative adversarial networks, and f-divergences, which I'm going to talk about very briefly — that's a broader class of learning metrics, and generative adversarial networks can fall under it under certain circumstances. Basically, all of these approaches learn through this comparison. I think that's just a repeat of the same slide — oh wait, now I'm going to go to generative adversarial networks. Cool: so, generative adversarial networks.
I'm going to spend a decent amount of time on these because they're used everywhere now. This is a slide I took from another talk, which I really like: it's the number of GAN papers per month, starting in 2014 up until, I don't know, maybe 2018 — just a huge exponential spike. So again, this is also a huge area. A lot of these papers are different tricks and techniques to improve the stability of training; some of them define new loss functions that are slight variants of the original one.
Some of them are applications. There's just so much here, so I've tried to be a little bit selective and give a good overarching view of what GANs do, a tiny bit of theory, and then some extensions that have been proposed in the last couple of years. Just to show — I've shown a lot of images so far — this shows the progression of generative adversarial networks from 2014 up until 2018. These are all synthetic faces.
None of these are real people. In 2014, when GANs were originally developed, this was state-of-the-art generative modeling of faces — generating realistic images is really, really hard — and then just a couple of years later, this is from StyleGAN, which I'll talk about a little bit, and now we're able to get super high-resolution detail and high-quality image synthesis. Okay, quick summary: GANs, again, are directed latent variable models.
This is just what the generative process looks like. It's similar to the Glow model, the flow-based model, where we have some simple prior and then our X is a deterministic function of that latent variable — in contrast to the VAE, where we had a probabilistic observation model; here it's just a deterministic function, similar to the flow-based model. So if you want to think of this as a probability, you can think of it
as all of the density sitting on one single point. Cool. So right, there's no explicit density. The intuition with GANs is that we're going to learn via a two-player game. We have our generator, which is defined up here, and we have a discriminator, and the discriminator is trained to distinguish samples that come from the true data distribution, which we have access to through the training set.
The actual loss for the discriminator will change a bit depending on the framework, but essentially it's just trying, as best as possible, to distinguish between these two sets of images. The generator is trained to produce samples that fool the discriminator, so training proceeds in a back-and-forth fashion where the generator and discriminator are constantly updating and learning. As the discriminator gets better at differentiating generated samples from true data samples, the generator has to get better at fooling the discriminator, and basically get better at producing images — or data instances, if we're not in the image domain — that look like real data.
What this looks like in network form: we have this discriminator, and there are two types of data the discriminator sees. If it sees real data, it's basically trying to produce something close to one — again, the exact loss function, which we'll get into, will change depending on the framework — so the discriminator sees two different types of samples. Then for the generator: the Z sampled from the prior, combined with the generator, really just defines a model distribution, and the generator is trained to make the discriminator think that its sample is real, so close to one. We'll see that the way the generator is trained is basically forward propagation through this network; the generator has some loss function, and then the gradients are propagated through the discriminator and back to the generator.
So when the generator is being learned — when the gradients are passing through here to the generator — the discriminator is held fixed, and then when the discriminator is learning, the generator is held fixed.
This is what the originally proposed loss function looks like. The first term here is an expectation under our data distribution — so again, this would be approximated with samples from our training set — and it's the log-probability that data from the true data distribution are considered genuine by the discriminator. In contrast, here we're sampling latent variables from our prior distribution and feeding them through our generator, so this G of Z gives a generated sample, or a fake sample, and then this is the log-probability under D that the samples from the generator are considered fake. If you ignore the G part, this is just D optimizing a binary cross-entropy loss function. And then for G: we have max over D, min over G, which means the generator is basically trying to fool the discriminator, and the generator's gradient gets back-propagated through D. So again, as I mentioned, this is an alternating optimization procedure.
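The original minimax objective being described is, in the usual notation:

```latex
\min_{G}\,\max_{D}\;\; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
 \;+\; \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
```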
Again, a lot of the theory around GANs holds under optimal conditions, and we are never under optimal conditions, so a lot of this is giving a bit of intuition rather than saying anything super precise. But another important intuition: earlier on, at the beginning, I talked about the difference between minimizing forward KL, reverse KL, and the Jensen-Shannon divergence, and what that looks like for different generative models. Here we're minimizing the Jensen-Shannon divergence, and here's a good slide.
GANs tend to have higher-quality samples, but they might miss modes. In contrast, look at the KL divergence: maximum likelihood estimation — I should have said this at the beginning — minimizes the forward KL divergence between the data distribution and the model distribution. A lot of early maximum-likelihood-based models, typically VAEs for the most part,
tended to produce samples that were quite blurry — not as crisp and as sharp as GANs — and one of the ways this was explained was that the KL divergence tends to emphasize capturing all of the different modes, basically putting model density anywhere that data lies, and sometimes this comes at the expense of putting density
where there is no data. You can see — I can't point with the cursor — in this middle area here, basically, there is no data, but the model ends up putting density there. So that's a little bit of intuition for the different trade-offs of these models. This image comes from the paper that I'm citing here, and I would also just recommend reading it.
This is more commonly used; it's often referred to as the non-saturating GAN loss. Basically, the discriminator's objective doesn't change at all, but the generator's objective changes. In the original objective the generator was just trained to minimize this function, and G only really shows up in this part, so G was trained to minimize this term here. In contrast, with this new function G is trained to maximize this term here.
What this looks like in practice: the discriminator has real images labeled with label 1 and fake (synthetic) images labeled with label 0, and it's trained with the binary cross-entropy loss to distinguish between those two sets of examples. Then for the generator, what we do if we're implementing this is just flip the label of generated images from 0 to 1 when we're optimizing the generator, feed that through the discriminator with the flipped label, get the gradients through the discriminator, and update the generator based on that.
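Here's a rough sketch of one training step with the non-saturating loss as just described — real labels 1, fake labels 0 for the discriminator, and flipped labels for the generator update. The generator, discriminator, and optimizers are placeholders, and the discriminator is assumed to end in a sigmoid so it outputs probabilities:

```python
import torch
import torch.nn.functional as F

def gan_step(real, generator, discriminator, opt_d, opt_g, latent_dim=64):
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: real images get label 1, generated images get label 0.
    fake = generator(torch.randn(batch, latent_dim)).detach()
    loss_d = F.binary_cross_entropy(discriminator(real), ones) + \
             F.binary_cross_entropy(discriminator(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step (non-saturating): flip the label of generated images to 1.
    fake = generator(torch.randn(batch, latent_dim))
    loss_g = F.binary_cross_entropy(discriminator(fake), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```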
This is another, I think, really insightful and useful GAN paper to read. They go through a couple of different problems with the typical GAN training procedure, looking at stability. The first thing they observe is that both the data and the generative distributions very likely lie on low-dimensional manifolds, and here's what that means.
Basically, this is a three-dimensional space and really a one-dimensional or two-dimensional manifold, and if these red and blue lines or planes are the generative and data distributions, they really don't intersect very much. They might be entirely disjoint, or there might be a very, very small region of space where they intersect. This paper talks about how that's really challenging for GAN training, because it means you can actually find a discriminator
that can perfectly distinguish between the generative distribution and the true data distribution. This means that if you have this kind of perfect discriminator, then the generator basically gets zero gradients everywhere, so it isn't super useful. This is an experiment, a plot from the same paper. So, as I said earlier,
a lot of the nice theoretical results for GANs come under the condition of an optimal discriminator, but in this paper they actually look at it and say: okay, if we actually have an optimal discriminator, that's really bad for the generator because of this vanishing gradient problem. What they did is basically continue to train the discriminator for increasing amounts of time and then look at the gradients the generator got for each of those discriminators.
Here, as we move along the x-axis the discriminator is getting better, and this is the norm of the generator's gradient, and we see that it actually decays — and this is, I believe, on a log scale. So this is problematic; it's this weird dilemma that GANs face where, you know,
if the discriminator is really bad, then it's not giving good feedback to the generator, but if the discriminator is really good, then the generator might not be getting enough signal from the discriminator. There are a lot of different things people have proposed over the years to deal with this. Instance noise is one example; this was proposed in a couple of different papers early on.
I actually don't think it's used that much now, but it's kind of an interesting historical note. [Audience question] Yeah, so in the traditional loss function here, because we have this log, this is going to saturate. If the discriminator is super certain — if it's really, really good, it's super confident, and it's doing a really good job of discriminating real samples from fake samples — it's just going to saturate because of this log here.
Sorry — the log function just plateaus at a certain point, so there's no more signal once you reach that point. [Audience question] Yeah, so I think what you're referring to is called label smoothing. In the traditional GAN framework you have labels of 0 and 1 for synthetic (fake) data and real data, and something that was proposed,
I think in the paper on improved techniques for training GANs from 2016, is label smoothing: basically, instead of giving hard 0/1 labels, you either give noisy labels or give labels of 0.9 and 0.1, so that you end up smoothing this distribution a little bit, and that does help a little bit in this instance.
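A tiny sketch of the label-smoothing variant just mentioned (targets of 0.9/0.1 instead of 1/0; the discriminator is assumed to output probabilities):

```python
import torch
import torch.nn.functional as F

def smoothed_discriminator_loss(d_real, d_fake):
    # d_real, d_fake: discriminator outputs in (0, 1) for real and generated batches
    real_targets = torch.full_like(d_real, 0.9)   # instead of 1.0
    fake_targets = torch.full_like(d_fake, 0.1)   # instead of 0.0
    return F.binary_cross_entropy(d_real, real_targets) + \
           F.binary_cross_entropy(d_fake, fake_targets)
```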
Another thing that very frequently happens with GANs is referred to as mode collapse. The idea is that at some point in training the generator just collapses and starts producing a very small set of images — it might be a single image, it might be a small set of images — but basically it's images that might look like total garbage when you look at them, that don't really reflect the data distribution, but that are
still fooling the discriminator. Sometimes you'll see this kind of cyclic behavior, where the generator will produce this set of images here, then the discriminator will catch up and start to be able to discriminate those, but then the generator will shift over to this other set here. This is a really common thing that happens.
This figure here is describing that. It comes from the unrolled GAN paper, which is a cool GAN architecture — it's not really used much that I know of now, but it's kind of a theoretically pleasing thing. I'll explain it in a second, but first: this is our target distribution, right.
You could imagine that the generator gets this one mode and it's fooling the discriminator really well, but then all of a sudden the discriminator catches on to how these samples relate to the training data, so the generator hops to another mode, and it just keeps moving around. This is a very toy example of that.
But you tend to see this a lot when you're training GANs. The unrolled GAN basically proposes that the generator actually take the discriminator's gradient into account — the "unrolling" refers to optimizing through both of the different objectives. Another thing that has been proposed is to use batch statistics; this was proposed in different ways in a couple of different papers.
The idea is that instead of the discriminator just seeing single instances, it actually gets to see a large set of instances at each time. You can implement this by just giving the discriminator access to all of the images in a batch, for example. Now the discriminator can see that there's a huge amount of diversity in, say, the 100 real images in the batch, but the 100 images coming from the generative distribution have very, very limited diversity.
Another challenge of GANs — and I'll get into this a little bit — is just the evaluation criterion. With likelihood-based methods there's a clear loss function that you're optimizing, and it goes down as you train, or hopefully it goes down, and if it doesn't go down then you know something is wrong. In contrast, most GAN frameworks are optimizing
with this alternating optimization procedure, where the optimum might be a saddle point, and the actual value of the loss function doesn't tell you that much. This can make it really tricky when you're trying to decide what a good stopping criterion is, when you want to do model comparison, or when you just want to run a giant hyperparameter sweep and pick your best model — this becomes really, really challenging, and you'll see a lot of papers in this area.
Okay, so, as I said, there are a ton of different GAN losses — this is not even half of them; it's just a nice table that I grabbed from this paper, and there are so many different versions. This is the original GAN objective that was proposed. The NS GAN is the non-saturating GAN, which is used very much in practice. I'm going to go through the WGAN, which is the Wasserstein GAN, and the WGAN-GP, which is the Wasserstein GAN with gradient penalty.
There are way more — this table should actually go on forever, because there are so many different variants that have been proposed. I'm going to go through the Wasserstein GAN because it's used quite frequently; I use it, I like it, it's fairly stable, and that's enough justification. Cool, okay. So, just really briefly, I'm going to give some intuition for the Wasserstein metric, which is what the Wasserstein GAN uses. I think I might have referenced this earlier, maybe not, anyway.
In this case we can think of our model distribution and our data distribution, and the cost of transporting mass between them. I'm just going to go through a really toy example, which will hopefully give some intuition. Basically, we have KL divergences as one way of measuring distances between distributions; this is another way of measuring distances.
Basically, in this toy example, imagine we have a bunch of boxes. The number of boxes in each particular location represents the probability mass in that location, and we want to convert this one distribution into another distribution. We're going to do this by moving these boxes over, and the cost of moving a box is the weight of the box — which we'll just assume is one for now — multiplied by the distance.
So if we were to move this box over here, the cost would be seven: ten minus three. Basically, the Wasserstein distance is the cost of the cheapest plan, where a plan specifies how we move each of these different boxes. I won't go into too much depth here, but if you go back and look at the slides, this blog post goes through it in more depth.
Basically, this plan here describes where each of the boxes is moving, and these distributions are going to be the marginal distributions. So this is saying — actually, these distributions should be flipped, sorry, this is bad — but here we have three boxes here, and then we're moving them over to these different locations here. So, right, we may have a whole bunch of different plans.
Each of them will have a different cost, and the Wasserstein distance is the cost of the cheapest of these transport plans. In this particular example they both cost the same, but it's a simple example of how different plans can cost different things. Okay, so writing this out mathematically: we have our data distribution and our generator distribution, these gammas are the different transport plans, and this altogether is the Wasserstein metric, also called the earth mover's metric.
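Written out, the quantity being described is the following, with Π(p_data, p_g) the set of joint distributions (transport plans) whose marginals are the data and generator distributions:

```latex
W(p_{\text{data}}, p_g) \;=\; \inf_{\gamma \in \Pi(p_{\text{data}},\, p_g)} \; \mathbb{E}_{(x, y) \sim \gamma}\big[\, \lVert x - y \rVert \,\big]
```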
This paper — I would recommend reading it — goes into a lot of really nice theory as to why this type of metric makes a lot more sense when you're comparing these kinds of probability distributions. A lot of the intuition they go through is that you end up with no gradients in certain places if you use the KL divergence, reverse KL, or Jensen-Shannon, but you end up with gradients everywhere if you use this, and they work through a couple of really simple toy examples.
Of course, this is intractable, so of course they come up with a nice approximation. The approximation they propose in the paper uses this duality, which I honestly don't know much about, so I'm just going to point to this blog post. I haven't really seen it explained much in most GAN papers — it's just kind of stated — so I found this blog post and I would encourage you to look at it, because I can't teach that. So cool.
So now, because of this duality, we have rewritten the Wasserstein loss in a form which looks very, very similar to our GAN setup from before. We have this expectation over data sampled from our data distribution and this expectation over generated images. Here we have this f function, which in the Wasserstein framework is typically called a critic, but if we just sub in D here, then we basically have
something that looks a lot like the GAN objective. Just to compare these side by side: this is the original GAN objective, and this is the Wasserstein GAN objective. Here we now have a constrained optimization problem, because we need this discriminator function to be 1-Lipschitz, and in the Wasserstein GAN paper they originally enforced this through weight clipping.
That's not the greatest way of doing it. They did get some good results, and I think when that paper came out it was state of the art, but there are different ways of enforcing this constraint. So this is the Wasserstein GAN with the gradient penalty — this formulation is much more commonly used — and basically the idea is just to enforce the constraint by adding a gradient penalty that constrains the discriminator to have gradient norms of at most one everywhere. Cool.
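Here's a rough sketch of the gradient penalty term as it's usually implemented (penalizing the critic's gradient norm at random interpolates between real and generated samples); this is my sketch of the standard recipe, not code from the talk:

```python
import torch

def gradient_penalty(critic, real, fake, weight=10.0):
    # Random points on the lines between real and generated samples.
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)

    scores = critic(interp)
    grads = torch.autograd.grad(scores.sum(), interp, create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return weight * ((grad_norm - 1.0) ** 2).mean()  # pushes the critic toward unit gradient norm
```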
Now, with this formulation, we get some really nice samples. These are a bunch of synthesized bedroom images. I always feel like people who jump into generative models with no context ask: why are you generating bedrooms? But bedrooms are just a really common data set that people use in the GAN world, so that's why we're looking at bedrooms. These are all synthetic images, but they're all pretty high quality, and again this was kind of state of the art when it came out. Okay, so another set of GAN stabilization techniques.
So far we've looked at a couple of different loss functions; now I'm just going to stick in some different architectural improvements. These are just things to think about when you're building a GAN model. Batch normalization, and the idea of avoiding sparse gradients: both of these ideas were initially proposed in the DCGAN paper that came out in 2015. This is just an example of what that architecture looks like — really simple tricks that just really improve training. Basically, batch norm works by normalizing
A
the input features to a layer to have zero mean and unit variance, and it just really helps with stability. It also empirically helped a little bit with mode collapse, and I think there was some intuition that it helped deal with poor parameter initialization in some cases, because batch norm is generally helpful there. Then virtual batch norm is a different variant, where each example is normalized
A
based on statistics collected from a reference batch, as opposed to batch norm, where the statistics are collected from the particular batch that's going through the network. So yes, this was another good trick. Then, generally, it's good to avoid sparse gradients: instead of using ReLUs, use leaky ReLUs, and this just helps improve the signal from the discriminator to the generator. Spectral normalization: this is a more recent paper.
A
It's a weight normalization method that has also been proposed to stabilize training, and it can be thought of as an alternative to the weight clipping or gradient penalties; another normalization technique. It was originally applied just in the discriminator. These synthesized images are coming from that paper, so now we're really getting into state-of-the-art image generation. These are synthetic pizzas and synthetic cats, because everybody loves cats and pizza.
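A minimal sketch of those two tricks together, assuming PyTorch: spectral normalization on the discriminator's weights and leaky ReLUs instead of ReLUs; the layer sizes here are made up:

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

discriminator = nn.Sequential(
    spectral_norm(nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),   # keeps a gradient for negative activations
    spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Conv2d(128, 1, kernel_size=4)),  # scalar real-vs-fake score
)
```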
A
So these are good examples to show. Later work then applied it to both the discriminator and the generator and found it was helpful there too. This is also a cool paper that I think is worth reading: basically, they looked at the conditioning of the Jacobian of the generator and found empirically that poor conditioning of the generator was very predictive of the different metrics that people use to evaluate generative models. Okay, multi-scale architectures. Now we're going to get into some architectural stuff.
A
Progressive growing of GANs. This is another good paper; I'd say it's also still roughly state of the art. The idea here is that we start with a low-resolution image and then progressively increase the resolution, but this time by adding layers to the network in an incremental fashion, as opposed to the multi-scale approach where you train one model at this resolution, then fix it and train another model
Another
resolution
on
here
the
layers
of
the
network
right
is
kind
of
progressively
being
added
in,
and
this
is
nice
because
it
allows
the
models
work
early
on
to
capture
very
coarse-grained
features
of
your
data
set
and
then
over
time
kind
of
focus
on
more
fine-grained
details.
These
are
some
samples
from
the
progressive
growing
up
Ganz
paper.
These
are
all
synthesized
phases.
It
was
trained
on
a
data
set
of
celebrities,
so
they
all
look
like
celebrities.
I,
don't
know
these
are
1024
by
1024
pixels.
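A minimal sketch of the fade-in idea behind this kind of progressive growing, assuming PyTorch and that the newly added block upsamples by a factor of two; all names here are illustrative:

```python
import torch.nn.functional as F

def grow_step(features, old_to_rgb, new_block, new_to_rgb, alpha):
    # Old path: image from the previous resolution, naively upsampled.
    low = F.interpolate(old_to_rgb(features), scale_factor=2)
    # New path: image produced through the freshly added (upsampling) layers.
    high = new_to_rgb(new_block(features))
    # alpha ramps from 0 to 1, smoothly handing off from old to new layers.
    return alpha * high + (1 - alpha) * low
```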
A
That's quite a high resolution. So, long-range dependencies and global structure still remained a challenge, though this is getting better. The self-attention GAN was proposed as a way of dealing with this. At the point when it was proposed, GANs were really good at low-level textures, and they were also really good at faces, which have a lot of symmetry and a lot of structure, but for generic objects
A
GANs would have trouble with things like counting and would just put, say, six eyes on a face, so you'd see images that from far away look like an image but close up, not really. Basically, the self-attention GAN introduced an attention mechanism into the generator and the discriminator, and this helped the model learn long-range dependencies and more global structure. So now, with the self-attention GAN, we see really coherent global structure.
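A minimal sketch of an image self-attention block in the spirit of what's being described, assuming PyTorch; the class name and channel-reduction factor are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # starts as an identity mapping

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).view(b, -1, h * w)                       # (b, c//8, hw)
        k = self.key(x).view(b, -1, h * w)                         # (b, c//8, hw)
        v = self.value(x).view(b, -1, h * w)                       # (b, c,    hw)
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # (b, hw, hw)
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        # Every spatial location can attend to every other location.
        return x + self.gamma * out
```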
A
These are samples of dogs and different kinds of birds and fish, with just much more coherent structure. Okay, BigGAN. This is another good one, another paper I'd recommend reading, because this is the state of the art right now, and these are all synthetic images and they're really good.
A
What's cool about this paper is that they just did a bunch of really simple things. They did things like increasing the batch size: here's the batch size, they just increased it, and already that improved things. These are different metrics, which I'll get into in a second; smaller is better here, larger is better here. They increased the batch size, and they doubled the number of channels in each layer.
A
They added skip connections from the noise vector, so instead of the noise vector going directly into only the first layer of the generator, it goes into each layer, and they had a couple of different versions here: they would either give the exact same noise to each layer, or each layer
A
got a different chunk of the noise vector. They also saw really good results by sampling from a truncated Gaussian distribution, and the idea here is that when values fall outside a particular range, you just re-sample within that range. This didn't work with every architecture, because you're essentially sampling from a different distribution than you saw during training, but with some other tricks they made it work. So here now we see, okay, this is the effect of that truncation.
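A minimal sketch of that truncated sampling, assuming NumPy/SciPy; the threshold value is illustrative and plays the role of the truncation level:

```python
import numpy as np
from scipy.stats import truncnorm

def truncated_z(batch_size, dim, threshold=0.5):
    # Draw latent vectors from a standard normal truncated to
    # [-threshold, threshold]; out-of-range values are effectively re-drawn.
    z = truncnorm.rvs(-threshold, threshold, size=(batch_size, dim))
    return z.astype(np.float32)
```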
A
Basically, this is similar to that Glow model I was showing earlier, where you had a temperature parameter to control sampling, so you can generate very regular images or get more diversity. You can see that the quality of the images improves as you limit the diversity, but you can pick something in the middle here and it looks pretty good. And then, yeah, these are just some things that BigGAN still struggles with, on the left.
A
But this is another method, where they basically take techniques from style transfer models and bring those into the generator. They're building off of the progressive GAN model but just improving the generator, and now we're starting to get frankly scary image generation of people. These are all synthetic faces that came from this StyleGAN model, and the model also does a good job on other kinds of objects.
A
These are bedrooms again, and cars, and we're seeing good, globally coherent structure. Cool. I'm actually going to rush through this; I'm just going to point to this paper, because it's again some nice theory, the f-GAN paper. f-divergences are a class of divergences, again for comparing different probability distributions, and they show some nice theory that unifies a lot of different GAN frameworks and other kinds of implicit models within this. And, yes, just read that paper. Moment matching networks.
A
This is another framework that I'm just going to skip through. It's similar to GANs in terms of the actual architecture of the model, the ancestral sampling procedure, all of that kind of stuff, but training is more stable because you don't have this alternating optimization procedure. They haven't really caught on, though, and the samples are not amazing, although there is some work where people combine GANs with moment matching networks, and you can also use the moment matching criterion as a way of evaluating GANs.
A
So if you're interested, go read this further; I'm going to skip through here. Okay, so evaluating generative models. This is something I kind of skimmed over throughout. The obvious way of evaluating generative models is to look at the log-likelihood: take the likelihood of your data set, typically held-out data that you didn't see during training, and look at the likelihood of that data under the model that you've learned.
A
This is the natural way in which likelihood-based methods are compared with one another, but it isn't really viable for implicit models like GANs, because there's no explicit density function. Early works tried to get around this with a kind of Parzen-window approach, where you take a bunch of sampled images, place a Gaussian on top of each one, and say that this mixture is your density model, or an approximation to it.
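A minimal sketch of that kind of Parzen-window estimate, assuming NumPy/SciPy; the bandwidth `sigma` is an illustrative choice, and, as noted next, this breaks down in high dimensions:

```python
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(test_x, samples, sigma=0.1):
    # test_x: (n, d) held-out points; samples: (m, d) generated points.
    diff = test_x[:, None, :] - samples[None, :, :]               # (n, m, d)
    log_kernel = -0.5 * np.sum(diff ** 2, axis=-1) / sigma ** 2   # (n, m)
    d = test_x.shape[1]
    log_norm = -0.5 * d * np.log(2 * np.pi * sigma ** 2)
    # log of the average of m Gaussians centered on the samples.
    return logsumexp(log_kernel + log_norm, axis=1) - np.log(samples.shape[0])
```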
A
But this doesn't really work, because in high-dimensional spaces it just doesn't make sense. Other things you might care about are the perceptual quality of the generations and their diversity. Whether the model is overfitting is really important, because if you just take your training set, it has a lot of diversity and the perceptual quality is really good, but that's not a great generative model, so overfitting is really important to check. And then also, you know, utility for some kind of downstream task.
A
So this means that, given a huge set of generated images, if I look at the marginal distribution over the labels for all of them, I should hit all of the classes roughly equally. The inception score looks at the KL divergence between these two distributions, because you want one, the conditional label distribution, to be highly peaked while wanting the other, the marginal, to be uniform. So if we look at the KL divergence, the higher the KL, the more we're satisfying these two things. Cool.
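A minimal sketch of that computation, assuming `probs` holds the classifier's predicted label distributions p(y|x) for a set of generated images (e.g. from an Inception network):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # probs: (n_samples, n_classes) array of p(y|x) per generated image.
    p_y = probs.mean(axis=0, keepdims=True)                  # marginal label distribution
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))   # KL(p(y|x) || p(y)) terms
    return float(np.exp(kl.sum(axis=1).mean()))              # higher is better
```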
A
This is nice because it gives us one single number, but it does have some limitations. It doesn't capture within-class diversity: if you generated just a single instance of each class, you'd still get a really high score. Another limitation is that it doesn't actually rely at all on the data distribution; nowhere am I using the data distribution when I compute this, although you could imagine training the classifier network that you use on your data distribution
A
if you had labels for it, and so you could kind of get it in there, but it's a bit weird that it doesn't use it. And then there's also no measure of overfitting: you could just reproduce instances from your training set exactly and you'd still have a high inception score.
A
So an additional thing we need to do is something like looking at nearest neighbors. This is frequently done to measure overfitting: nearest neighbors either in pixel space, which is something, although nearest neighbors in pixel space don't make a lot of sense just because we're in this very high-dimensional space, or in the embedding space of some other model, a classifier, for example.
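A minimal sketch of that embedding-space nearest-neighbour check, assuming the generated and training images have already been mapped to feature vectors by some fixed model:

```python
import numpy as np

def nearest_training_neighbors(generated_feats, train_feats):
    # generated_feats: (g, d), train_feats: (t, d) feature vectors.
    dists = np.linalg.norm(
        generated_feats[:, None, :] - train_feats[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)  # closest training image for each sample
    # Returning the indices and distances lets you eyeball near-duplicates.
    return nearest, dists[np.arange(len(nearest)), nearest]
```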
A
Also, human evaluations are frequently used, where you get a batch of humans, have them look at the images, and ask which one is perceptually better. These are some good further readings on this space of evaluating models. This one is actually a good one to read, "Are GANs Created Equal?": they basically implemented a huge number of the different GAN training frameworks and found, essentially, that
A
a lot of the differences actually just come down to little optimization details; it's a really good summary. And then also "Pros and Cons of GAN Evaluation Measures": they go through a huge set of different evaluation approaches. I have about two minutes left, so I'm going to go through some cool applications. Image-to-image translation.
A
This work basically translates between different domains, so we can go from, say, daytime to nighttime, or from a sketch to an actual image of a purse. Unpaired image-to-image translation, this is cool: the previous work was paired, meaning that during training we needed paired examples, whereas this one just needs two different domains, and that's cool.
A
Super-resolution, I think I mentioned this one earlier on, but this is a nice kind of image-editing application. Generating molecules: I've been focusing a lot on images throughout this talk, but there are a whole lot of other domains in which you can apply these different techniques. And yeah, there we go.