Description
Featuring Balaji Lakshminarayanan, Dustin Tran, and Jasper Snoek from Google Brain.
More about this lecture: https://dl4sci-school.lbl.gov/uncertainty-and-out-of-distribution-robustness-in-deep-learning
Deep Learning for Science School: https://dl4sci-school.lbl.gov/agenda
A: Thanks, Mustafa, and thanks everyone for the invitation and for coming to our talk. Can you all hear me? Okay, great. So today we're going to split the talk roughly into three parts. The way we've structured it is about 20 to 22 minutes each, and we'll have five to eight minutes for questions, or you can even take a break if you want. I'll present first, and if you have questions you can ask them in the Slack channel or post them in the Zoom chat, whichever is convenient, and in about 20 minutes I can take some of those questions.

All right. Today we're going to be talking about uncertainty in deep learning, and this is joint work with lots of awesome colleagues at Google, DeepMind, and elsewhere.
I'll first start off by giving some background on why uncertainty is important and why we care about this topic. Before we get to the methodology, what do we mean by predictive uncertainty? To make sure we're on the same page: what we mean is that we want to predict output distributions rather than point estimates.

Imagine you're doing a classification problem like the 2D example above, where you have some features x1 and x2 and two classes shown in red and blue, or maybe you're doing regression, where you have some feature x and you're trying to predict some function value y. Typically most people just train a classifier that gives a deterministic prediction, and here we don't want deterministic predictions.

We want a distribution that captures the uncertainty. For classification, that could mean you don't just predict which class it is; you predict the probability with which you believe the point belongs to the class. For regression, you don't just want to produce a point estimate but also a sense of the variance, as shown in the figure on the right.
So what are the sources of uncertainty? There are two sources of uncertainty that I'm going to be talking about. The first source is inherent noise in p(y|x) itself. You can imagine there's some noise in the labeling process. For instance, here is an example from the CIFAR-10H dataset, where they took images from the CIFAR-10 test set and asked multiple raters to label them, and here's the resulting distribution.

As you can see, some images are pretty clear and the human labels are essentially unanimous: on the top, everybody agrees it's a plane; on the second, everybody agrees it's a cat. But the last two images are inherently ambiguous. You could imagine the third being a ship or some kind of animal, and for the last one, since we can't see the head clearly, depending on how you look at it the texture looks a lot like a deer, but it could also be a bird. So there can be some human ambiguity in the label, which means there's an inherent distribution over labels for a given input rather than a single deterministic label. Similarly, you can have measurement noise in y.
Imagine you're running some experiments and trying to regress some function with respect to some parameters. If there is inherent randomness in the procedure, or some measurement noise, then every time you do the measurement you may observe a slightly different value, so there isn't a deterministic y for a given x but an entire distribution. This is sometimes also called aleatoric uncertainty, a term you may see. The distinguishing property between the two types of uncertainty I'm going to be talking about is that this source of uncertainty is usually considered irreducible, in the sense that even in the limit of infinite data it does not go away.

However, a lot of it can be caused by partial observability, in the sense that if you are given additional features you can reduce it. Imagine you had a higher-resolution version of this image, or something like that, where you could see more clearly; then you could reduce this uncertainty.
The next source of uncertainty is what we call model uncertainty, and this basically arises because, given limited training data, you may have multiple functions that are consistent with the observed training data. An example of this is the figure on the right, where you have two classes shown as squares and triangles and we're trying to do a binary classification problem. As you can see, there are multiple possible classifiers that explain the data equally well.

Given just this limited data, we cannot precisely pin down the one classifier that separates the data; there could be multiple, equally valid explanations that separate the two classes, so we are not sure which is the right function yet. If you observe more data, the classifier becomes more constrained, and in the infinite-data limit, assuming your models are identifiable — by which we mean that when you specify the problem there is a unique optimum, without symmetries or anything like that, so that you can actually identify each model — this uncertainty reduces as you get more data. This is also known as epistemic uncertainty, and it is considered reducible, unlike the data uncertainty from the previous slide, which persists even in the infinite-data limit.
Before I jump into the methods, I think it's useful to discuss how we measure the quality of uncertainty. There are a couple of measures that are commonly used, and one of the terms you will hear a lot is calibration. An intuitive way to think about calibration is that it measures how well, when models predict distributions, their expressed confidence — which is basically the model's own estimate of its probability of correctness — aligns with reality.

So that's how we can think about confidence: the model's own estimate of the probability of correctness. We can check how well it aligns with reality, in the sense that we can measure the accuracy on a dataset and see, in the cases where the model expects to be correct, what fraction of the time it actually is correct. A common way to do this is the following (I hope you can see my mouse pointer).
What people do, say in binary classification, where you're expressing probabilities in the zero-to-one range, is to bin those probabilities into multiple bins, and then for each bin do the following. In this bin, for instance, the model expects to be correct about 90 percent of the time — in the 0.9-to-1 range the model expects to be correct, on average, maybe 0.95 of the time — so you can use the average predicted confidence for the bin.

Then you can take all of the examples that ended up in this bin and measure what fraction of those the model got correct. If it is well calibrated, the model should be correct about 90 percent of the time on these points, and that's what you plot here: the x-axis shows the bin — the average probability, or average model confidence, within the bin — and the y-axis shows the actual accuracy of the model in that bin.
Conversely, in a bin where the model expects to be correct only 10 percent of the time, the average accuracy should be quite low, because the model is expressing very high uncertainty, whereas in a bin where the model expresses very high confidence, you expect it to be more accurate. If it's well calibrated, the predicted confidence should match up with the actual accuracy. So you can think of calibration as a sort of meta-accuracy: it measures the alignment between the model's own confidence and its accuracy.

There are various aggregate metrics for how far the model is from the ideal calibration curve, but one common one is the following: you take each bucket, measure the difference between the predicted confidence and the actual accuracy in that bucket, and then sum it up over buckets. That is what's called the expected calibration error.
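As a rough illustration of the binning procedure just described (my own sketch, not from the slides; the number of bins is an arbitrary choice):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - confidence| over confidence bins.

    confidences: array of predicted max-class probabilities in [0, 1]
    correct:     boolean array, True where the predicted class was right
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    n = len(confidences)
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.sum() == 0:
            continue
        acc = correct[in_bin].mean()       # actual accuracy in the bin
        conf = confidences[in_bin].mean()  # average predicted confidence in the bin
        ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```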
It's important to note that calibration alone is not sufficient, because what we really care about is for the model to be as accurate as possible and also as calibrated as possible; we want both of these criteria simultaneously. A simple way a model can cheat its way to perfect calibration: imagine that on the test set all classes are equally frequent — if you take MNIST or CIFAR, all classes are balanced — so the model can trivially output the uniform distribution for all inputs. If you do that, then all of the samples end up in this one bucket here, the 0.1 bucket, because the model always predicts 0.1 confidence, and just because of the statistics of the data it can be perfectly calibrated. But it wouldn't be accurate at all, because it's just a random predictor.
So we also care about something called refinement, and about accuracy, and that's why, when we look at the calibration of models, it's also important to look at the accuracy of the underlying models. There are a bunch of other metrics that are commonly used, and a lot of these metrics were actually invented in the weather-forecasting literature, because they rely on probabilistic forecasters and it's very important to assess the properties of those forecasters.

This is a very nice reference if you want to learn more about proper scoring rules and evaluation measures for probabilistic forecasts, and I'll just briefly mention two commonly used measures. One is the negative log likelihood, which basically takes the logarithm of the probability the model assigns to the true outcome; it is a proper scoring rule. One implication is that it can sometimes over-emphasize tail probabilities, so it can be a bit sensitive to outliers.
That's something to be aware of. The other popular metric that a lot of people use is the Brier score, which is a quadratic penalty, as you can see here. Imagine you're doing a classification problem and you have a probabilistic forecast: you measure the mean squared error against your one-hot target and then take the average, and that's called the Brier score. A nice property is that it is bounded: the error on any single data point is bounded in the zero-to-one range, whereas the log loss can have quite a broad range, and this boundedness can be useful. The Brier score also has a couple of other nice properties.
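As an illustrative sketch (my own, not from the slides), here is how the two scores could be computed for a multi-class classifier; variable names are arbitrary:

```python
import numpy as np

def negative_log_likelihood(probs, labels):
    """Mean NLL: negative log of the probability assigned to the true class."""
    eps = 1e-12  # avoid log(0)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and one-hot targets."""
    one_hot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - one_hot) ** 2, axis=1))
```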
In particular, it turns out you can decompose the Brier score into calibration and refinement, which I mentioned on the previous slide. I'm not getting into the details, but you can find more in the paper cited above. The other property we care about for uncertainty is the behavior on out-of-distribution inputs, and an example of that is this figure here.
Imagine you are training a classifier on CIFAR-10, which is a popular benchmark dataset containing images of planes, birds, cats, and so on. Intuitively, if you ask this model about something that is not one of the existing classes — for instance the example on the right, which shows images from the Street View House Numbers dataset — then a classifier trained just on CIFAR-10 should say "I don't know", or, if it makes a prediction, it should be with very low confidence, because the input is not one of the existing classes. Humans are great at this.
Imagine you only speak one language: if I'm shown characters from a different language that I don't even understand, I can still say I don't know what it is, but I'm pretty sure it's not an English character, or something like that.

Humans are great at this, and we want our models to have this property too: they should be able to predict "none of the above", or say when examples don't belong to any of the existing classes. Some ways to measure this are to take the model confidence on the in-distribution (IID) inputs and the model confidence on what are called out-of-distribution (OOD) inputs, like SVHN above, and look at summary statistics — for instance, how separable the model's confidences are: the maximum confidence the model assigns, or the entropy of p(y|x), for in-distribution versus out-of-distribution inputs. You can also compute summary metrics of these statistics; for instance, you can measure the AUC, which measures how separable these curves are, and so on.
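A minimal sketch of that kind of evaluation (illustrative only, and assuming scikit-learn is available): score each input by its maximum softmax probability and measure how well that score separates in-distribution from OOD examples.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_auroc(probs_in, probs_ood):
    """AUROC for separating in-distribution from OOD inputs via max confidence."""
    conf_in = probs_in.max(axis=1)    # confidence on in-distribution inputs
    conf_ood = probs_ood.max(axis=1)  # confidence on OOD inputs
    scores = np.concatenate([conf_in, conf_ood])
    # label 1 = in-distribution; a good model gives those inputs higher confidence
    labels = np.concatenate([np.ones_like(conf_in), np.zeros_like(conf_ood)])
    return roc_auc_score(labels, scores)
```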
I've talked a lot about introducing uncertainty; let me spend a few minutes on some motivating applications, because it's very important to ground the research we're doing in why it's useful in the larger context of science. There are a lot of applications of predictive uncertainty.

One theme you will see over and over in this talk is that it's important to know when to trust model predictions, and especially under dataset shift this is going to be a big problem. Uncertainty is also useful for decision making, and I'll show some applications of that, and it's important for active learning, where you use the uncertainty to get more data in regions where you don't have a lot of data.
One use case is what's called natural distribution shift. We typically assume the test set comes from the same distribution as training, but that assumption is violated a lot in real-life applications. This is an example with Street View images: just natural variations. You don't have to do anything special to induce this kind of shift; it naturally emerges in a lot of data. You can imagine that over time the way storefronts look changes, so if you train a model on data that is ten years old or so, there may be a natural shift in how the images look. Similarly across countries: imagine you have training data from one country and you want to deploy the model on data from a different country; there will be natural variations in the data, so you want models to be robust to this type of natural distribution shift.
Another example, which I mentioned before, is open-set recognition, which is the case where the test input may not actually belong to one of the existing classes. A really nice example of this is the paper shown here, where the task is to predict bacterial species from genomic sequences. When you train the classifier at some point in time, you take all the bacterial species known at that point and train on them. But people have been discovering a lot of new bacterial classes — that's the blue line here — so there is still a lot we don't know. We can only ever train a classifier on the classes known at a given point in time, and when we deploy this classifier, it can achieve high accuracy if the test input belongs to one of the known classes. But since new bacteria keep being discovered, a lot of inputs will not belong to any of the existing classes. If the model encounters such an input at test time, we want it to reliably say that it does not belong to one of the existing classes and not wrongly classify it as in-distribution, because that type of misclassification can have a huge impact.

This is the setting called open-set recognition, where you don't know all the classes at training time and you really need a classifier that can reject such inputs in a reliable way. Another similar example is conversational dialogue systems; look at the example on the right.
Imagine you have a chatbot that can only answer questions about, say, your finances, and you ask it a different question, something about sports. Because it can only answer questions in its domain, if it responds with something unrelated, that can lead to a very frustrating user experience — we've all been there. A much more graceful way to fail is to say something like, "Sorry, I can only answer questions about this domain, and what you're asking is out of scope."

Another application where uncertainty is very important is medical imaging. There have been lots of papers on using uncertainty from deep learning models here; I have some images from these papers — check them out. One is related to diabetic retinopathy detection, another to eye disease classification from OCT scans.
Here the model is a multi-class classifier that predicts how severe the condition is, or which cases need to be seen by a doctor, and so on. In these cases it's very important to have uncertainty, because you have asymmetric loss functions, and you may want to take the model's uncertainty into account when deciding when to trust the model; if the model is unsure, we shouldn't just pass along its predictions but defer to a human, or something like that. There can also be out-of-distribution inputs: if the image is not taken properly, or is blurred, or maybe not centered, then the model should be able to reliably reject it so that the image can perhaps be retaken. So there are a lot of interesting use cases there. Uncertainty also comes up a lot in applications like Bayesian optimization and experimental design.
On the right I have an image showing Bayesian optimization in action. The way we use uncertainty is to decide the trade-off between so-called exploration and exploitation. In experimental design and Bayesian optimization we want to find the best set of hyperparameters, and we want to minimize the number of experiments, so we want to intelligently pick the next point and do an efficient search. If you have already evaluated one point, the model can assess the unobserved target at nearby points much more reliably, whereas at points that are far away from existing evaluations the model is more uncertain.

So here you care about making an accurate prediction of the target, but you also need the uncertainty, because you can use the uncertainty, via acquisition functions, to make better decisions about the exploration-exploitation trade-off and do a more efficient search.
I've talked a lot about applications, and I wanted to point to some concrete examples that highlight where current deep learning models fail, so you can appreciate that this is indeed a problem that comes up a lot in the context of deep learning methods.

For a lot of research we want benchmark datasets where we can do large-scale evaluations and compare different methods head to head, to understand which are the more promising methods and so on. One benchmark that has become quite popular in the field is ImageNet-C, which contains corrupted versions of ImageNet.
Basically, in typical benchmarks we use an IID test set from the same distribution as training, and we know deep learning methods can do very well there; as I mentioned, it's one of the success stories of deep learning. But what happens if you violate the train-test assumption? How gracefully do these models fail, and how robust are they? The way to measure this is to take some simple types of corruption — Gaussian noise, blurring, contrast changes, and so on — and then increase the intensity. Each of these operations can be applied with increasing strength (you can increase the amount of blurring you do), so that you get increasing dataset shift and have a knob to move progressively from IID to more and more out of distribution.
Then you can measure how different methods fail: intuitively we expect the model to be most accurate on the clean set and less accurate as the shift grows, and we can benchmark how the models do.

We ran a benchmark like this over a lot of methods in a previous paper, evaluating methods on the clean set and also measuring how the model accuracies drop as you increase the shift, and as expected you do see the model accuracy dropping as the shift increases. But the important thing from the perspective of uncertainty is that it's okay for the models to be wrong,
as long as their calibration — their own estimate of confidence — actually reflects this accurately. So we can measure the calibration; here is a plot of the expected calibration error of different methods. What you can see is that, unfortunately, the calibration of the models also becomes quite a bit worse as the shift increases. In essence, the models are becoming more wrong, but their confidence does not really reflect that: even when they are making mistakes, they are pretty confident. This is a big problem in deep learning, and I'll go into the details of the methods later, but this is just to showcase that it is still an unsolved problem.

The other problem I mentioned is that deep nets can assign high-confidence predictions to OOD inputs.
On the left are images from a classic paper which showed that, even when you show these images to a state-of-the-art deep network, it will assign very high confidence — more than 99% — to them, even though they are completely unrecognizable; you'd be surprised that it assigns such high-confidence predictions. You can also construct simple 2D examples, like the binary classification problem I'm showing here, where there are two classes shown in blue and orange and we have OOD inputs. If you look at the class boundaries, the models exhibit some uncertainty near the boundary, but there are still pockets of inputs, relatively far away from the training data, where the model is very confident. You would expect the model to be more uncertain there.
C: Okay — yeah, Jasper, would you like to... I can relay some of these questions to you. I think Jasper has been answering them. Yes, yeah.
C: Probably also the question about where you were adding noise to the images — I think a few slides earlier. This one, yeah.
A: Oh, I see. Okay, I think this is probably shot noise. Yeah, it's not Gaussian, but the noise also looks somewhat similar.
Oh, I see — somebody asked how you define accuracy in a regression task. That's a good point; I'm mostly focusing on classification as the running example in this talk, but for regression you can also define it. If you have a likelihood — imagine the model predicts a mean and a variance, or something like that — then you can evaluate the log likelihood under this predictive distribution, and you can have measures like mean squared error or mean absolute error and so on, which are measures of accuracy: how accurately the model recovers the underlying function. I mentioned the Brier score; it turns out you can also define similar measures on CDFs. Once you convert to CDFs you can basically reuse a lot of these measures, because they are all functions on zero to one, and there is in particular a very nice measure called the continuous ranked probability score, which measures the difference from the actual CDF.
A lot of the focus right now is on that kind of question. I think it's important to focus on the problem, but I also feel that if the model does not even pass the simple average-case checks you would expect, like CIFAR-10 versus SVHN, it's going to be really hard to solve the adversarial worst-case problem. So I think it's important to make progress on these other benchmarks as well before we start focusing on the worst case, in some sense.
B: Well, yeah — and I guess I'll take questions after my part, and I'm sure there'll be a little bit of time after the whole sequence of talks so that we can asynchronously answer questions. So I'll start by going into the language and frameworks that we're using to describe potential solutions to these problems.
Here — the next slide. This starts from taking a probabilistic approach to understanding how we solve machine learning and statistics-style problems. The very high-level overview of how this works is that, if you imagine some formalization of the scientific process, the scientific method: you start with some domain knowledge, you formalize it, you start making assumptions, and you bake them into an actual model — something that describes a generative process, or just some function that takes inputs and returns outputs. You have data; that data comes from actually running experiments if you're in the 1700s, or you have modern Amazon Mechanical Turk or something to gather the data, or you just scrape data from the web — anything can really count as data. Then, given those two inputs, you run an algorithm to infer the hidden structure that is behind your problem.

Usually that's a set of parameters that governs the family of distributions you have in your model. After running that algorithm you are finally able to make predictions and explore what your model can do on arbitrary inputs and outputs. This high-level overview is really important, because it's the foundation behind many fields that leverage data analysis, and this pipeline.
You can go to the next slide. This process is not really just a serial three-step process; it's actually a loop, in that, just like the scientific method, it's all about taking your models and really checking whether they actually fit the assumptions you care about. The formal name for this is model criticism, but more broadly in machine learning this is just how you evaluate methods: you have a leaderboard or something of that sort, you check something like accuracy, you see how well the model fits the data on unseen data — you test it — and then you go back and revise your model. You check specific assumptions, like maybe the way I'm composing conv layers is not really that indicative of anything.

Next slide, please. So that was the high-level overview, and where probabilistic machine learning particularly fits in is how it involves the language of probability theory in the individual steps of that process. So really you can think of a particular model, under the probabilistic approach, as a joint distribution.
Here, this is all going to be in discriminative land, because we're talking mostly about supervised learning, so we have a distribution over observed outputs y given inputs x, and there are going to be some parameters. The probabilistic model is a joint distribution over those outputs and those parameters; that's the model, and that's step one. Step two is about making inferences about the unknowns — in this case the parameters — and the posterior, what we call the posterior, is the conditional distribution of the parameters given the observed dataset. This is not a philosophical statement or anything; this is just Bayes' rule that I'm applying here.

It is a conditional distribution: you expand it out and you get the joint divided by the marginal distribution, and you can play around with distributions all you like. One of the central difficulties in probabilistic machine learning is that for most interesting models this denominator, the marginal likelihood, is not tractable, because it's a high-dimensional integration problem where theta, the parameters, is probably millions to billions of dimensions. So that's how we're thinking about the modeling step (step one) and the inference step (step two). Can you go to the next one?
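Written out in symbols (my notation, following the description in the talk), the posterior and the intractable marginal likelihood he mentions are:

```latex
p(\theta \mid x, y) = \frac{p(y \mid x, \theta)\, p(\theta)}{p(y \mid x)},
\qquad
p(y \mid x) = \int p(y \mid x, \theta)\, p(\theta)\, d\theta .
```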
So that's really the recipe for the probabilistic approach to machine learning, and how you might try to get at something like uncertainty estimation and robustness. Here you're specifying the likelihood — typically it's something like a neural net that ends in an output distribution, so maybe a categorical likelihood, or a Gaussian if you have continuous outputs, or something more complicated than those two — and you have a prior distribution over those parameters. Given that, step two is choosing some algorithm to actually perform approximate inference, and we'll go into more depth on how you select these things. Then step three is how you actually make predictions or do exploration. There are multiple approaches; the most generic one is to do a Monte Carlo estimate, sampling from the distribution.
Here what we're looking at is the distribution of the outputs given an unseen input — an arbitrary input x — conditional on your dataset. What we're doing is sampling from the posterior over theta and doing a Monte Carlo estimate: an average, over each sampled parameter set, of the distribution of the outputs conditional on that set of parameters. It's pretty easy to work out how this formula is derived, but that's the general recipe. Next slide, please.
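In symbols (my notation), the posterior predictive he describes, and its Monte Carlo approximation, are:

```latex
p(y^{\ast} \mid x^{\ast}, \mathcal{D})
  = \int p(y^{\ast} \mid x^{\ast}, \theta)\, p(\theta \mid \mathcal{D})\, d\theta
  \;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(y^{\ast} \mid x^{\ast}, \theta^{(s)}),
  \qquad \theta^{(s)} \sim p(\theta \mid \mathcal{D}).
```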
To connect this back to what people are familiar with — just how neural networks actually fit into this — you can think of a standard neural network as taking a point estimate, a specific set of parameters, to approximate that full posterior distribution. It's a very simple approach, it works well, and it's a simple baseline for a lot of the things we're talking about. The way you might want to select that point is to choose the highest-probability point under the posterior distribution.
You start from trying to take the maximum of your posterior p(theta | x, y). Equivalently, you maximize the log posterior, because log is monotonic and so preserves the mode. You can expand out the posterior: this is the log of the joint distribution, plus a constant which is the marginal likelihood; for reasons that may not be obvious from the derivation, that term is a constant with respect to theta, because the marginal likelihood does not depend on the parameters you're changing.

You can rewrite the maximization, and now you have the generic softmax cross-entropy problem with some prior. As a special case of this, if you're doing classification — so your likelihood is categorical — and, say, you have some prior on your parameters, it could be something like L2: that corresponds to taking a Gaussian prior with a particular standard deviation, given by lambda here. So that's exactly the procedure: the special case is softmax cross-entropy plus L2, and then you need a specific algorithm to actually find the mode, or the minimum.
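The chain of equalities he walks through, written out (my notation; the Gaussian prior with scale set by lambda gives the L2 term):

```latex
\hat{\theta}_{\text{MAP}}
  = \arg\max_{\theta} \log p(\theta \mid x, y)
  = \arg\max_{\theta} \big[ \log p(y \mid x, \theta) + \log p(\theta) \big]
  = \arg\min_{\theta} \Big[ \underbrace{-\textstyle\sum_{n} \log p(y_n \mid x_n, \theta)}_{\text{softmax cross-entropy}}
    + \lambda \lVert \theta \rVert_2^2 \Big].
```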
That algorithm is something like SGD. In the figure here there's a generic distribution; it's multimodal — in this case it has four different modes — and what SGD will do is try to find one of these modes and then use that to make predictions.
Next slide. The most natural extension, coming from typical neural nets, is to think about what happens if we use non-degenerate priors and non-degenerate procedures — ones that actually put probability mass beyond a single point. The two ingredients here are: first, think about what that prior p(theta) is. There are many different choices for how you might choose p(theta), as shown by the figure on the right. There are a lot of different behaviors you might want to consider, from sparsity — how peaked the distribution is at zero — to the tail behavior, which is how much probability mass sits at the ends of the spectrum, say around three or higher and negative three or lower. And then, after specifying the prior,
you have a family of distributions you're going to use to approximate the true posterior, and we'll go into how you select that. After you fit it — after you find the specific member of that family that matches the posterior well — you might have something like the bottom figure here: if you have a sufficient amount of data on this interval, from about negative 2 to 1.5, you'll fit the data pretty well, but on unobserved inputs you've never seen — out-of-distribution inputs, anything beyond that interval — your uncertainty grows, with some behavior, linear or exponential or something. The predictive standard deviation grows as you try to extrapolate, which is the desired behavior you want for uncertainty estimation.
The first approach to doing this inference is variational inference. Variational inference is a fairly simple procedure: it takes the inference problem and casts it as an optimization problem. You have a family of distributions — a parametric family — and a common choice is something like a mean-field, or fully factorized, distribution. So here there is q(theta), parameterized by a set of parameters lambda, and we factorize it so that there's a q(theta_i) for each theta_i — each weight element, or each element of the bias terms — and you typically might choose each of these variational distributions to be something like a Gaussian, similar in form to the prior.
The loss function takes the expectation of your log-likelihood term with respect to your approximate posterior q(theta), plus a KL term. Algorithmically, how this works is that you might do something like sampling from q to Monte Carlo estimate that expectation — similar to how you might Monte Carlo estimate the posterior predictive when you're doing test-time predictions — and then you take gradients: you just backprop through it with SGD, and most of the time it will work, with a few footnotes.

How might you interpret what this loss function is doing? You can think of the negative of this loss as a lower bound on the marginal likelihood; this is known as the evidence lower bound. It's the likelihood-style interpretation under which VI was actually first invented. It comes from the EM algorithm, where you take the marginal likelihood, which you want to maximize — so you're doing something like maximum likelihood — and you derive a bound on it using an approximate posterior, and then you try to best fit that approximate posterior, getting a tighter and tighter bound on the true likelihood. This bound holds as a less-than-or-equal for all parameters lambda, and if the variational posterior were exact — equal to the true posterior — then this would be an equality rather than a strict inequality.
There's also a code-length view of this, coming from the minimum description length perspective, or the coding-theory perspective: if you look at the first term here, what you're trying to do is minimize the number of bits you're using — the flexibility you have is in choosing lambda — so you're minimizing the number of bits required to explain the data. Every time you evaluate this log-likelihood term you're paying a certain number of bits required to reconstruct the outputs given the inputs; but as a trade-off you don't want to pay too many bits, because the second term is the counterweight: if you move lambda too much, you might be deviating too much from the prior, so the KL penalty — the penalty you pay for deviating from the prior — becomes too large. Those are the two trade-offs you want to balance from this perspective.
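To make the recipe concrete, here is a heavily simplified sketch of my own (not from the talk) of mean-field variational inference for a single linear layer in PyTorch: a factorized Gaussian q over the weights, a standard normal prior, and the negative ELBO (expected negative log likelihood plus the KL term) estimated with one Monte Carlo sample and trained with SGD. All names and sizes are arbitrary.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

class MeanFieldLinear(torch.nn.Module):
    """Linear layer with a fully factorized Gaussian posterior over its weights."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.loc = torch.nn.Parameter(0.01 * torch.randn(d_out, d_in))
        self.log_scale = torch.nn.Parameter(torch.full((d_out, d_in), -3.0))

    def forward(self, x):
        q = Normal(self.loc, self.log_scale.exp())
        w = q.rsample()  # one reparameterized Monte Carlo sample of the weights
        prior = Normal(torch.zeros_like(self.loc), torch.ones_like(self.loc))
        self.kl = kl_divergence(q, prior).sum()  # KL(q || prior)
        return x @ w.t()

layer = MeanFieldLinear(d_in=10, d_out=3)
opt = torch.optim.SGD(layer.parameters(), lr=1e-2)
x, y = torch.randn(32, 10), torch.randint(0, 3, (32,))
for _ in range(100):
    logits = layer(x)
    nll = F.cross_entropy(logits, y, reduction="sum")
    loss = nll + layer.kl  # negative ELBO: data-fit bits + deviation-from-prior bits
    opt.zero_grad(); loss.backward(); opt.step()
```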
Okay, next slide, please. The first thing, in terms of what's known about how you select the prior: there are a lot of details in how we do this, and it is in fact still an open challenge, but the standard choice is a normal prior with zero mean and unit standard deviation. That's the default everyone uses, but it's not necessarily the best prior to use, and there are many reasons why we don't really want to use standard normal priors in practice.

We can look at this from the perspective of what the model looks like and what draws generated from the model look like; it ranges from how we might leverage information within the network structure, to what the asymptotic behavior looks like and whether that's the behavior we'd like, to how we actually select the prior.
If we have specific domain knowledge — for instance, if we want to encourage exploration — how might we encode such a thing? And you can go into actual trainability properties: if you were to actually use this prior with SGD, what behavior does that cause, what inductive biases does it place on SGD? This ranges from the parameterization — certain priors are not invariant to reparameterization, so if you just change the way you parameterize your neural net, that can lead to very different end behaviors when you optimize and get your final solutions — to the fact that the standard normal is also simply too strong a regularizer, particularly during the early stages of training. If you've ever tried training Bayesian neural nets in practice, what often happens is that the magnitude of the gradient signal for the KL penalty is much higher than what's required to fit the data, i.e. the expected log-likelihood term.

What then happens is that you collapse: the majority of your approximate posterior distribution just ends up equal to the prior, so you won't really be fitting the data at all. There's a lot of recent work improving how we select the prior, coming from thinking about priors in function space — thinking about exactly how inputs and outputs should behave, with a distribution over this non-parametric space.
This is a question that has been long-standing; it's been a question since we first started fitting probabilistic neural network models with backprop in the late 80s. As I was mentioning, the most common choice coming from the late 80s was in fact this mean-field, fully factorized approach, where you might have a Gaussian per element of your weight matrix, and so on. But there are many different choices: you can do mixtures of mean-field distributions, you can do structured factorizations, and there are also hierarchical versions of these things. So there are a lot of different strategies in the literature, and many individual publications are all about choosing a better approximate posterior family that works well with VI. Next slide, please.
As an alternative to approximate inference with VI, you can also do something with Markov chain Monte Carlo, where, instead of having a parametric family of distributions q that you optimize to find the member that best represents the posterior, you do a non-parametric thing: more of a guided random walk, exploring within the true posterior and collecting samples as you explore that space.

If you look at the test-time behavior, when you're making predictions you're just using a bunch of different samples — S different samples of theta — and MCMC is a way of drawing those samples. The way it works is by using what's called the energy.
As in the previous slide, this is just the negative log joint density: the first term is the likelihood, so it sums over each data point, and then you have your prior.
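Written out (my notation), that energy is the negative log joint the sampler explores:

```latex
U(\theta) = -\sum_{n=1}^{N} \log p(y_n \mid x_n, \theta) - \log p(\theta),
\qquad p(\theta \mid \mathcal{D}) \propto e^{-U(\theta)} .
```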
MCMC is really many different strategies for how you might carefully walk, leveraging this energy function, to better explore the full posterior and give you those samples. Next slide, please. MCMC for neural nets is also a very classic thing.
I think it became the standard for how you would do Bayesian neural nets, and probabilistic modeling with neural nets, in the 90s; it was in fact winning a lot of high-profile competitions back in the day. A lot of the ideas come classically from statistical physics: they leverage Hamiltonian dynamics or Langevin dynamics to give you ways to take a specific sample and move through the space while preserving certain properties that you care about for MCMC, and modern methods are leveraging a lot of these ideas.

It used to be the case that MCMC for deep learning really didn't work, because MCMC wasn't as amenable as VI to stochastic optimization, but in recent literature — within the past year or so — there have been a lot of interesting works that have gotten pretty good results with MCMC.

As a caveat, though, MCMC is not the perfect solution. There are many tricks required to get it to work, and there's a lot of impracticality in how you use these things, because the procedure tends to be fairly expensive compared to SGD.
The next slide will be about simpler baselines — a step back from these fairly complex strategies for picking the different pieces of the recipe — but before I hand that off to Jasper, maybe I can take questions and start answering those.
D: Dustin, one of the questions was: can you please provide links or pointers for learning about this topic? That came up when you were talking about priors and posteriors.
B: That's a great question. I don't think there is one canonical link or canonical paper for selecting priors. I think the Neal 1994 work — when we share the PDF for this you can get the actual reference from it; if you search for Radford Neal, "Priors for infinite networks", or something like that — goes into a bit more depth on the asymptotic perspective: what happens if you take a neural net and go to the infinite-width limit.

You can also look at the first paper combining neural nets, backprop, and VI in this recent literature, which would be Blundell 2015; it's an ICML paper that's pretty good at describing some of this. And there's a slew of more recent literature, so once the PDF is available you can check out a lot of the more recent papers that study priors.
D: Also, some early work by David MacKay is really great, kind of setting the standard for thinking about these things. — Yeah, I agree.
B: Jasper, would it make sense to use a prior that resembles the distribution we're trying to model, and in that case, how can we get the underlying distribution?
D: That's a really great question — it's something I think about quite a bit. Yes, if you know what the structure of the problem is — maybe the form of the function you're trying to regress — then ideally you would specify a prior that takes advantage of that. With deep neural nets that's harder, particularly with structured inputs. We do certainly capture something like that with data augmentation, for example, where effectively we're imposing a prior by saying our model should output essentially the same thing for slight rotations of an image, or something like that. But in neural nets it's really hard to specify a prior on the form of the function; it's kind of implicit in the architecture, the initialization, and lots of stuff around it. I'm going to talk about Gaussian processes in a second, and there...
B
I
think
the
the
best
way
to
describe
this
is
that
without
going
too
much
into
sort
of
the
equations
of
it,
you're
leveraging
chemical
dynamics
to
sort
of
preserve
some
properties
of
the
specific
sample
that
you're
you're
you're
using
there's,
there's
particular
ways
of
of
leveraging
something
known
as
sort
of
like
the
the
leapfrog
integrator
to
actually
propose
the
next
step
that
you're
doing
there's
a
lot
of
sort
of
complications.
B
How
like
leveraging
sdds
to
sample
to
get
new
points
works
with
this
sort
of
from
by
transitioning.
From
that
point,
there's
a
lot
of
like,
for
example,
there's
like
discretization
behavior.
That's
that
there's
sort
of
discretization
that
you
have
to
do
and
that
sort
of
causes
inaccuracies.
B
So
the
ultimate
procedure
that
you
might
do
with
leveraging
these
sort
of
different
equations
is
to
use
it
within
a
metropolis
hastings
procedure
where
you're
actually
proposing
and
then
you're
you're
checking,
if
that,
if
that
falls
under
a
particular
ratio
and
you're
accepting
you're,
rejecting
that
sample
and
you're
coming
and
so
on
and
so
forth,
I
think
there
could
be
an
entire
lecture
on
just
mcmc.
The next question is: how do you pick your prior if your data has multiple different distributions in it? I guess it depends on what you mean by multiple different distributions. If you have a dataset, there is in some sense one distribution that it has, but it might have something like multimodal behavior, or you might be thinking about extrapolation behavior, where the training set you're using is very different from your test set.

If you have a better sense of whether the distribution is multimodal, or whether the distribution you're making predictions on is actually within the training distribution but concentrates more on a certain area, those things are better encapsulated through functional priors. I think the best answer for how to think about functional priors is something Jasper will also touch on when he goes into Gaussian processes.
Thoughts on UQ methods based on gradient perturbation — I'm not sure which particular method you're referring to. There are adversarial methods that use gradient perturbations to be robust to certain examples; those are evaluated on adversarial examples, which are a form of OOD. They don't work as well on the other kinds of shift: they might solve the worst-case behavior when you're looking at an epsilon ball around the inputs you're testing, but if you evaluate them on standard leaderboards — on a corrupted version of CIFAR or ImageNet, or a more natural version where you just rerun the original data-gathering process and get a new dataset — they don't work super well. But I think they're super exciting in the case where we do care about worst-case behavior.
D: Stochastic gradient Langevin dynamics, or stochastic gradient MCMC, can be thought of as gradient perturbation: you just do stochastic gradient descent, but you add a little bit of noise to the gradients at every step, and that has the effect of performing this random-walk trajectory. So that is one way to get a sample from the posterior through perturbing the gradients.
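A minimal sketch of that update (my own illustration): an SGD step on the negative log joint, plus Gaussian noise whose variance matches the step size, as in stochastic gradient Langevin dynamics.

```python
import torch

def sgld_step(params, neg_log_joint, lr=1e-4):
    """One stochastic gradient Langevin dynamics update.

    params:        list of tensors with requires_grad=True
    neg_log_joint: scalar estimate of -log p(y|x, theta) - log p(theta)
                   (minibatch likelihood term rescaled to the full dataset)
    """
    grads = torch.autograd.grad(neg_log_joint, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            noise = torch.randn_like(p) * (lr ** 0.5)  # injected Gaussian noise
            p.add_(-0.5 * lr * g + noise)              # gradient step + noise
```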
So we'll start out with recalibration. A really simple idea for getting better uncertainty, you might imagine, is just to explicitly recalibrate the model, and one way you can do this is something called temperature scaling: you take a held-out validation set, or an out-of-distribution set, and you rescale the temperature of the output distribution. You basically either smooth out the output probabilities or sharpen them to better match that validation set. That's temperature scaling, and you can see it's just optimizing the temperature.
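A small sketch of that idea (my own, assuming a PyTorch classifier): fit a single scalar temperature on held-out logits by minimizing the NLL, then divide logits by it at prediction time.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Learn one temperature T on a held-out set (logits are pre-softmax)."""
    log_t = torch.zeros(1, requires_grad=True)  # T = exp(log_t), starts at 1
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return log_t.exp().item()

# At prediction time: calibrated_probs = softmax(test_logits / T)
```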
Okay. Another strategy that's been very popular: Yarin Gal and Zoubin Ghahramani had a paper where they proposed effectively approximating an ensemble through dropout. If you're familiar with dropout, it's a regularization technique where you stochastically drop out hidden units during training, and the innovation here is to keep these stochastic units at test time and average over the predictions. So you basically drop out units when you're predicting and then average over the predictions, and that gives you at least better uncertainty than not doing anything, and it's also a pretty competitive baseline. Next slide.
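A minimal sketch of Monte Carlo dropout at prediction time (illustrative; assumes a PyTorch model containing `nn.Dropout` layers): keep dropout active and average softmax outputs over several stochastic forward passes.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=20):
    """Average predictions over stochastic forward passes with dropout left on."""
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()  # keep dropout stochastic while the rest stays in eval mode
    probs = torch.stack([
        torch.softmax(model(x), dim=-1) for _ in range(n_samples)
    ])
    return probs.mean(dim=0)  # predictive mean; probs.var(0) gives a spread
```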
Okay, so the next thing is deep ensembles. Here the idea is basically: just rerun standard stochastic gradient training, or whatever optimization method you like, with different random seeds, and then average the predictions of the different models you end up with. These turn out to be diverse, so we get a nice diverse set of predictions. Balaji, along with his co-authors, tried this out — they tried a whole bunch of things and were surprised at how effective ensembles were — and there's a fantastic paper discussing why this is the case, the "Simple and Scalable Predictive Uncertainty Estimation" paper. Next slide.
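The recipe itself is tiny; here is a rough sketch of my own (PyTorch-style), where `make_model` and `train_fn` are hypothetical placeholders for your architecture constructor and training loop:

```python
import torch

def train_deep_ensemble(make_model, train_fn, n_members=5):
    """Train the same architecture from different random seeds (illustrative)."""
    members = []
    for seed in range(n_members):
        torch.manual_seed(seed)  # different init (and data order) per member
        members.append(train_fn(make_model()))
    return members

@torch.no_grad()
def ensemble_predict(members, x):
    # Average member softmax outputs; disagreement between members is the
    # model (epistemic) uncertainty signal.
    return torch.stack([torch.softmax(m(x), -1) for m in members]).mean(0)
```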
Balaji showed you this picture earlier, talking about how accuracy degrades; but if you look at calibration — these are the corruptions on ImageNet, looking at calibration — it also gets worse. That's from a benchmark paper we ran a year or so ago, and there the shining thing that did really well — or maybe I should say, didn't do quite as badly — in terms of calibration was ensembles.
So why do these work well in practice? Oh — Balaji... I would ask you to imagine a beautiful animation here where it says "space of solutions", but maybe I'll just describe what it is. You might ask why deep ensembles seem to do better than these very rigorous things like variational inference or Markov chain Monte Carlo that Dustin talked about, and one of the reasons is that these approximate Bayesian methods in general tend to start in a random place, then find a mode, and then locally explore that mode.
For our purposes, where we want to serve a giant model at very high throughput, you might imagine that's undesirable: you need to do a forward pass through each model, and typically we've found that we try to get the biggest neural nets we possibly can to fit into memory, and that makes it impossible to carry around multiple copies.
Oh yeah — and Bayesian neural nets also seem very promising, but with MCMC, for example, you need to carry around many samples as well, so you also have to carry around all these different models. So Balaji and Dustin and I have spent a considerable amount of effort thinking about how we can approximate these approaches in cheaper ways, along with other fantastic researchers in the community, such as Andrew Gordon Wilson, Yarin Gal, and more. Next slide, please.
please,
okay,
so
one
of
the
things
that
that
actually
dustin
and
collaborators
ethan
when
ed
al
came
up
with
was
what,
if
we
take
a
standard
neural
network
and
we
say
that
each
layer
has
a
rank,
one
factor
that
gets
multiple
multiplied
by
another
rank,
one
factor
that
then
forms
the
size
of
the
weight
matrix
and
then
multiplicatively
multiplied
by
the
weight
matrix.
D
And
so
you
can
imagine
that
this
is
kind
of
modulating
the
weights
of
the
ons
of
the
single
neural
net,
such
that.
If
you
have
multiple
of
these
rank
one
factors,
then
you
can
have
multiple
effectively
different
paths
through
the
network
from
the
bottom
to
the
top
and
then
averaging
over
a
bunch
of
these
random
vectors
multiplied
by
the
network
kind
of
gives
you
an
implicit
ensemble
that
you
can
then
compute
in
a
really
efficient
way,
just
using
batching
next
slide.
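A rough sketch of that kind of layer (my own illustration of the idea, not the paper's code): a shared weight matrix modulated elementwise by per-member rank-one factors s r^T, so different members reuse the same weights and can be evaluated in one batched pass.

```python
import torch

class RankOneEnsembleLinear(torch.nn.Module):
    """Shared weight W modulated by per-member rank-one factors (illustrative)."""
    def __init__(self, d_in, d_out, n_members):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(d_out, d_in) * 0.01)
        self.r = torch.nn.Parameter(torch.ones(n_members, d_in))   # input factors
        self.s = torch.nn.Parameter(torch.ones(n_members, d_out))  # output factors

    def forward(self, x, member):
        # Equivalent to using the member-specific weight W * (s r^T),
        # computed without materializing a separate weight matrix per member.
        h = (x * self.r[member]) @ self.weight.t()
        return h * self.s[member]
```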
D
So
we
took
this
this
idea
further
and
said:
well,
can
we
come
up
with
a
bayesian
interpretation
of
this
and
and
actually
placed
a
a
variational
posterior
on
these
rank
one
factors,
so
we
place
a
distribution
on
the
rank,
one
factors,
and
then
we
optimize
them
via
the
elbow
that
dustin
told
you
about,
and
there
then
the
product
is
you
have
this
model
with
a
distribution
over
rank,
one
factors
and
you
can
sample
rank
one
factors
that
then
modulate
each
layer
of
the
network
and
kind
of
modulate
the
path
through
which
the
data
goes
from
the
bottom
to
the
top
and
averaging
over
a
bunch
of
these
samples,
implicit
ensemble.
D
One way to get even closer to ensembles is to say you have a mixture distribution over these rank-one factors, so you sample a rank-one factor from a mixture of rank-one factors, and you could imagine each mixture component kind of corresponding to an ensemble member. And we found empirically that this performs really well and actually gives better calibration than even a standard ensemble on a bunch of problems, while incurring only a tiny addition in terms of extra parameters. Next slide.
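As a rough sketch of the sampling side of this idea only (training via the ELBO is omitted, and all names, shapes, and scales here are made up for illustration rather than taken from the actual rank-1 method): put a mixture of Gaussians over the rank-one factors, sample a component and a factor per forward pass, and Monte Carlo average the outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, K = 8, 4, 3

W = rng.normal(size=(d_in, d_out))                 # shared (deterministic) weights
# Mixture of K Gaussian posteriors over the rank-one factors, one component per "member".
mu_r, sigma_r = rng.normal(size=(K, d_in)), 0.1 * np.ones((K, d_in))
mu_s, sigma_s = rng.normal(size=(K, d_out)), 0.1 * np.ones((K, d_out))

def sample_forward(x, n_samples=10):
    """Monte Carlo predictive: sample rank-one factors, modulate W, average outputs."""
    outs = []
    for _ in range(n_samples):
        k = rng.integers(K)                                   # pick a mixture component
        r = mu_r[k] + sigma_r[k] * rng.normal(size=d_in)      # reparameterized sample of r
        s = mu_s[k] + sigma_s[k] * rng.normal(size=d_out)     # reparameterized sample of s
        outs.append(((x * r) @ W) * s)
    return np.mean(outs, axis=0)

x = rng.normal(size=(5, d_in))
y_mean = sample_forward(x)
```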
D
So
there
there
is
a
particular
instance
in
which
we
can
actually
compute
the
marginal
likelihood.
So
this
integral
that
dustin
told
you
about
as
well,
analytically
and
not
have
to
approximate
it,
and
that
arises
if
we
assume
that
there's
a
gaussian
distribution
on
the
likelihood
a
gaussian
distribution
on
the
prior,
then
multiplying
two
gaussians
gives
a
gaussian
and
an
interval
over
a
gaussian
gives
another
gaussian.
So
we
can
do
everything
in
closed
form
and
compute.
D
So what you end up with is a flexible distribution over functions, this giant Gaussian. It's specified now by a covariance function over examples. If you're familiar with the kernel trick, then that's exactly what's happening here: the kernel effectively becomes the covariance matrix of this big Gaussian, and the parameters disappear entirely because we've integrated them out. And then, if we condition on data, we get a nice posterior over functions. So on the right there you can see samples from a Gaussian process prior, so sampled functions, and then conditioned on data.
D
That was interesting, Google Assistant just answered a question that I didn't ask, for some reason. Okay, so they are specified by a mean function and a covariance. The covariance is that kernel that I told you about, and we can compute effectively everything that we want analytically. So the predictive mean and covariance given observations is this equation here, this mu, which unfortunately involves inverting this kernel, this covariance matrix, and then the variance of predictions is below, which also involves the covariance between test examples and the training set and the covariance between all training examples.
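For reference, these are the standard GP regression predictive equations being described here, assuming a zero mean function and Gaussian observation noise with variance $\sigma^2$:

$$
\mu_* = K_{*X}\left(K_{XX} + \sigma^2 I\right)^{-1} \mathbf{y}, \qquad
\Sigma_* = K_{**} - K_{*X}\left(K_{XX} + \sigma^2 I\right)^{-1} K_{X*},
$$

where $K_{XX}$ is the covariance between all training examples, $K_{*X}$ the covariance between test and training examples, and $K_{**}$ the covariance between test examples.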
D
So you might look at this and say: oh, I have to compute a covariance matrix over my training data, which is n squared in size, and then I have to invert it, which is cubic in the number of operations.
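A minimal numpy sketch of exact GP regression, just to make the cost concrete (the kernel and data here are arbitrary): building K is the O(n^2) memory step and factorizing it is the O(n^3) step.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential covariance between the rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, X_star, noise=1e-2):
    """Exact GP regression posterior; the Cholesky factorization is the O(n^3) step."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))   # n x n covariance of the training set
    K_s = rbf_kernel(X_star, X)                     # covariance between test and train
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s @ alpha                              # predictive mean
    v = np.linalg.solve(L, K_s.T)
    var = rbf_kernel(X_star, X_star) - v.T @ v      # predictive covariance
    return mean, np.diag(var)

X = np.linspace(-3, 3, 25)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).normal(size=25)
mean, var = gp_predict(X, y, np.linspace(-4, 4, 100)[:, None])
```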
D
So
gps
are
typically
only
used
in
very,
very
low
data
regimes,
they're
kind
of
seen
as
the
the
state
of
the
art
or
the
gold
standard
and
getting
good
uncertainty
estimates
especially
for
regression,
but
because
of
the
scaling
issue,
they're
they're
kind
of
limited
to
smaller
problems.
D
Intuitively
there
are
prior
for
smooth
functions,
similar
outputs
should
have
similar
inputs
should
have
similar
outputs
and,
and
we
can
compute
all
the
quantities
that
we
want
easily.
Analytically
next
slide.
D
Okay, so you might wonder: why are you telling me about Gaussian processes? This is a deep neural net lecture. And the reason is that in the limit of infinite width, if you assume a Gaussian prior, then integrating over your parameters, which gives you good uncertainty and is kind of the Bayesian thing to do, converges to a Gaussian process.
D
So this is a seminal result that again came from Radford Neal's PhD thesis, which is kind of this amazing tome of literature that he produced. You can think of it as saying that the covariance matrix, the kernel, is basically just a covariance taken over the hidden-layer activations: you just take the inner product of the hidden-layer activations, and that gives you the distance, or the similarity, between two examples.
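Just to illustrate that statement, here is a Monte Carlo sketch rather than the analytic infinite-width kernel: sample random hidden layers from a Gaussian prior and average the inner products of the resulting activations; as the width and number of sampled networks grow, this estimate approaches the corresponding NNGP kernel.

```python
import numpy as np

def empirical_nngp_kernel(X, width=2000, n_nets=50, seed=0):
    """Monte Carlo estimate of the infinite-width kernel:
    K(x, x') ~ E_w[ phi(x; w) . phi(x'; w) ], averaged over random one-hidden-layer nets."""
    rng = np.random.default_rng(seed)
    K = np.zeros((len(X), len(X)))
    for _ in range(n_nets):
        W = rng.normal(size=(X.shape[1], width)) / np.sqrt(X.shape[1])  # scaled Gaussian prior
        H = np.maximum(X @ W, 0.0)                                      # ReLU hidden activations
        K += H @ H.T / width                                            # inner product of activations
    return K / n_nets

X = np.random.default_rng(1).normal(size=(10, 3))
K = empirical_nngp_kernel(X)
```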
D
Chris Williams came up with an actual covariance function for a particular kind of neural net in '97, and then very recently there's been a lot of renewed interest from the deep neural net literature. There's a couple of citations here, but there's been a bunch of fantastic work establishing the connection between GPs and deep neural networks in the infinite limit and coming up with the GP equivalent of interesting architectures like convolutional networks and so on.
C
We have six minutes left, but I think it's fine if we go a little bit over time.
D
Okay, so essentially one way to get good calibration is to come up with implicit priors or inductive biases that capture the kind of out-of-distribution data that you expect you would see, and data augmentation is a good way to do that. That's something we've been exploring, and it tends to help calibration significantly. Next slide.
D
Another thing we're really interested in is trying to come up with a drop-in replacement for standard neural networks. So instead of having to follow this complex machinery or carry around an expensive model, wouldn't it be nice if you could just take your existing model, augment it in some way, and then have good uncertainty? One way of doing that is to slice off the top and stick a Gaussian process on, and there are certainly some complexities to how to make that work well, but that's something we're really excited about, and it seems to work quite well in practice. Next slide.
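As a rough illustration of that "slice off the top and put a GP on the features" idea (not the actual method being referred to here, which relies on scalable approximations): train a network, take its penultimate-layer activations as features, and fit a GP classifier on those features. The `penultimate_features` helper below is hypothetical glue code for scikit-learn's MLPClassifier.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

base = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=0)
base.fit(X, y)

def penultimate_features(net, X):
    """Run X through all hidden layers of a fitted MLPClassifier (ReLU activations)."""
    H = X
    for W, b in zip(net.coefs_[:-1], net.intercepts_[:-1]):
        H = np.maximum(H @ W + b, 0.0)
    return H

Z = penultimate_features(base, X)
gp_head = GaussianProcessClassifier(kernel=RBF(length_scale=1.0))
gp_head.fit(Z, y)                      # replaces the network's final linear layer
probs = gp_head.predict_proba(Z)       # predictive probabilities from the GP head
```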
And then, ensembles: you could imagine going a lot further than just doing random initializations.
D
And the answer is yes. Within our team at Google we are open sourcing as much as we possibly can, so there's some great code to specify models and run them in Edward2.
D
Then there's a code base that we just open sourced, called Uncertainty Baselines, which effectively contains a lot of pre-made models to run across a bunch of benchmarks, including that ImageNet-C benchmark we talked about and a whole bunch of others.
D
If you want to try a new model and run it across a bunch of uncertainty benchmarks, you can do that using that code base. And then there's Uncertainty Metrics, another code library that we just open sourced, which contains canonical implementations of things like the Brier score and ECE, so that everyone can share the same implementation of metrics. Next slide.
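These metrics are standard; as a reference point only (this is the textbook definition, not the library's implementation), a minimal numpy version of ECE and the Brier score looks like this:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Standard ECE: bin predictions by confidence and average |accuracy - confidence|,
    weighted by the fraction of examples falling in each bin."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(accuracies[mask].mean() - confidences[mask].mean())
    return ece

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and one-hot labels."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))
```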
All right, and then we'll finish off with some open challenges, things that we're thinking about. So, one is the following.
D
Florian Wenzel, Sebastian Nowozin, a bunch of others, and I put this paper up online exploring why that's the case: why does the Bayesian approach seem to not always outperform the non-Bayesian approach? And there are, I think, some really interesting technical challenges we need to get past to answer that. What are good priors? Dustin talked about that. What's the role of the choice of architecture, hyperparameters, heuristics like batch norm? Are they Bayesian? Are they not Bayesian?
D
Do they give better uncertainty or not? How do we efficiently marginalize over high-dimensional neural net posteriors? So better approximations are certainly a strong research area right now, along with getting a better understanding of OOD behavior and formulating a more rigorous Bayesian interpretation of deep ensembles.
D
That
is,
I
think,
on
archive
now
that
that
really
tries
to
pin
down
this
question
and
then
we
need
better
benchmarks.
So
we
need
realistic
benchmarks
that
reflect
real
world
challenges,
which
maybe
some
of
you
have
immediately
have
ideas
where
you
know.
If
you
get
better
uncertainty
on
on
your
problem,
which
is
like
a
real
scientific
problem,
then
it's
it
would
be
really
meaningful
and
it
would
be
really
useful.
I
think,
for
the
for
the
community
to
develop
across
those
those
benchmarks,
all
right
next
slide.
A
We have a lot of references in the intermediate slides where we present this stuff as well. We'll add them back here too, so that you can find them all in one place.
D
Yes, they effectively do. If you're familiar with the kernel trick, then you probably wouldn't be asking this question, so maybe that's not the right way to answer it, but effectively they have an infinite number of parameters, and the way that it works is that you can actually compute the integral.
D
So
what
you
want
is
the
covariance
between
examples
over
the
the
last
layer
of
and
that's
for
example,
and
so
what
you
do
is
you
say
I
have
theta
of
x
times,
theta
of
x,
prime,
the
inner
product
of
those
two,
and
that
will
give
me
the
covariance
between
these
two
examples
at
the
end
of
the
neural
net.
D
Then,
if
I
compute
an
integral
over
that
from
negative
infinity
to
infinity
effectively
marginalizing
over
all
possible
weights,
then
I
can
actually
compute
that
integral
the
integral
over
that
inner
product,
analytically,
which
is
exactly
the
construction
that
happens
for
most
of
the
the
kernels
in
svm.
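Written out, the quantity being described is roughly the following, where $\phi_\theta(x)$ denotes the last-layer activations and $p(\theta)$ is the Gaussian prior over the weights:

$$
k(x, x') = \int \phi_\theta(x)^\top \phi_\theta(x')\, p(\theta)\, d\theta
= \mathbb{E}_{\theta \sim p(\theta)}\!\left[\phi_\theta(x)^\top \phi_\theta(x')\right].
$$

For common activations and a Gaussian prior, this expectation has a closed form, which is the same construction behind many classical SVM kernels.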
D
It's clearly a pretty nuanced thing, and super elegant once it kind of clicks, but it might take more than a few minutes to explain in all its glory.
B
So I think the best way to choose could be, if you just look at... actually, I think this is done pretty well in terms of the leaderboards that they have in the open-source code. Ultimately, what matters isn't really the algorithmic approach but things like compute: how much compute you have, what sort of assumptions you're making with the model, and what you can better impose in your model.
B
So,
given
those
things
you
can
sort
of
just
choose
the
the
top
one,
but
of
course,
if
you're
doing,
if
you're
doing
research
like
methodological
research,
you
can
always
just
sort
of
choose
your
favorite
one
see
if
you
can
advance
it
a
little
bit
more
best
solutions
for
you.
Jasper.
Do
gps
work
well
in
modeling
temporal
data
that
is
irregularly,
sampled,.
D
Good question. I would say yes, but with a caveat. So, Gaussian processes... I think a previous question kind of alluded to this, right? Like, how do you specify a prior over the function that you care about? And Gaussian processes give you a really nice toolkit to do that effectively.
D
You can basically say: here is kind of the model of the space of functions that I might imagine seeing, and in GPs we do that by specifying a kernel function. That might be: maybe I think it's twice differentiable and really smooth, or maybe I think it's periodic, and there's a covariance function for that.
D
You might also specify, okay, I think it's a step function, and you can compute a kernel which is like the inner product of infinitely many step functions, so you can really carefully specify what your model is using a GP. And so it may be: I think it's a dynamical system that is irregularly, not regularly, sampled, and you can carefully specify that model.
D
So I would say, if I was modeling that data, I would probably use a GP and not a neural net, unless it's a large data set, but I would very carefully specify the model and think about it before trying to apply a standard GP with a standard kernel.
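For example, a minimal scikit-learn sketch of hand-specifying such a prior for irregularly sampled times; the kernel combination here (smooth trend plus periodic component plus noise) is purely illustrative and should encode what you actually believe about the dynamics.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 10, size=40))[:, None]     # irregularly spaced observation times
y = np.sin(2 * np.pi * t.ravel() / 3.0) + 0.1 * t.ravel() + 0.1 * rng.normal(size=40)

# Prior belief: a smooth long-term trend + a periodic component + observation noise.
kernel = (RBF(length_scale=5.0)
          + ExpSineSquared(length_scale=1.0, periodicity=3.0)
          + WhiteKernel(noise_level=0.01))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(t, y)

t_star = np.linspace(0, 12, 200)[:, None]
mean, std = gp.predict(t_star, return_std=True)        # predictive mean and uncertainty
```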
A
So that's a great question. There has been some work on using Bayesian inference for generative models as well. I think the methodology can definitely apply: integrating out parameters is a quite general principle, and you can do that.
A
There's a parallel literature which... there are actually a lot of things that we didn't go into in this talk, and one of them is the role of generative models in out-of-distribution detection, detecting OOD inputs and so on. So the short answer is that in initial experiments we found some surprising results in making these ideas work, but we've now been making some progress, and I'll add some references.
A
I shared a slide in the Zoom Q&A summarizing some of our work from earlier this year, but we also have some recent work that has been trying to get to the bottom of this phenomenon, so maybe that could be useful for folks. And if we didn't answer your question, please ask it on the Slack channel or feel free to email us.
C
Oh, there is, yeah, maybe one last question that I think is actually interesting: in the infinite limit of training the model, would the UQ be reduced?
D
That's a great, and I think very loaded, question. So, you know, I guess in the theory on stochastic gradient descent, if you have an infinitely small, an infinitesimal, learning rate and you run forever, you will converge to an optimum, and I think the answer is yes: at that point you will probably have worse uncertainty.
D
There's, you know, a bunch of work studying things like early stopping. So if you hold out a validation set while you're training, you watch the training curve get better, but you also watch the validation error eventually start to get worse.
D
Beyond that, I guess certainly all of the MCMC literature would suggest that you should maybe keep training forever, but add noise to your model, so it's following a Markov chain through the loss manifold, or the posterior, effectively.
C
Okay, I think we had so many questions, and yeah, this was a great lecture. Thank you again, Balaji, Dustin, and Jasper. There's certainly also a lot of material to check after the lecture, and maybe we can look at the code and also look at some of these seminal papers. So thank you again for putting in all this effort preparing this great material and great lecture.
C
Yeah, thank you, and thanks to everyone for joining today's lecture. If you have more questions, please feel free to ask them on the Slack channel relevant to this lecture; I think Balaji is already there, so if you have questions, at least Balaji will be able to answer, and maybe Jasper and Dustin can chime in at some point. Hopefully the slides are up on the website, and I'll update them once Balaji has the links for more material.