From YouTube: ICLR 2020 Conference Recap - May 20 2020
Description
Lucas Souza does a “trip report” on the ICLR 2020 conference, which was held remotely. He focuses on papers related to neuroscience, deep learning theory, pruning and sparsity.
B
Okay, so this is a little trip report for ICLR 2020, and I'm going to talk a little bit about the conference first. So these are some of the pictures I took at the conference, in Ethiopia. I'm just joking, but that's actually how it was: it was fully online, but it was supposed to be in Ethiopia, which was a really nice move from the organizers.
B
Due
to
the
recent
visa
issues,
we
had
so
a
lot
of
researchers
from
africa.
They
were,
they
couldn't
get
into
iclear
or
new
rips
because
of
visa,
so
they
decided
to
move
the
conference
there,
which
is
nice,
but
it
was
online
anyway,
but
we
had
89
countries.
You
can
see
in
this
small
picture
over
here,
89
countries,
1400
speakers,
a
lot
of
chat
methods
and
video
watches.
So
I
I
didn't
watch
the
conference
live
so
that
week
we
were
quite
busy.
B
My general feedback is that it's not the same as going to a conference, mainly because when you're at a conference, I mean, it's very tiring. You get there at eight o'clock in the morning, you watch presentation after presentation all day long, and then at night you go to posters, but you're there, in that environment, and you just don't want to stop. You're seeing things and you're just motivated to keep going; I could do that for an entire month. But when you're doing this at home, it's very different: after two or three hours it just gets very boring, very tiring. So it's definitely not the same experience. When you're watching live, like I did with Neuromatch, it was a little bit better. So if you're going to an online conference, I'd recommend watching it live, because when you're not watching live you can stop and do whatever. I wasn't very motivated to do so; I had more stuff going on, and I'd rather code and watch the lectures later.
B
Okay, so they did this little map, which I think is very nice, with this grouping by topics, clustering by topics, and they had search capabilities: you could search by keyword or by author, and every paper had this small video. My feedback here is that the videos were usually very short; even the main talks were only like six or seven minutes. There was not a lot of information in them, and I was getting a lot more information from the reviews, from the discussions in the reviews, than I was getting from the presentations. The reviews are also very interesting to see; I mean, they can be pretty brutal at some points. So it was an interesting experience. I have three topics: one is neuroscience, not a lot of neuroscience, just whatever was there; then deep learning theory; and pruning and sparsity.
B
So, on neuroscience, there was this one-day workshop on bridging AI and cognitive science, and there were also a few papers in the overall conference, but it was not a lot. It was a lot less than at NeurIPS, where, especially last year, you could see a bigger focus on neuroscience.
B
So in this workshop the main topics being addressed were concept learning, causal reasoning, language acquisition, and learning from few data, which are general topics from cognitive science, and they had these open questions, which were the questions the papers were supposed to answer: which inductive biases do humans or animals use to support rapid learning? How can we share concepts across multiple domains? How can we have models of the world that are approximate and useful? How do memory limitations facilitate learning? And how should we represent other people's goals and intentions? These were the main open questions in the field, let's say. It was a nice workshop, long, like eight hours long, but I picked out a few papers from there that I thought would interest you. So this one...
B
We
reviewed
last
year
that
the
carlo
paper
and
then
there
are
several
follow-ups
where
he
correlates
convolution
neural
network
activations
with
neural
activity,
and
this
paper
it's
kind
of
a
continuation
on
that,
but
they
went
full
scale,
and
so
they
compare
narrow
recordings
with
over
50
different
architectures.
So
they
got
everything
because
they're
in
pythort
model
zoo
and
they
use
a
two-photon
calcium,
imaging
data
set
from
30
000
euros
in
the
mouse
visual
cortex
and
instead
of
just
doing
regular
object
classification,
they
compare
with
21
computer
vision
tests.
C
Lucas, can you remind me: what are they basically doing to make these comparisons? They have this huge amount of two-photon calcium imaging and they've got a network; what is the method by which they make the comparison?
C
It seems like a very open-ended idea, right? I mean, it's not clear to me; I guess I just don't understand it. It seems like there are so many ways you could go about that. You know, what's the animal doing? How are they characterizing the behaviors of a neuron? I mean, I don't know, it's confusing to me.
D
I mean, just making sure you get the basic paradigm: they show images to a mouse, and they show those images to a neural network. Then they look at whether you can find mappings between the neurons of the neural network and the calcium imaging data of the mouse being shown the same images.
C
Okay, so they're literally showing the kind of image data that we work with all the time, just flashed in front of the mouse's retina. Is that what they're doing?
E
So they don't train; there's no loss function on the neural networks to approximate the real neurons? Are they just running a bunch of them and seeing how close they come to mouse cortex?
D
Like, I know their previous study, where they used primate data, quite well, and what they do is look for linear mappings, where they can say: this neuron in primate V4 can be somewhat well approximated by taking a linear mapping of these five or six artificial neural network neurons. They search for these mappings, and if they can find a mapping, specifically a linear one, then they...
D
They
say
that,
like
that
means
that
these
neurons
and
this
artificial
layer
are
using
a
similar
basis
in
a
sense,
okay,
they're
they're
encoding
the
input
in
a
similar
basis,
so
they're
trying
to
figure
out
sort
of
like
what
is
the
basis
for
it
and
v4
and
for
these
static
images.
I
think
they're
they're
they're,
humble
about
the
fact.
They
know
that
they're
doing
static,
flashed
images
and
that
that's
a
limitation,
but
that's
that's
kind
of
the
general
idea.
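To make the linear-mapping idea concrete, here is a minimal sketch with entirely synthetic data (the actual study's fitting, cross-validation, and datasets are far more involved): ridge-regress one recorded neuron's responses onto a handful of ANN unit activations and score how much variance the mapping explains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 6 ANN-unit activations and one recorded neuron's
# responses to the same 200 stimuli.
n_stimuli, n_units = 200, 6
ann_acts = rng.normal(size=(n_stimuli, n_units))
neuron = ann_acts @ rng.normal(size=n_units) + 0.1 * rng.normal(size=n_stimuli)

# Ridge-regularized linear mapping: w = (X^T X + lam*I)^-1 X^T y
lam = 1e-2
w = np.linalg.solve(ann_acts.T @ ann_acts + lam * np.eye(n_units),
                    ann_acts.T @ neuron)

# Fraction of variance explained by the mapping (on the fitting data).
pred = ann_acts @ w
r2 = 1 - np.sum((neuron - pred) ** 2) / np.sum((neuron - neuron.mean()) ** 2)
```

If such a mapping explains a large fraction of the neuron's variance, the claim is that the ANN layer and the brain area share a similar basis.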
C
With rodents, we don't even know if these images are, in any remote sense, meaningful to them, right? I mean, we know that now we can classify these things; we don't know that the rodent can do that either.
C
It seems like a very brute-force approach, I guess, but it could still be very interesting. So, yeah.
C
You know, what can I do? Yeah, but you know, Mark is right: if you want to do primate studies, it's really, really difficult. I mean, there are just so many other layers of problems and issues, and you have to find the lab.
E
Isn't this problem kind of degenerate? These networks have, like, many... I don't know how many units they have, like many thousands, right, these ImageNet networks, and then, especially if you're doing a linear fit, you could find several different linear combinations of units that would randomly look like some real unit. I'm sure they do some statistics on that; I just don't really see why this is interesting.
D
I can give defenses for these things, because I've talked about their previous work, where they were really careful about that, about having the number of layers line up with specific layers in cortex. So there are answers to these questions. This is going to become a long discussion, though, if I keep answering.
A
In the context of primates it makes a little more sense to me. Even that I think is somewhat dubious, but at least it makes a little bit more sense than in the mouse.
C
Just getting past that: I need a refresher on the r-squared value, the variance explained, to know how significant these results are, what they found. When you go through that, Lucas, maybe you can just remind me what these are.
B
Yeah, so just to answer that question first: the paper says 65,000 neurons collected across the visual cortex of 221 awake adult mice, and the neural sample includes six areas of visual cortex and four cortical layers.
C
Yeah, they're probably just doing optical imaging on the surface; it only goes so deep. They can't reach the deeper layers, because the technique doesn't allow that.
A
Did they try... you said they did 220 mice; did they try to relate one mouse, or a bunch of mice, to other mice? I'm just wondering what the best you can do is.
B
All the information I have is kind of what's in the paper we're talking about here; they might release something bigger later, like a full conference paper or something like that, but that information's not there. Yeah, I don't know.
G
And they say clearly that convolutions and depth alone are not enough to explain what they're trying to explain. I'm going to say that I think we sort of jumped the gun and assumed that they were going to conclude there would be a high correlation, but their result is that there isn't one.
C
Is it possible they already had this data somehow, and someone said: hey, we already have this data, let's just crank it with a few lines of code and see what we get?
H
Hey, let's show all these mice these image datasets we use in machine learning. Which is an interesting question, why they did that, but anyway. Okay, so yes, I think the conclusion here is that there isn't a lot of correlation, and this technique may not be very successful.
E
Just to clarify: the r-squared is the square of the correlation coefficient, I think, which is easier for many people to picture. So an r-squared of 0.10 is about a 0.3 correlation, which means that if you plot your variables, x versus y, it will have roughly a 0.3 slope, basically.
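That relationship is easy to check numerically. A small sketch with synthetic data: for standardized variables, the regression slope equals the correlation coefficient, and an r-squared of 0.10 corresponds to r ≈ 0.32.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

x = rng.normal(size=n)
# Construct y so that corr(x, y) is roughly 0.3.
y = 0.3 * x + np.sqrt(1 - 0.3**2) * rng.normal(size=n)

r = np.corrcoef(x, y)[0, 1]       # correlation coefficient, ~0.3
r_squared = r ** 2                 # ~0.09, i.e. roughly the 0.10 quoted
slope = np.polyfit(x, y, 1)[0]     # for standardized variables, slope ~= r
```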
B
I haven't seen that yet. Just a quick reminder: I haven't spent a lot of time on each of these papers, like five minutes per presentation, and then I skimmed the paper, so I might not know the answers to all your questions. This paper is from the main conference, actually, not from the workshop.
B
So this is about the head direction system of the fruit fly, and they use RNNs to estimate the head direction by integrating the angular velocity. What they show is that the activity of some of the trained network's neurons is correlated with the activity of compass neurons and shifting neurons in the fruit fly. So what they're implying is that, by training a recurrent neural network on this task, you can recover the same types of neurons you'd find in an actual fruit fly.
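As a rough illustration of the integration task such an RNN is trained on (a hedged sketch, not the paper's actual setup): the input is an angular-velocity sequence and the target is the integrated heading, here encoded as (cos, sin), a common trick to keep the circular variable continuous.

```python
import numpy as np

rng = np.random.default_rng(2)

T, dt = 500, 0.1
ang_vel = rng.normal(scale=0.5, size=T)           # input: angular velocity
heading = np.cumsum(ang_vel * dt) % (2 * np.pi)   # target: integrated heading

# Encode the circular target as (cos, sin) so it stays continuous
# for the network to regress onto.
targets = np.stack([np.cos(heading), np.sin(heading)], axis=1)
```

An RNN trained to map `ang_vel` to `targets` has to maintain an internal heading estimate, which is where compass-like units can emerge.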
C
You know, look at that; that seems to be less of a stretch than the previous one. So, okay: this is a very simple network, it's just head direction cells; you're just trying to recreate a certain property that might actually be implemented in very few neurons in the brain. So, just in case you thought we were going to jump all over this one: I'm not. Okay!
B
Okay. I didn't think... I'm opening the presentation on another screen, because I had notes and the notes are not showing. Okay, cool. So this paper is a big paper; there are a lot of results in there. So this other one was in the workshop. I really liked it, just because I worked on robotics a little bit in the past. This was by Leslie Pack Kaelbling from MIT; she's a roboticist, and this was a very open and frank discussion. She approached the conference as a roboticist:
B
Like
I
don't
know
anything
about
neuroscience.
I
don't
know
anything
about
cognitive
science,
but
these
are
the
things
we've
learned
in
robotics
over
the
last
30
years,
and
that
was
a
really
nice
overall
presentation,
and
these
are
the
things
that
we
as
roboticists
want
to
learn
from
cognitive
scientists
that
can
help
us,
so
she
talks.
C
In the little diagram, it's interesting that she puts... where is it... she puts intelligent systems as sort of the outside of the box, like it's the least common thing. You know, I'm not sure what intelligent systems are in that diagram if they're not embedded in bodies or animals; it's like, what qualifies as intelligence?
C
You
know
some
people
would
say:
intelligence
is
sort
of
the
peak
of
of
something
you
know
peak
of
cognitive
ability,
and
this
sort
of
suggesting
intelligence
is
sort
of
the
opposite
of
that.
Am
I
interpreting
that
correctly.
C
Okay,
so
that's
an
important
it's
a
very
simple
idea:
yeah,
okay,
you
know
it's
interesting
because
I'm
dealing
with
this
issue
in
the
book-
and
you
know
trying
to
define
what
intelligence
is
and
there's
there's
various
people
with
different
ideas
about
this,
and
it's
interesting
to
see
this.
This
is
sort
of
almost
the
opposite
of
you
know
what
I
would
yeah,
what
most
people
might.
A
I think, going back to what Lucas said, the way to maybe interpret intelligent systems here might be as sort of passive stuff that's just getting data in and processing it; that's the most general thing. And then embedded systems are maybe embedded in the environment and maybe getting streaming data coming in.
C
Yeah,
it's
just
it's
just
interesting
that
I
think
most
lay
people
would
not.
They
could
understand
everything,
except
maybe
the
intelligent
part
being
outside.
Of
that.
You
know
it's
like
just
an
interesting
observation
that
I
think
it
grows
a
bit
counter
to
what
the
lay
person
thinks
about
what
intelligence.
H
I know. Okay, anyway, this is interesting.
B
Alright,
so
so
that
the
kind
of
questions
she
had
for
the
current
design,
so
this
was
her
workshop
presentation.
She
had
a
larger
presentation,
the
main
conference,
but
this
was
supposed
to
be.
Like
a
conversation,
I
mean
it
didn't
happen
as
it's
supposed
to
happen
because
of
you
know
remote,
but
the
question
she
she
asked
was:
what
kind
of
knowledge
are
innate,
so
what
sort
of
things
we
can
assume
and
use
as
inductive
files
in
our
models
in
robotics?
B
What corners can we safely cut? I mean, what can we ignore, what do we not need to worry about? Then, things we can learn from the brain, right: what kinds of modularity do we see in the brain that would be useful to replicate in robots? How do brains encode spatial information? That's the million-dollar question everyone is working on right now. What are the multiple scales and mechanisms of learning that we have in the brain? What are the mechanisms that animals use to stop repeating the same unsuccessful actions?
B
How
can
we
model
other
agents?
So
all
these
questions
are
questions
that
we
can
learn
from
neuroscience.
That
would
help
robotics.
So
there
are
the
questions
that
are
being
asked
at
the
same
time
in
this
book
in
these
two
disciplines
so
yeah
there
is
a
huge
gap
right
now
between
robotics
and
cognitive
science,
you
could
even
say
between
robotics
and
deep
learning
and
they're,
not
the
same
field
and
I've
done
robotics
in
the
past,
and
it's
very
different
for
machine
learning,
but
she's
trying
she's
trying
to
close
the
gap
right
just.
C
Or just... I don't know, I can share it, just share it on Slack or...
B
Okay, okay, cool. So the whole point of the trip report is showing these interesting papers, and then I put in the links and you can take a closer look. This last one I'm actually not going to talk about; I just thought it was interesting because, you know, you had that Friston paper, I think 2009 maybe, correct me, which is reinforcement learning, or active inference, and then a lot of people are working now on using active inference principles in reinforcement learning. So this...
B
This
paper
was
like
based
on
that
idea,
but
I
really
don't
want
to
go
down
the
active
inference
road
right
now,
it's
going
to
be
at
least
one
hour
discussion,
okay,
so
deep
learning
theory,
I
pick
up
three
papers
which
I
think
are
really
nice.
So
one
is
is
also
like
a
huge
effort
like
a
huge
work.
It's
called
fantastic
generalization
measures
and
where
to
find
them,
it's
a
reference
to
the
movie
fantastic
beasts
or
something
like
that,
like
the
harry
potter
movie.
B
So
what
they
did
is
they
evaluated
for
40
different
generalization
measures
over
more
than
2000
models,
simple
models
in
two
data
sets
and
the
idea
was
to
uncover
causal
relationships
between
each
generalization
measure
and
generalization
process,
so
they
wanted
to
see
which
ones
actually
had
some
relationship
with
generalization
and
they
had
some
very
interesting
findings
there.
B
One
of
them,
which
was
quite
surprising,
is
that
norm
based
measures.
They
failed
to
correlate
12
generalization.
Some
of
them
were
even
negatively
correlated
and
the
reason
why
it's
interesting
is
because
we,
the
way
we
force
our
networks
to
generalize
is
we
use
l2
regularization
right,
it's
essentially
a
norm
based
measure
and
what
they
are
showing.
Is
it's
not
actually
a
very
good.
B
It
doesn't
correlate
with
generalization
at
also
there
there's
better
things
we
should
be
using,
and
it
also
doesn't
mean
that,
just
because
one
measure
correlates
well
with
generalization
that
we
should
be
optimizing
for
it,
because
when
you
you
add
it
in
the
loss
function,
you
change,
you
completely
change
the
the
load
surface.
So
it's
gonna.
B
So
the
the
measures
that
were
better
at
correlating
with
generalization
were
sharpness
based,
so
sharpness
is
the
sensitivity
of
the
loss
over
the
entire
training
set
to
perturbations
and
model
parameters
and
optimization
based
ones
which
are
like
basically,
the
gradient
noise
and
the
speed
of
optimization.
So
this
is
a
sharpness
based
measures
are
more
difficult
to
calculate,
but
optimization
based
are
very
easy
to
calculate,
especially
when
you
have
access
to
to
how
gradients
are
being
calculated,
and
they
are
both
very
good
predictors
of
generalization,
and
I
think
this
this
this
was
a
massive
work.
B
Yeah, gradient noise is the variance, you know, the variance of the gradient between batches or between epochs: how much the gradient is...
B
...oscillating. And speed of optimization is how fast you're moving towards your local minimum, you could say. These are both measures you can take while you're optimizing, right. For the sharpness-based ones you can even take a static snapshot, and there are PAC-Bayesian bounds, but they are a little bit more difficult to calculate, because you have to apply a small perturbation and see how the loss changes. So yeah, sharpness.
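A toy sketch of the two optimization-based quantities just described, on a linear-regression stand-in for a network. The paper's exact definitions differ (its sharpness is a worst-case perturbation; random perturbations are used here only as a crude proxy).

```python
import numpy as np

rng = np.random.default_rng(3)

# Linear regression standing in for a network.
X = rng.normal(size=(512, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=512)
w = rng.normal(size=10)          # current parameters

def loss(w):
    return np.mean((X @ w - y) ** 2)

def minibatch_grad(w, idx):
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

# Gradient noise: spread of mini-batch gradients around their mean.
batches = np.array_split(rng.permutation(512), 16)
grads = np.stack([minibatch_grad(w, b) for b in batches])
grad_noise = grads.var(axis=0).sum()

# Crude sharpness proxy: how the loss changes under small random
# parameter perturbations (the paper uses a worst-case version).
eps = 0.01
deltas = rng.normal(size=(100, 10))
sharpness = np.mean([loss(w + eps * d) - loss(w) for d in deltas])
```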
B
Yeah, so that's the second point: the goal of the paper is just to measure the correlation of these measures with generalization, and it's the groundwork for other things, like: how can I use this to improve my training? It's not necessarily the case that just adding it to the loss function, like we do with the L2 norm, is going to improve things. Maybe it won't; maybe it will just change the loss surface in a way that makes it worse. But we can try to use it in other ways.
G
I was just going to ask, to bounce off the earlier questions: I imagine that for the gradient noise one there's an optimal range, sort of not too much and not too little. I don't know if you remember whether they mention anything like that?
B
No,
you
remember
that
I
don't
remember
they
mentioning,
because
they
actually
just
measure
correlation
right.
They
didn't
measure,
you
know
like
what.
What
is
it
going
to
improve
it
just
better
if
it
correlates
well
with
with
generalization
or
not,
and
it
does
correlate
well.
B
Oh, okay, did they quantify it? I would have to check, but I don't know. I've assumed that small gradient noise means better generalization, but I might be wrong; I think that's the fair assumption. Let's see.
B
So all the frameworks these days calculate the backward pass, but they make it very hard to calculate extra measures; you can do it yourself, but it's not going to be efficient. So what these guys did is build this package, and they did it in an efficient way, and it's very easy to use as well.
B
So
if
you
want
like
the
variance
of
the
gradient,
for
example,
you
can
just
this
is
how
you
use
it
and
so
this
package,
let
me
see
what
they
have,
so
they
what
they
have
right
now,
it's
individual
gradients
of
the
mini
batch
estimates
of
the
variance
or
the
second
moment,
and
they
also
have
approximates
of
second
order.
Information
like
here
the
fischer
matrix
and
yeah
so
field
stuff.
It
can
be
very
useful.
For
example,
you
usually
use
this
approximates
of
the
second
order
in
continuous
learning
about
the.
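The quantities mentioned (individual per-sample gradients, their variance, the second moment) can be illustrated by brute force on a toy linear model. This is not the package's actual API, just the arithmetic such a package computes efficiently inside one backward pass.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy linear model with squared loss; one gradient per sample.
X = rng.normal(size=(32, 5))
y = rng.normal(size=32)
w = rng.normal(size=5)

# g_i = 2 * x_i * (x_i . w - y_i), stacked into a (32, 5) array.
per_sample_grads = 2 * X * (X @ w - y)[:, None]

mean_grad = per_sample_grads.mean(axis=0)        # the usual batch gradient
second_moment = (per_sample_grads ** 2).mean(axis=0)
variance = second_moment - mean_grad ** 2        # per-parameter variance
```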
B
Okay, okay, it's not Facebook, okay. And this third one is actually quite interesting; there's a finding here that's kind of contradictory to what we think. What they show is that the early phase of training neural networks is crucial for final performance, and they show that there is a break-even point.
B
So
if
you
use
a
very
large
learning
rate
or
if
you
use
versus
using
a
small
learning
rate,
you're
going
to
go
in
different
directions
in
the
low
surface
and
what's
going
to
change,
is
that
if
you
go
in
one
direction,
you're
going
to
sgd
is
going
to
implicitly
regularize
your
network,
while
you're
going
the
other
direction
you're
going
to
go
to
a
a
bad
bad
loss
surface,
you
could
say
that
so
what
they're
showing
here
so
here
is
the
spectral.
I
think
it's
the
spectral
norm
of
the
hessian
which
shown
here.
B
If
you
take
a
smaller
numerator
you're
going
to
go
to
the
left
right,
so
here
are
two
different
examples,
and
so
the
contradictory
finding
is
that
if
you
just
increase
your
bet
size
for
example,
right
so
you're,
actually
reducing
your
effective
learning
rate,
you
get
a
larger
variance
of
the
grade.
So
you
would
expect
that
if
you
increase
the
bad
size,
you
get
the
smaller
variance
of
the.
B
You generate the adversarial examples using gradients that you get through backpropagation, and one simple defense is to obfuscate the gradient somehow. Then, to counter that, you can use what's called Backward Pass Differentiable Approximation: you find an approximation g(x) of f(x), a different function which is differentiable, so you get around the obfuscated gradient. And k-winners lets you counter...
C
Which one are we talking about here? Is this assuming that the attacker knows this or not?
C
You
have
access
yeah
yeah,
just
as
a
general
question.
Is
that
considered
an
important
problem
these
days,
given
that
you
can
easily
obscure
the
internals
of
a
network
or
is
it?
Is
it
considered
still
a
significant
problem.
B
Well,
it
is,
it
is
important,
probably
because
you
can,
I
mean
the
networks.
Everyone
is
using.
It's
almost
the
same
they're
training,
the
same
data
set.
So
it's
not
it's
not
hard
to
replicate
a
work
at
all
right.
So
there's
no
such
thing
as
a
true
black
box
attack
unless
you're
using
like
a
very,
very
widely
different
network
uranium,
like
some.
C
Topic
here,
but
wouldn't
it
be
possible
to
just
train
your
network
a
little
differently,
maybe
in
a
different
order
or
using
slightly
different
data
set
or
something
like
that
and
with
that
then
make
it
a
black
box.
B
Well,
it
will
make
a
black
box
per
se,
would
make
it
harder,
but
but
still,
but
still
you
can.
B
There are different classes of problems there, right: black box and white box. What you're asking is whether white box is relevant; I think it still is, because somehow you can get closer to it. Even if you don't get all the way, you could still easily get a model which is really close, and then it just becomes a white box, I think. Okay.
B
So the idea behind what they call k-winners is that it's a non-differentiable function that can hardly be approximated by smooth functions. They show the loss surface here; the epsilon is just small perturbations, and you can see how the loss surface changes. They show that the k-winners-take-all network has this very weird, jagged loss surface, and the discontinuities in the k-winners-take-all network can prevent gradient-based search for adversarial samples, while at the same time not hurting training.
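A minimal sketch of a k-winners-take-all activation (the paper's exact variant may differ, for example in how k is chosen per layer):

```python
import numpy as np

def k_winners_take_all(x, k):
    """Keep the k largest activations in each row and zero the rest.
    The output jumps when the k-th and (k+1)-th activations swap rank,
    which is the discontinuity that frustrates gradient-based attacks."""
    out = np.zeros_like(x)
    idx = np.argpartition(x, -k, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(x, idx, axis=-1), axis=-1)
    return out

acts = np.array([[0.1, 0.9, -0.3, 0.5],
                 [1.2, -0.7, 0.4, 0.0]])
sparse_acts = k_winners_take_all(acts, k=2)
# Each row keeps only its two largest values.
```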
B
That's
that's
a
claim
in
the
paper,
and
they
show
that
the
robustness
of
these
networks
they
outperform
other
traditional
networks
under
white
box
attacks.
B
I
thought
this
was
a
very
interesting
addition
to
the
work
we've
already
done
on
robustness.
We
mainly
used
white
knives,
but
we
thought
about
it
for
zero
attacks.
A
lot
and
this
paper
actually
goes
goes
ahead
and
shows
it
and
also
shows
it.
Theoretically.
So
there's
all
the
the
map
is
there.
C
Doesn't this go back to the idea that I first heard from Surya Ganguli, which is that it looks like there might be a lot of local minima, but in high-dimensional spaces there really aren't? It can look as bad as you want, but there's always going to be a gradient that will take you where you want to go.
A
Yeah
yeah,
I
think,
it'll
look.
Potentially
it
could
look
much
better
in
high
dimensions
than.
B
I
think
maybe
we
can
even
do
like.
Maybe
a
journal
club
on
that
one
day
might
be
useful.
B
Okay,
so
this
next
paper
we
actually
did
the
journal
club
on
it
last
october,
at
the
time,
was
archive
and
then
open
review.
Now
it
was
an
iclear
and
I
included
here
because
it's
something
I
think
we
can
actually
use
even
in
the
current
dynamic
sparsity
models,
we
we
have
so
what
they
do
here
is
they
have
this
dynamic,
sparse
model
which
is
like
magnitude-based
pruning?
B
It's
very
similar
to
the
one
ching
did
you
know
ching
from
cerebral
we
talk
about,
but
he
adds
this
extra
extra
thing
here
is
that
he
keeps
true
network
right.
He
he
keeps
the
the
prune
weights
the
mask
and
the
regular
weights
and
what
he
does
at
every
step
at
every
epoch.
He
computes
the
mask
based
on
the
actual
weights,
and
then
he
computes
the
gradients
based
on
the
prune
network
and
updates
just
the
way
it
baits
on
the
prune
network.
B
So
the
next
time
he
prunes
he's
gonna
prune
on
the
full
weight.
So
if
there
is
any
weights
that
has
like
a
large
change
and
then
they
they're
going
to
be
included
back
in
the
mask,
so
he
always
keeps
a
copy
of
the
full
weights
and
they
can
go
back
and
forth.
They
can
be
removed
from
the
mask
or
be
included
in
the
mask
right,
so
the
gradients
are
always
calculated
only
based
on
the
prune
mask,
but
the
updates
are
done
to
the
full
weights
and
the
mask
is
always
calculated
on
the
full
weights.
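A hedged sketch of that loop on a toy problem (the toy task and names are mine, not the paper's): the mask is recomputed from the full weights, the gradient is taken at the pruned weights, and the update goes to the full weights, so pruned-away weights can re-enter the mask later.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy regression task with a sparse ground truth.
X = rng.normal(size=(256, 20))
w_true = rng.normal(size=20) * (rng.random(20) < 0.3)
y = X @ w_true

w_full = 0.1 * rng.normal(size=20)   # dense weights, always kept around
sparsity, lr = 0.5, 0.01

def magnitude_mask(w, sparsity):
    # Keep the largest-magnitude half of the weights.
    k = int(sparsity * w.size)
    return np.abs(w) >= np.sort(np.abs(w))[k]

mask = magnitude_mask(w_full, sparsity)
init_loss = np.mean((X @ (w_full * mask) - y) ** 2)

for step in range(300):
    # 1. Mask is recomputed from the *full* weights every step.
    mask = magnitude_mask(w_full, sparsity)
    # 2. Gradient of the loss evaluated at the *pruned* weights.
    w_pruned = w_full * mask
    grad = 2 * X.T @ (X @ w_pruned - y) / len(y)
    # 3. Update the *full* weights, so pruned-away weights keep
    #    receiving gradient and can re-enter the mask later.
    w_full -= lr * grad

final_loss = np.mean((X @ (w_full * mask) - y) ** 2)
```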
B
So
this
is
a
technique
that
could
be
included
in
any
dynamics.
Varsity
work,
we
do,
it
will
be
additive
to
it
and
they
actually
showed
really
good
results.
So
they
showed
improvements
over
other
techniques.
For
example,
he
used
the
same
technique
that
chin
used,
but
using
this
these
two
like
keeping
a
copy
of
the
regular
weights
and
he
can
get
improvement
over
that.
So.
I
Just... what does the hysteresis look like for that? I mean, how much retention actually goes on when they use this scheme, compared to weights going out and being removed again on the next pass?
I
Okay, hysteresis basically means: if the system were highly reactive, it would immediately flip, but with hysteresis you hang on to the previous values for some period of time before it actually switches over.
B
Okay, I don't know; it's a good question. I'd definitely like to know, but...
B
Don't
know
if
they
did
this
kind
of
analysis
in
the
paper.
I
don't
remember
that,
but
yeah
it
would
be
good
to
see
right.
So
maybe
we
can
even
incorporate
this
in
regard.
I
don't
know,
michael
andre,
you
can
look
at
it
later.
G
Yeah, that's what I was thinking about while you were talking about it. I think it should be orthogonal to RigL, but I just want to be sure about that. Well, maybe.
B
Yeah, okay, all right. So, it's eleven o'clock; okay, so the next one, maybe something we can also use. There is this paper from 2019 called SNIP; I don't know if we discussed it. The idea is that you can do the pruning prior to training: you can adjust your network in such a way...
B
So
it's
going
to
be
easier
to
train,
so
what
this
work
does
in
addition
to
that,
so
it's
need
to
something
that's
been
out
there
we
can
use.
So
the
idea
here
is
that
you
could.
B
Okay,
so
what
he's
doing
here
is
that
for
all
layers
he's
trying
to
minimize
so
here's
the
forbidden's
norm-
and
here
the
c
is
the
mask:
w-
is
the
weight
so
there's
an
inner
product
minus
the
thing
the
forbidden
norm
of
this,
and
the
idea
is
that
you
make
a
you.
Try
to
minimize
this
in
order
to
improve
yeah,
improve
the
your
topology
and
make
it
more
trainable.
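The objective as described can be sketched like this (the paper's exact formulation may differ in details such as normalization): an orthogonal layer has zero penalty, and pruning it breaks the isometry.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 64

def isometry_penalty(C, W):
    # || (C * W)(C * W)^T - I ||_F for mask C and weights W,
    # where * is the elementwise product.
    M = C * W
    return np.linalg.norm(M @ M.T - np.eye(M.shape[0]), ord='fro')

# An orthogonal weight matrix is a perfect isometry: penalty ~ 0.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
full_penalty = isometry_penalty(np.ones((n, n)), Q)

# Pruning half of its entries breaks the isometry: penalty grows.
C = (rng.random((n, n)) < 0.5).astype(float)
pruned_penalty = isometry_penalty(C, Q)
```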
B
I
I
had
this
on
my
head
in
monday,
but
I
actually
forgot
to
do
this,
but
what
this
paper
is
pointing
to
is
that
there
is
future
work
where
he
wants
to
incorporate
this
into
the
training
itself.
So
the
idea
is
that
every
step
when
you're,
using
sparse
networks
or
proof
networks,
you
could
at
the
same
time
be
trying
to
improve
your
signal
propagation
because,
as
you
prune,
it
you're
you're,
going
to
make
it
first
right,
you're
going
to
make
you're
going
to
break
this
dynamic
zombie
tree.
B
Even
if
you
ensure
that
initialization,
as
you
prune
you're,
going
to
remove
some
weights
you're
going
to
add
others
you're
going
to
break
this
and
you
could
keep
trying
to
restore
signal
propagation
at
every
apple.
So
this
is
the
idea
he's
pointing
to
it's,
not
where
this
paper,
what
this
paper
did.
Yet
he
what
he
did
is
just
did
this
at
initialization,
but
he
includes
there
as
future
work
and
next
step
as
to
use
this
during
training
itself.
B
So
the
reason
I
included
this
here
is
because
of
it's
something
we
can
do
right.
We
can
use
some
sort
of.
We
can
do
this
analysis
at
initialization
and
even
if
we're
dynamically
pruning,
we
can
make
sure
that
when
we
start
training
we
already
have
the
best
network
we
could
have
in
the
perspective
of
signal
propagation
and
then
at
some
steps,
some
bad
books.
We
could
recheck
this
and
we
could
adjust
our
weight
somehow
our
connection,
somehow
as
to
ensure
that
we
still
have
a
network
in
which
the
signal
propagation
works
well,.
A
I thought they do that, though.
G
Like, I thought they sort of... I think they end up with some term that's layer to layer, but I think they derive it considering the whole network.
I
Because I was thinking: you basically help the signal propagation between two layers, but for whatever reason you prune things so that it gets blocked at the next layer. So it seemed like there would be some disconnect there, unless you had some overall guidance as to where the salient paths are through the whole network.
B
Didn't
maybe
I
didn't
walk
through
the
full
derivation
until
then
and,
like
michelangelo
said,
they're,
just
compounding
these
things
and
I'm
doing
it
end-to-end,
I'm
not
sure.
Actually,
because
I
didn't
went
that
deep
into
this
paper.
It's
like
last
minute
edition.
B
So the idea is that at some point you're just going to find a ticket, and it's not going to change much even if you keep training, so you can just stop there. And it's a very simple technique, actually: they apply the pruning at each step. So they calculate the mask at each step, and then they keep it in a queue, and they calculate the mask distance between the current network and the last networks, and if, after a few steps, that mask didn't change a lot, then you just return it.
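A minimal sketch of that early-stopping criterion (names and thresholds are my own; the actual Early-Bird implementation differs in details): compute a magnitude-based mask each epoch, keep the last few masks in a queue, and stop once the normalized Hamming distance to all of them falls below a small epsilon.

```python
from collections import deque
import numpy as np

def magnitude_mask(w, keep_ratio):
    """Binary mask keeping the top `keep_ratio` fraction of weights by |w|."""
    k = int(w.size * (1.0 - keep_ratio))
    thresh = np.sort(np.abs(w), axis=None)[k - 1] if k > 0 else -np.inf
    return np.abs(w) > thresh

def mask_distance(m1, m2):
    """Normalized Hamming distance between two binary masks."""
    return float(np.mean(m1 != m2))

def found_ticket(mask_queue, new_mask, eps=0.02):
    """Declare the ticket found when the new mask is within `eps` of every
    mask in the (full) queue of recent masks."""
    return len(mask_queue) == mask_queue.maxlen and all(
        mask_distance(m, new_mask) < eps for m in mask_queue
    )

# Toy run: weight updates shrink over "epochs", so the mask stabilizes
# and the loop stops well before the full 50 epochs.
rng = np.random.default_rng(1)
w = rng.normal(size=1000)
queue = deque(maxlen=5)
stop_epoch = None
for epoch in range(50):
    w += rng.normal(scale=1.0 / (epoch + 1) ** 2, size=w.shape)
    m = magnitude_mask(w, keep_ratio=0.3)
    if found_ticket(queue, m):
        stop_epoch = epoch
        break
    queue.append(m)
```

Stopping as soon as the mask stabilizes is exactly what lets you skip the rest of the training run, which is where the compute savings mentioned next come from.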
B
That
means
it's
not
going
to
change
a
lot,
even
if
you
keep
on
training.
So
it's
a
simple
idea,
but
just
by
doing
that
it
could
replicate
the
results
of
the
lottery
ticket
using
six
to
ten
percent
of
of
the
calculation
they
use
in
the
original
paper
with
the
simple
idea,
and
they
also
showed
that
this
works
even
under
low
cost
training
scheme.
So
if
you
use
low
precision,
precision
training,
for
example,
just
eight
bits-
training
which
are
even
faster,
you
can
still
find
the
same
mass,
so
you
can
make
it
even
faster.
B
So
it's
very
empirical
work,
but
they
got
good
results
and
the
next
one
is
by
the
same
motors,
actually
frank
on
carbon
plus
render
who's
the
first
author
and
what
they
did.
They
they
extended
the
idea
of
the
lottery
tickets.
The
lottery
ticket
says
you
could
rewind
the
ways
to
their
original
values
and
what
they
investigated
here
is
they
tried
rewinding
their
weights
to
intermediate
values
as
well.
They
showed
that
it's
better,
you
don't
have
to
rewind
it
all
the
way
you
can
rewind
it
just
a
little
bit
so
they
introduced.
B
this notion of, you know, not resetting the weights but just rewinding them a little bit. And then they also tested just rewinding the learning rate instead of the weights, so you just go back in your learning rate schedule a few steps. And they showed in the end (the orange curves are learning rate rewinding and the blue ones are weight rewinding) that if you just rewind the learning rate, it's enough, so you don't even have to rewind the weights.
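One way to see the three retraining strategies side by side (a schematic sketch; the function names and the schedule are mine, not from the paper): each strategy is defined by which weights you keep after pruning and which part of the learning-rate schedule you replay.

```python
def finetune(final_w, ckpt_w, lr_schedule, k):
    """Keep the final weights; keep training at the last, small learning rate."""
    steps = len(lr_schedule) - k
    return final_w, [lr_schedule[-1]] * steps

def weight_rewind(final_w, ckpt_w, lr_schedule, k):
    """Lottery-ticket style: reset surviving weights to the step-k checkpoint
    and replay the schedule from step k onward."""
    return ckpt_w, lr_schedule[k:]

def lr_rewind(final_w, ckpt_w, lr_schedule, k):
    """Keep the final weights, but replay the learning-rate schedule from
    step k anyway; no weight checkpoint is needed."""
    return final_w, lr_schedule[k:]

# A 9-step schedule with two decays; "rewind" to step 3 brings back the
# higher 0.01 learning rate that fine-tuning never revisits.
schedule = [0.1] * 3 + [0.01] * 3 + [0.001] * 3
```

The contrast the speakers draw is visible here: fine-tuning stays at the final small rate, while both rewinding variants bring the higher learning rate back, and learning rate rewinding does so without storing any earlier weights.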
B
Is different from the fine-tuning. So the worst one they're comparing against is fine-tuning, because what they did in pruning before is that after you prune, you keep training, but you keep training with smaller learning rates. I mean, the idea is that you just fine-tune the network: you had a good network, but you made some changes, and then you fine-tune it. But they
A
They show here, actually, that if you increase the learning rate again and go back down...
G
Sorry, does it show that you effectively need a smaller number of epochs in the retraining when you just rewind the learning rate, as opposed to rewinding the weights? I presume that would be the case. Like, let's say before, if you rewind the weights, maybe you would take like 100 epochs to get back to that original accuracy; but if you rewind the learning rate, does it take fewer epochs?
B
I'm not sure. Michelangelo, maybe?
B
I actually don't think so. If you look at the pruning algorithm here, they train for the original training time. So: train to completion, prune the 20% lowest-magnitude weights, retrain using learning rate rewinding for the original training time, and they just keep doing this until you get to the sparsity that you want. But...
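The loop he reads out can be sketched roughly like this (a toy NumPy sketch; `toy_train` is a stand-in for a real training run, and all names are mine):

```python
import numpy as np

def toy_train(w, mask, lr_schedule, rng):
    """Stand-in for a real training run: nudge surviving weights each step."""
    for lr in lr_schedule:
        w = (w + lr * rng.normal(size=w.shape)) * mask
    return w

def prune_lowest(w, mask, frac=0.2):
    """Zero out the `frac` fraction of smallest-magnitude surviving weights."""
    alive = np.flatnonzero(mask)
    drop = alive[np.argsort(np.abs(w[alive]))[: int(len(alive) * frac)]]
    new_mask = mask.copy()
    new_mask[drop] = 0.0
    return new_mask

def iterative_lr_rewinding(w, lr_schedule, target_sparsity, rng, rewind_step=0):
    """Train to completion, prune 20% of the remaining weights, retrain with
    the learning-rate schedule replayed from `rewind_step`, and repeat until
    the target sparsity is reached."""
    mask = np.ones_like(w)
    w = toy_train(w, mask, lr_schedule, rng)  # original training run
    while mask.mean() > 1.0 - target_sparsity:
        mask = prune_lowest(w, mask, frac=0.2)
        w = toy_train(w * mask, mask, lr_schedule[rewind_step:], rng)
    return w, mask

rng = np.random.default_rng(2)
schedule = [0.1] * 5 + [0.01] * 5
w_final, m_final = iterative_lr_rewinding(
    rng.normal(size=500), schedule, target_sparsity=0.8, rng=rng, rewind_step=2
)
```

Each retraining pass runs for the full (replayed) schedule, which matches the point being made: per round it is not cheaper in epochs, it just removes the need to store and restore early weight checkpoints.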
B
All right. So I thought I had to mention the lottery ticket just because, you know, it was the best paper last year. It had a lot of issues, but it also brought some new perspective on the pruning
B
approach. I had a "cool random stuff" section, which, yeah, I thought I wouldn't have time for, and I was right, so I'll just talk very quickly about it. There are some papers on using deep learning for mathematics, and especially, I think, the one I liked was Deep Learning for Symbolic Mathematics; they were using it for things like symbolic integration.
B
They
had
this
bold
paper
called,
can
quentin
algorithms
for
deep
conclusion
or
networks
where
they
showed
a
possible
approach
that
could
be
used.
I
say
this
with
care
because
we
know
that
you
know
there
is
no
algorithm
right
now
that
we
know
of
that
would
work
better
in
quantum
computers
than
regular
architecture
architectures,
but
they
they
show,
they
show
a
step
towards
it,
and
it
was
interesting
to
see
that
someone
is
working
on
the
problem
and
yeah
there's
some
other
papers
that
won't
go
there,
so
yeah
that's!
B
Maybe from the number of authors per paper; I'd guess like 300 to 400 papers. That's my guess!
C
I mean, it's interesting what you say. Yeah, I understand what you're saying: when you're in a conference and you're physically there, you're sort of forced to keep going at it. But on the other hand, I find, when there's so much data, you don't know how to sort through it.
D
I think another factor is that when you go to a conference, you put it on your calendar, you are out of the office, you are mentally engaged in that conference. No matter what search process you're using, having that blocked-off time, where you really are mentally there, is so valuable.
C
Yeah, I agree. It's like: okay, I'm dedicating 100% of my time, versus the time I'm spending might be more fruitful in an offline search; there's a balance between those two, I agree with that. And sometimes you sit through a presentation and you just don't think it's got any relevance at all, and then all of a sudden, you know, sort of at the end of it, it's like: holy crap, look at that, that was a great idea, I'm glad I heard that. It's just interesting how difficult it is.
C
I'm just going through... so, you saw it: I sent out that paper this last week, or this weekend or something, about the layer six paper. Remember that? Yeah, and it was like a layer six anatomy paper, and I was going through it and there were like dozens and dozens of references on layer six. My head was swimming; it's like, oh my god, these are all new references, they're all new papers, so many different things, and it just makes...
C
Sometimes it's just overwhelming and you just have to sort of give up, in some sense: at the moment I don't know where to begin here; I'll have to come back another day. It's just an interesting problem we have: there's so much data, both on the neuroscience side and on the machine learning side, that it's difficult to figure out the big picture of something.
B
Anyway, that's a good summary, thank you. I think there's a third component there, which is: I used to run, I used to run half marathons, and when I was training for it, I was training every day, and I could run like max five miles, six miles, and I was just exhausted, because I was training by myself. But when I was at the marathon, you know, I could just keep running past the finish.
A
The question I have for, you know, these online things is: can they reproduce the social aspects of it? And, you know, all the different ways... What I find going to these conferences is that the talks that I have with people, those are as valuable as listening to the papers themselves. And, you know, can they really effectively reproduce these kinds of social and, you know, in-person communication aspects of it?
C
Well, I thought you said earlier they had like separate chat sessions for each poster. Is that right? So...