From YouTube: 08 - Deep Learning Reproducibility - Jessica Forde
Description
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
So first up we have Jessica Forde. Jessica is currently at Facebook AI Research, and she's also a member of Project Jupyter. Her research focuses on the intersection between machine learning and open science, which is very relevant to what we're talking about in this school on Jupyter. She has built open-source tools for reproducibility, a very big open challenge in any kind of applied machine learning or data analytics. She has worked as a machine learning researcher and data scientist in a variety of applications, including healthcare, energy, and human capital.
I'm not necessarily committed to going through these slides in this exact format. This is a small enough group that I'm totally open to you asking me questions, so please raise your hand; we can go backwards and forwards. This is a talk I've given in the past, and I want it to serve your questions and needs.

We can keep this as casual as we want it to be. More broadly, I think reproducibility is a really important question in the sciences in general. There's a lot of really interesting work that has been done in the other sciences and in open source on how we do reproducibility, and machine learning has been thinking about this as well; my work spans both.

This is some of the work that I've been doing with colleagues at other institutions such as Harvard, at Facebook AI Research, and in Project Jupyter, and this is a fairly general discussion of the area. So please, honestly, raise your hand, stop me, and ask questions. It's not too technical a talk, so it's totally up for discussion. I didn't set my machine to not freeze, so I apologize in advance.
One of the challenges we face in reproducibility in research starts with a simple question: how do we know what the state of the art is? State of the art is particularly important in machine learning research, because it is generally how we establish new baselines and new methods of particular novelty. It sounds like a pretty straightforward question, but it's a little more complicated, as you will see, in terms of how I think about state of the art.

There are tools for establishing the state of the art on various benchmark problems, often associated with particular data sets, and I generally think this is a pretty good resource. Here is one example: many people are familiar with ImageNet as a data set, but it was also a challenge problem, a really important challenge that established a lot of new developments in image classification. Who has worked with ImageNet data? Okay, a few people; definitely look into it.

I think it's a really important resource for the fundamental problems in computer vision classification. This is another example, from colleagues of mine at Stanford: DAWNBench is a benchmarking library that they work on. CodaLab is a tool for reproducible research and for competitions; a lot of NLP work is being done in CodaLab, and it's maintained by people at Microsoft. So it sounds pretty straightforward, and this is typically what people do.
They put up a competition and a standard data set, and the task should seem relatively straightforward. But if you actually look into the machine learning literature, there are a lot of really interesting papers that have come out examining this, in RL, on GANs, and elsewhere. Are GANs Created Equal? is one that focused on GANs; Attention Is All You Need is an NLP paper; On the State of the Art of Evaluation in Neural Language Models, which compares methods via an independent reimplementation, is another language paper; and the Deep Bayesian Bandits Showdown is another paper, on bandits.
Additionally, there are problems in terms of generalization in machine learning. When we think about generalization, we often think about how the model performs after the fact, on additional data that might be collected later or drawn from a different data set. This is a paper from people down the road at UC Berkeley looking at CIFAR-10 and ImageNet data; they take models pre-trained on ImageNet or CIFAR-10.

These are standard data sets in the machine learning image classification literature. The authors collect new data, trying to keep the collection methodology as faithful as possible to the originals, and they find that the performance isn't the same: the accuracy ends up degrading, and we don't get the performance we would expect on these classes when the data set is collected at a different time. This suggests that learning isn't necessarily as generalizable as we hope.

This is a paper in the reinforcement learning literature about how many seeds it takes to establish a result; this is one where they changed the domain and added additional noise. Generalization doesn't necessarily work when you change the actual input to your reinforcement learning model: the models aren't robust to these kinds of interference.

This is another paper in the reinforcement learning literature, in which the dynamics of the stimuli are changed. In this case, if you change how the dynamics of the game of Amidar work, such that the simulator changes in some way, the agent doesn't necessarily retain the ability to play Amidar as well. So these questions around generalization have real-world impact.

We will be hearing from Emily in an hour or so, and she will discuss this in more detail, but this is some research from colleagues of mine at Facebook, and these are images of athletes. As you can see, you can kind of guess which sports they play: the people on the left are soccer players, the people in the middle are hockey players, I think the people on the right are weightlifters, and you have some boxers and again some soccer players. But the predicted classes are wrong, right?
This is a really well-cited study of commercially available implementations of facial recognition, the Gender Shades paper; I strongly recommend it. The study looks at how commercially available image recognition software performs on people relative to their gender and skin color, and finds that women of color are often misclassified at higher rates than the rest of the faces in a dataset. And this is a paper with colleagues of mine on hospital data.

Should we be using these black-box deployments, given their challenges, in situations in which the decisions really do matter? Well, this is the interesting case. At least in the fairness literature, people have argued that part of this problem is even just how it is measured: when people think only about accuracy, it belies the distinctions in the distribution. In these examples you might say the performance is pretty good, and we aren't necessarily inspecting the errors, but the errors are not uniformly distributed. They are associated with certain cases, and that has certain implications. So even if there is better-than-human performance in a certain domain, it doesn't necessarily imply the performance is as good as a human in general; superhuman performance overall might not imply superhuman performance in particular cases. Does that make sense? For example, there's recent work on various games, and reinforcement learning is a really interesting domain here.
What I'm trying to argue is that part of the problem is how we're framing evaluation. When we say that for a particular classification task this is the accuracy of people, and we are exceeding the performance of people, that doesn't necessarily mean the result will hold consistently.

So even if you achieve superhuman performance on a given data set, there is evidence suggesting that this will not hold consistently for images of that type collected in general. For example, you might get really good performance on digit classification, but that performance does not necessarily hold when you introduce new images of handwritten digits that were not collected at the same time. Slight distribution shifts do not necessarily preserve superhuman performance.
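As a minimal, hypothetical sketch of that effect (my own illustration, not the experiment from the paper, and assuming scikit-learn and SciPy are installed), here is how even a tiny artificial shift can erode the test accuracy of a digit classifier:

```python
# A small sketch of distribution shift: train a digit classifier, then evaluate it
# on test images shifted by a single pixel, a crude stand-in for "digits
# collected at a different time".
import numpy as np
from scipy.ndimage import shift
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("accuracy on the original test split:", clf.score(X_test, y_test))

# Shift every 8x8 test image one pixel to the right and re-evaluate.
X_shifted = np.stack([shift(img.reshape(8, 8), (0, 1)).ravel() for img in X_test])
print("accuracy after a one-pixel shift:   ", clf.score(X_shifted, y_test))
```

In a quick run of a sketch like this, the second number typically drops noticeably, even though a human would read the shifted digits just as easily.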
B
No
I'm,
saying
that
some
researchers
have
argued
that
this
is
that
the
that
this
might
that
we
might
not
even
be
making
the
right
decision
in
deploying
back
box
models
for
high-stakes
decisions.
That's
what
this
author
is
arguing:
okay,
any
more
questions
law
in
here;
okay,
so
some
responses
from
the
scientific
community
and
machine
learning.
This
is
who
has
seen
this
talk
from
nerves.
B
It's
a
very
good
talk,
so
this
is
the
test
of
time.
This
is
a
text
of
a
test
of
time.
Talk
from
neuropathy
2017,
a
really
great
talk
by
Ali
Rahimi
for
a
paper
that
him
and
Ben
Rexroat
he's
been
America's.
Also,
the
author
of
this
talk
really
great
talk
strongly
recommend
it,
which
reminded
the
community
of
the
importance
of
experimental
rigor
and
in
rigor
in
terms
of
theory,
rigor
and
evaluation.
In response to this, Yann LeCun typically argues that theory follows invention, and so it might not necessarily be a concern that we don't have the theory, because in many scientific domains the theoretical understanding of the phenomena we observe is documented after the fact. This is an example, Boyle's air pump, and it's the kind of thing that Yann cites.

But I think, more fundamentally, at least for machine learning research, the question for machine learning researchers and for people implementing machine learning models is ultimately: are we scientists, or are we engineers? Are we fundamentally interested in building something and getting it to work, or in building something in order to create greater understanding? In terms of that question, this has been debated for a while.

This is a quote from the machine learning literature from 1988, and I think it really suggests that people have been thinking about this problem for a while. I think that, fundamentally, we really need to understand, when we create a machine learning model, how it operates and why the behavior of these models is what it is. Interestingly enough, and again, I think we should all watch that talk because it's really useful, Boyle also happened to be an alchemist.

So when we talk about reproducibility, one of the things people often cite is Boyle's air pump, and Boyle also happened to be an alchemist. But when trying to square this problem, I think that regardless of whether we focus on the engineering side or the science side, ultimately, when we are doing science with computation, and you all probably know this better than I do, we are computational scientists.

We are scientists who ultimately use computation to try to understand the problem we're trying to solve, and the computation serves the broader need of trying to gain knowledge. Ultimately, then, as all of you are, we are all scientists who write software. So let me dig into this idea a little more deeply than the machine learning community usually does.
In that case, if we're thinking about reproducibility in the sciences from a computational science perspective, then reproducibility is ultimately a software problem. There are various definitions of reproducibility; people talk about reproducibility versus replicability versus repeatability. I am generally not one of those people who wants to be very specific about the definitions; I want to focus more on what the problems are, in the fields in general, that we need to deal with.

So again, these definitions of reproducibility can vary. In general, the problem I'm really interested in is going from the system specs, your data, algorithm, hyperparameters, and analysis, to consumers of the results of one or more models. You have this sort of pipeline, and you want to have reasonable confidence that the pipeline works consistently and that the description of the pipeline can be repeated. As I mentioned earlier, reaching consistent conclusions is particularly important in this question of reproducibility as I've framed it in terms of software.

If we're going to do that, then we need to think about the consistency and precision of the setup needed to get to those conclusions, because fundamentally that is our problem, and these sorts of system specifications are the ones I'm going to be talking about.

If we think about what a reproducible pipeline is, and about the kind of software that you've written for your problem and your analysis, you might have something along these lines: the data that you've collected that describes the scientific question you want to answer, your dependencies, your hyperparameters, the scripts to run your jobs, your analysis code, and the documentation to explain what you did.

Oftentimes this documentation comes in the paper, or in a README or something like that. But if anyone has actually dealt with these sorts of problems in real life, and I love this xkcd comic, it's a lot more complicated when you actually have to start working with other people's code. So who's actually had the problem where they tried to run someone else's code and got tied up just getting things started?
B
Okay,
great
so
this
is
this
is
this
is
like
a
fundamental
problem
that
I
think
is
particularly
important
and
and
we're
going
to
discuss
more
on
so
so
I
think
dependencies.
There
actually
is
an
under
sundar
emphasised
problem
in
the
reproducibility
literature
and
is
the
one
I'm
going
to
focus
on
so
nurbs
2017
had
released
code
links
in
their
schedule,
who's
familiar
with
Europe's,
okay
Europe's
is
like
a
mainstream
machine
learning
conference.
B
It's
where
people
publish
fundamental
work
in
machine
learning
and
at
their
conference
they
included
links
to
their
papers
and
included
links
to
where
the
authors
said
they
would
have
put
code
in
their
camera
ready,
so
I
just
crawled.
All
of
that
and
so
2017
everyone
shared
their
paper.
We've
got
a
cat
who
got
a
presentation
and.
B
Little
more
care
of
their
code,
but
most
of
those
people
shared
their
code
on
github
and
within
the
sciences,
at
least
this.
This
number
is
about
the
same
so
within
this
is
a
study
from
Victoria,
Stodden
and
her
colleagues,
and
that
in
of
180
papers
in
the
journal,
science,
36.1%
provided
data
and
code
when
contacted
this
is
a
study
in
which
they
were
contacting
authors
to
request
code
and
about
36
of
those
respondents
ended
up
providing
code.
B
Similarly,
in
the
machine,
learning
literature
only
about
seven
source
actions,
and
so
again
going
back
to
this,
at
least
within
the
machine
learning
literature
people
who
are
producing
these
algorithms,
that
you're
using
for
your
research
or
a
baby
you're
building
on
pond
to
do
new
novel
algorithms
of
the
people
who
do
share
code.
The
overwhelming
majority
of
them
are
sharing
this
code
on
github
who's
using
github
right
now
like
publish
their
research
using
other
people's
code,
okay,
yeah,
so
I
mean
I,
mean
you
hear
all
evidence
of
this
as
well.
B
Right
github
is
now
become
a
standard
for
open
source
yeah.
Oh,
we
were
about
to
get
there
about
the
other,
we're
about
to
get
there.
Okay,
so
I
would
simplify
another
nine
percent
of
these
papers
by
like
snooping
around
through
the
actual
links
they
provided
like
some
people
are
providing
like
a
link
to
the
website
or
instead
of
the
actual
code.
So
we
keep
going
down
the
rabbit
hole.
They
end
up,
getting
to
actually
get
hep
repository.
B
So
now
that
we're
in
the
universe
of
people
who
have
shared
their
code
on
github
among
these
knobs
papers,
an
overwhelming
majority
of
them
we're
publishing,
get
recovery
flows
in
Python,
so
over
half,
and
if
you
look
even
then
like
most
of
these
libraries
are
like
open
languages
right
I
mean
MATLAB
is
like
a
is
a
number
two,
but
even
then
that's
like
much
lower
than
that.
Jupiter
notebooks.
B
It's
considered
a
language
I,
don't
consider
Jupiter
notebooks
a
language,
but
you
know
duds
but
yeah,
but
so,
if
you
hope,
presumably
some
of
those
github
ones
are
also
including
Python
and
then
from
there.
You
have
a
number
of
languages
that
are
also
again
like
and
there's
an
additional
44
repos
that
have
at
least
some
Python
in
the
repository,
and
so
this
sounds
pretty
good,
as
I
mentioned,
because
at
least
now
we're
on
in
a
universe
in
which
everybody
sharing
the
code
of
the
same
place.
Who has used Docker? Okay, great, awesome; you're better than some of the machine learning audiences I've given this talk to, where it's not everybody by any means. So then, who's familiar with Jupyter's repo2docker? Okay. Repo2docker, and I'm not taking it personally, is an open-source project that I'm involved with, which handles this. Who has actually written their own Dockerfile? Who has gone through the process of having a Dockerfile written? Okay.

That's three. Repo2docker is a tool that takes some standard configuration files, like the ones we might discuss here, and helps with the creation of a Docker image based on those configuration files. The configuration files establish the environment, and the environment is what is necessary to run the code; that is how repo2docker works.

More repos may have been added since, but at least at the time I published these results, few had a setup.py, which is what pip uses to install a Python package. Even then, if you look, this is actually pretty rare: for NeurIPS 2017, most people were not publishing the dependencies of their libraries in a machine-readable format.
Where a setup.py was present, I think they were trying to use it so that the repository could be installed as a Python library. And again, to the point from earlier, doing this is really easy. Most of you actually know this, so you already know the take-home point of this talk, which is: run conda env export or pip freeze. Keep doing that; it makes me happy.
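As a minimal sketch of the same idea without leaving Python (the file name and pinning format here are my own choices, not something prescribed in the talk), you can write out the installed packages and their versions programmatically:

```python
# Write a pinned list of installed packages, similar in spirit to `pip freeze`,
# using only the standard library (Python 3.8+). repo2docker-style tools can
# then rebuild the environment from a file like this.
from importlib import metadata

with open("requirements.txt", "w") as f:
    for dist in sorted(metadata.distributions(),
                       key=lambda d: d.metadata["Name"].lower()):
        f.write(f"{dist.metadata['Name']}=={dist.version}\n")
```

The point is simply that the exact versions end up in a machine-readable file that lives with the code, rather than in prose in the paper.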
Basically, if you have those sorts of files, a requirements file or a frozen environment, then you can build a Docker image automatically with repo2docker; you're good to go, so try it out. One of the things you saw in the video we discussed earlier, with repo2docker opening up a Jupyter notebook, is that avoiding dependency hell allows us to build and serve a repository anywhere, because you can then build a Docker image.

I'm not necessarily a systems expert, but roughly speaking, Docker gives you a lightweight virtual image of what you're working with. It allows you to have your dependencies and the whole environment set up and handled in a more lightweight way, which gives you much more control of your environment, and you can have a lot of different kinds of environments going at the same time; it makes things a lot easier. Repo2docker has actually also been used in competitions.

This is a musical genre classification competition, and repo2docker was used for it. It's a machine learning competition in which many people are using GPUs, so this is an example of people using repo2docker with GPUs and for submitting competition images for independent evaluation.

You'll see this here: I'm filling out this form, and it takes a GitHub link. If you look at the log, it says it found a ready-built image of this GitHub repository, which is basically the repo2docker part, and then it launches a Jupyter server from that built image. So it should be going.

This is a NeurIPS paper on machine learning for jets in high-energy physics, and you can rerun someone else's analysis: this is their notebook with their analysis, and with JupyterLab you can actually put everything together. You have the paper here, you can have the analysis, you can look at everything, and you can run it on someone else's compute, and you never have to install anything; you just start running the notebook.

So now we've talked about dependencies, and how dealing with dependency hell, and getting it out of the way, allows us to run the code. I'd like to think next about what we mean by good analysis code. In various venues, code submission is becoming more important. Who here is submitting to venues or journals that require a code availability statement or code submission right now?

At least in the machine learning literature, sharing code is associated with higher citations: you can see here that the average number of citations is generally statistically higher. But this does not control for the institution, the code quality, and so on, so take it for what it's worth.

Here is an example that was published around 2004, one of the early implementations of its kind at the time; it's code that the author published with their paper. So at least at some point this is something people tended to do; it's not totally uncommon, and it actually was useful for the open-source project.

It's still available online in case you ever want to do any Bayesian nonparametric clustering; the algorithms are all there. But despite the fact that machine learning researchers love thinking about the state of the art, for many use cases, and I'm assuming for many of your use cases, state of the art isn't necessarily the primary thing you're considering when selecting an algorithm for your problem.
These models are all pre-trained models out of libraries like TensorFlow or PyTorch. They are standard implementations that you haven't had to carefully engineer yourself, and that people are using. Who is using pre-implemented algorithms for their research right now? Who is not rolling their own models?
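As a minimal sketch of what that usually looks like in practice (my own illustrative example, not code from the talk, assuming the torchvision package is installed), here is how one might grab a standard pre-trained image classifier and adapt it to a new task, in the spirit of the transfer-learning example discussed next:

```python
# Load a standard pre-trained model and swap its final layer for a new task.
# The model choice and the number of classes are arbitrary placeholders.
import torch
import torchvision.models as models

num_classes = 5  # e.g. the number of categories in your own data set
model = models.resnet18(pretrained=True)  # newer torchvision uses a `weights=` argument
model.fc = torch.nn.Linear(model.fc.in_features,  # replace the ImageNet head
                           num_classes)

# From here you would fine-tune on your own data with a normal training loop.
```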
Yeah, I feel like most people are doing that; a lot of people are doing it for baselines these days. In terms of how people are using pre-trained models in the sciences, this is a paper on skin cancer detection: they used a pre-trained image classification model to start the training of their image classification task for skin cancer, ended up publishing it in a medical journal, and established really great baselines for automatic image classification for skin cancer. This has led the community to take greater interest in careful documentation of these pre-trained models. So this is a paper on documenting pre-trained models, so that people who use them can understand how the model was trained, what data it was trained on, and what the model itself is, such that practitioners have a better understanding of the implications of model choice.
This is a discussion between two authors on a Bayesian paper that was doing some sampling for its computation. You can see the author realize that there's a bug in their scientific code; this actually resulted in a paper on the testing of MCMC, and ultimately the authors had to retract the paper and then reissue it with their new result.

Similar things have happened in machine learning. Here is an RL paper where the authors note that the paper was inspired by a bug that was found in some colleagues' code (not co-authors, sorry, colleagues): they ended up looking at the implementation, finding something was amiss with the analysis, and ended up building on that.
So let's think about best practices for research code. None of this is necessarily novel from a software engineering perspective, but I think it's something that we need to think about as scientists. When you are publishing code, and I hope you all publish code associated with your papers, here is a little reminder, at least, to think about how we do testing, how we can include testing and code review, doing good design, and thinking about variable names. So, not foo and bar: come up with meaningful variable names. Even for parameter names, instead of calling something lambda, maybe we name it for what it really means, something like the learning rate. Have variable names that are really expressive, and be willing to refactor our code when necessary.
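A tiny, hypothetical before-and-after of that naming point (my own example, not from the slides):

```python
# Before: terse, ambiguous names.
def upd(w, g, lmbda):
    return w - lmbda * g

# After: the same one-line gradient step, but the names carry the meaning.
def gradient_step(weights, gradient, learning_rate):
    return weights - learning_rate * gradient
```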
To give a deeper example of this kind of practice: in the sciences, correcting a published result is a lot more involved, a lot more challenging, and oftentimes there's a lot more trepidation involved. Whereas in this case, someone is finding a bug in an underlying library, which is important to know, because ultimately we are judging our conclusions based on the software; and at the same time, this is considered an improvement for the general community. It's relatively straightforward: somebody finds the bug, somebody fixes it, everybody benefits. But that's not really the case in the sciences.

If I find a problem with your analysis, or if I think there is something that can be improved, going through the whole process of a retraction carries a lot of trepidation. So one of the things I want to leave you all thinking about is: how do we bring the sciences closer to this sort of model, where incremental improvements and collaboration are considered a positive? I don't necessarily have an answer, but I want to leave this with you.

It's something to be aware of, both in the sense that your underlying software does have bugs, and the libraries you work with have bugs, and also in terms of thinking about how we foster that kind of environment in our research.

So, yeah, this is basically one thing that I don't necessarily have an answer for, but I want to leave you all thinking about it a little bit. We've talked about dependencies, and we've talked about writing good software, and hopefully these pieces give you a lot to think about in terms of implementing your own models and doing your analysis. But maybe that's not enough to ensure the reproducibility of our own research.

These days, research is very computationally expensive and very energy intensive, and so a lot of these results may not be reproducible for the average user or the average researcher. Facebook recently published a reimplementation of AlphaGo and AlphaZero, but they are one of the few institutions that have the computational resources to ensure that their implementation works. This is ELF OpenGo, the reimplementation that they have; it's available to the public. But again, making this work takes these large-scale GPU farms working together, passing gradients around and coordinating so that they can do distributed training and achieve superhuman performance at Go. And Go is not necessarily the important thing here; what you really want to know is the underlying ideas of the algorithm that allow it to work. There's a similar problem elsewhere. Who here does neuroscience?
Okay, good. So at least I don't have to assume that someone here knows this library better than me; I don't use this library myself, but this is a study of the results of an open-source library for brain volume measurements, and it looks at the effect of the system setup on this library's performance on neuroimaging data. It sounds relatively straightforward; you would think the results would be relatively consistent, but they're not.

To me, that's very scary, absolutely frightening, because you might end up drawing dramatically different conclusions based on how your setup is designed. And these are the types of decisions and details that many researchers don't even necessarily consider. We talk about saying, okay, you need numpy, but how many people actually specify the version number of the numpy they're using? How many people specify the underlying type of BLAS they're using?
Which BLAS implementation are they using, and which version of it? What operating system version are they using? Not everybody includes those details.
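A minimal sketch of recording exactly those details alongside your results (my own example; the output file name is arbitrary):

```python
# Capture the interpreter, OS, and numerical-library details that this talk
# argues should travel with every result.
import json
import platform
import sys

import numpy as np

environment = {
    "python": sys.version,
    "platform": platform.platform(),
    "numpy": np.__version__,
}

with open("environment.json", "w") as f:
    json.dump(environment, f, indent=2)

# numpy can also report which BLAS/LAPACK it was built against:
np.show_config()
```

None of this is a substitute for a container image, but it makes the setup you actually ran visible.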
So, with regards to machine learning: you can unit test your implementation, but given a lot of these details, you can't necessarily unit test the results. It really depends.
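As a small sketch of the kind of implementation-level test that is possible (the function under test and the tolerances are placeholders of my own, not from the talk):

```python
# Implementation tests can pin down deterministic pieces of the pipeline, even
# when the end-to-end result is not bitwise reproducible across machines.
import numpy as np


def softmax(x):
    """Numerically stable softmax, the kind of small unit worth testing."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()


def test_softmax_sums_to_one():
    rng = np.random.default_rng(0)
    probs = softmax(rng.normal(size=10))
    # Tolerances acknowledge that floating point differs slightly across BLAS builds.
    np.testing.assert_allclose(probs.sum(), 1.0, rtol=1e-6)
    assert (probs >= 0).all()
```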
Ultimately, yes, if you have a unit test, it might work for that case, but you still need to be very, very particular about the environmental setup of what you're running. And even then, when you're doing something on a distributed system, it's not certain that the distributed system will give you consistent results across machines, because one of the things that mattered in the study they discuss is the workstation.

So if you're running this on a distributed system and you're not necessarily hitting the same machines, you might not get the same result. Yes, although one of the things that we don't have a full handle on, given that these implementations are non-deterministic, is what our fault tolerance really is. Yes, but even that is problematic; we can get into that, and we'll discuss it.
Even with all of this testing, down to the actual type of workstation and the actual kind of setup, even when you try to be as meticulous as possible, that doesn't necessarily mean the work is reproducible for all time. This is a paper in fluid dynamics from Lorena Barba; there's a lot of really great work on open science from her, and I would definitely read all of her stuff. She also does a lot of work on alternative scholarship models, and she's really involved in the SciPy conference.

Anyway, she finds that even with really careful protocols within her own research group, she is unable to reproduce her results from previous years, simply because of changes in hardware, in CUDA, and in the underlying software.

She uses GPUs for fluid dynamics, and finds that because of the changes in hardware that occur over something like a five-year period, and changes in CUDA, you can't get the same results, and you have to work very, very hard to get consistent results across these sorts of changes in software and hardware over time. So even when we think about reproducibility, we may need to think about it with regards to some narrow band of time, given the fact that changes in hardware and software are going to happen.
There's a related issue people run into with containers. I was talking with some of the Singularity folks, and they find that, because of how Docker is set up, when you're working with a GPU you essentially have to get out of Docker in order to hit the GPU, so you're not necessarily going to get the guarantees you want; multi-GPU setups with Docker may not work the way you expect.
And when we think about these sorts of concerns, it might not even make sense to make all of our research fully transparent. Many of you have probably worked with sensitive data, data that you can't share; there are certain concerns with sharing, either legally or ethically, and you've had to get an IRB for some things.

Here, publishing everything was probably not the right choice in terms of transparent research: these researchers ended up publishing all of their data and results for OkCupid, collecting people's information off of OkCupid, and the public did not respond well to having people's personal information aggregated and made public online. So when we think about data sharing, we have to think about ethical considerations as well.

Now, when we think about what science and the scientific method are: ultimately, the scientific method has to do with the testing of hypotheses. You form a hypothesis, you say this is what I expect to happen, you identify all of the experimental factors that you want to control for, and then you test in relation to those. Statistical testing is a really old practice; this is probably the earliest statistical testing research I have found.
B
Which
is
on
the
likelihood
of
the
birth
baby,
males
and
baby
female,
and
so
they,
the
the
scientist,
was
looking
at
the
Cystic
alike.
We
heard
they
may
be
born
for
employer
girls
and
find
that,
through
statistical
testing
and
birth
records
data
in
Britain,
they
were
able
to
establish
that
they
were
equally
likely
and
under
which
they
this
they.
They
argued
for
divine
providence,
yeah,
yeah,
divine,
okay,
not
private,
probably
Nantes,
Providence,
divine
providence,
and
so
a
statistical
testing
is
something
that
many
of
you
probably
use
for
in
your
own
research.
B
But
it's
something
that
has
been
underutilized
in
the
machine
learning
literature.
This
is
some
of
the
work.
That's
been
done
for
machine
learning,
statistical
testing
of
multiple
classifiers
from
2006
I.
Think
it's
one
of
the
later
works
that
was
done
and
follow-up
work
on
this
has
not
necessarily
followed.
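As a minimal sketch of the kind of test that line of work describes (the numbers here are invented placeholders, and the specific choice of the Wilcoxon signed-rank test is mine, though it is one of the tests that literature recommends for comparing two classifiers over multiple data sets):

```python
# Compare two classifiers' accuracies, paired by data set, with a
# non-parametric signed-rank test rather than eyeballing the means.
from scipy.stats import wilcoxon

# Accuracy of model A and model B on the same ten benchmark data sets (made up).
acc_a = [0.81, 0.74, 0.90, 0.66, 0.77, 0.83, 0.69, 0.88, 0.72, 0.79]
acc_b = [0.79, 0.75, 0.87, 0.65, 0.74, 0.80, 0.70, 0.85, 0.70, 0.78]

stat, p_value = wilcoxon(acc_a, acc_b)
print(f"Wilcoxon statistic={stat:.2f}, p={p_value:.3f}")
# The p-value says how surprising the paired difference would be by chance,
# which is exactly the question a raw leaderboard comparison skips.
```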
This is a paper on the reproducibility of RL baselines, and they find that, despite trying various kinds of statistical testing, statistical testing for reinforcement learning, at least, is not established right now, either in terms of best practices or in terms of the theory to handle these sorts of problems. So additional work needs to be done, especially in RL, to create better statistical testing for the comparison of models. And given that I might be arguing that we should be doing statistical testing of our machine learning models, many of you are probably familiar with the problems around p-hacking and around manufacturing statistical significance in other sciences.

The Open Science Framework is another database that allows not only the coordination and maintenance of your own scientific studies, but also the pre-registration of studies, so that you have all of this in place prior to doing your analysis. You have a very specific plan of what you want to do, and then you just go off and do it, in a way where everybody has agreed to the analysis beforehand; whether or not it comes out statistically significant, you at least know, after the fact, that you came up with a distinct plan of what you wanted to do and then went and did it, rather than doing it the other way around, which often leads to concerns around p-hacking.

In the scientific literature, at least in studies outside of machine learning, they have found that pre-registration leads to the publication of negative results. A lack of negative results can be very problematic and challenging for a lot of fields, because then we don't know what doesn't work. And they have found that registered reports, which is basically the model that allows for the publication of peer-reviewed, pre-registered studies, make publication independent of significance. You register a study, it is peer reviewed, and then, based on the quality of your analysis plan and the quality of your methodology, it is accepted for publication; they say, yes, you can publish this, come back with your results.

But machine learning research might be a little bit different when we think about pre-registration. In the machine learning literature you don't necessarily have to collect novel data; for many of your problems, you already have the data on hand as well. So then, how would we do pre-registration in our own fields? Maybe many of you have not done registered reports; is there an opportunity for us, as scientists, to adopt registered reports in our own communities?

One alternative way of thinking about registered reports is to do registration at the time of review. In this scenario, we might say: I'm going to describe my model, describe my analysis, describe my plan, and then the actual running of the code will be done by an independent body, an independent agent, and those results are the ones that are published. I'm just saying this could be done, not that it hasn't been done anywhere, but it isn't established.
At least in the machine learning community, a number of research groups and sponsors have cloud companies. For example, some of the main sponsors of machine learning research are Amazon, Google, and IBM, which is a huge funder of machine learning research, and Microsoft as well; they all have cloud services. And even universities have their own compute.

We can imagine, for example, that the submission of the Docker image is the registration, and then someone else takes that and runs it, and the results produced by someone else are the results that are published; rather than only the results you got from running the whole experiment on your own.

Maybe these examples that I have given of registered reports and of statistical testing are too computationally simple. So let's talk about how we would do statistical testing and controlled experiments in another domain. I'm not necessarily the expert with regards to high-energy physics, but I think, at least in other domains of experimental science, this is part of the reason why Rahimi and Recht identify physics as a possible area for us to consider, for the testing of black-box behaviors and models, and for thinking about how we do testing in general. I think there are a few high-energy physicists here who know this problem better than I do, so I apologize for describing high-energy physics not necessarily well. But an example of what you might think about in terms of high-energy physics is that there are multiple institutions that are able to validate and confirm each other's work, and so we can imagine something similar for registration.
This is the Large Hadron Collider, and people can share resources in terms of compute, for registration and for validation. And even in this example, the physics community has a better understanding of the things they want to test for and the things they want to control. In the machine learning community, we don't necessarily have these clear definitions.

We think we just want to establish state of the art, but we aren't necessarily careful about trying to control for all these other experimental factors, and we don't necessarily know how to test for them. So this is something for which I think we need to develop better methods and tools. As I mentioned, high-energy physicists are interested in the discovery of new particles, and when they think about these sorts of questions, they think about the thing they want to test and the nuisance parameters.

Maybe this means you might have to run your analysis a lot more times, with a lot more variability in terms of your system setup, in order to have these sorts of findings confirmed. One of the questions I'm interested in leaving you all to think about is how you do statistical testing and how we develop our statistical tests; I think that's a really important question for all of us to consider as we move towards more deep learning models.

If we were to borrow this practice: many of you are probably familiar with how you do hypothesis testing in your fields, but at least in machine learning, we don't necessarily think about how to formulate hypotheses.

One of the things we're interested in, at least in machine learning, is thinking about the expected behavior with or without some sort of experimental change in our model setup, recording outcomes with regard to various baseline models while considering this sort of variability, and then reasoning about some sort of statistical improvement and making some assumptions about what that might look like. This is some work that I recently presented with Michaela, who's in the room.

The poster is kind of fun; I really like this poster. As I mentioned, being able to establish these sorts of systems-level distinctions, and make them independent of the data and independent of our other experimental selections, is really important, and I think this is something that the machine learning community, at least, is beginning to adopt. I assume that many of your other fields think about these sorts of questions in a deep way.

But I know that this is something machine learning is only beginning to wrap its arms around, and I, at least, am interested in talking to all of you about how you are working on those problems, with regards to bringing machine learning into your fields and thinking about experimental testing and experimental controls given black-box models, because this is something the machine learning community hasn't fully dealt with, and we need to come up with good practices for these sorts of questions.
One of the problems we think about in terms of statistical controls in machine learning is compute. Obviously compute is a challenge, but being able to control for it when establishing a result is something we don't necessarily know how to do. Controlling for the random seed, and actually keeping random seeds consistent between machines, is still hard. So to be able to say that your result is independent of your seed selection, you will probably have to run it a lot more times, and on more machines, in order to establish that your result is consistent regardless of which machine and which random seeds you ended up picking.
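A minimal sketch of what "run it a lot more times and aggregate" can look like (the number of seeds and the summary statistic are my own choices, not a prescription from the talk):

```python
# Report a metric as a mean with a simple bootstrap confidence interval over
# several seeds, instead of quoting the single best run.
import numpy as np


def evaluate(seed: int) -> float:
    """Placeholder for your real train-and-evaluate pipeline."""
    rng = np.random.default_rng(seed)
    return 0.75 + 0.05 * rng.standard_normal()  # pretend accuracy


scores = np.array([evaluate(s) for s in range(10)])

rng = np.random.default_rng(0)
boot_means = [rng.choice(scores, size=len(scores), replace=True).mean()
              for _ in range(10_000)]
low, high = np.percentile(boot_means, [2.5, 97.5])

print(f"accuracy: {scores.mean():.3f}  (95% bootstrap CI: {low:.3f}-{high:.3f})")
```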
But I think all of these concerns about the verifiability and the consistency of results belie a more fundamental point, which is: is verification, and checking assertions, really what we want? Or are we just trying to do all of this so that we can catch people out?

Oftentimes what we're actually trying to do is use other people's code, because we have a problem of our own that we're trying to solve; we're not just going off trying to hunt other people down and shame them. So fundamentally, I think the purpose of having the code is to illustrate the problems and the findings you're working on.

And when it comes to what we are showing: are the implementation details in the paper the main thing that we're looking for? I argue they're not. The most important ideas in the paper are not which operating system you used or which version of numpy; all of that is important, but it's not the main point.

The main point is the underlying ideas that you're trying to explain, or the underlying findings that you're trying to communicate. This is a paper on reproducibility in computational science that is often cited, and it makes a very radical claim that I think is actually still radical today: the scholarship itself, in computational science, is not the paper. The paper is just a description of the idea. And this is from around 1995, so it's an old idea, and even still it hasn't been fully realized: the actual scholarship is the software that produced the result. If you're doing computational science, the actual deliverable of your finding is your code. Your code got you that result; that's your scholarship. These implementation details will be in your code; they don't necessarily need to be in your paper. And hopefully, then, we can build software that helps people do their own research, that helps them understand problems by enabling them to use that code, as opposed to focusing on the details of how to get the code to even run.

So maybe this means we need to be communicating with each other outside of the PDF. How many of you are interacting with each other on GitHub?
Open-source code is becoming a fundamental part of how we're doing science today, and so we can start moving science towards a model in which open source becomes a greater part of how scholarship is shared. One of the things that is important, in terms of thinking about what we're doing, is that we need transparency, but we ultimately also still need results to be independently verified. An example of what we might be thinking about, then, is simple implementations.

This is AlphaZero, and this is an example implementation that uses tic-tac-toe. It has the model of AlphaZero, which allows general-purpose training of models for two-player games, but it also includes tic-tac-toe, and tic-tac-toe is actually much more useful to the average scientist, because it can run on your own machine, it doesn't cost as much money, and it has all of the implementation details of the algorithm. So we can run the whole thing on our own, and we can even plug in a different game.
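As a purely hypothetical sketch (my own, not the repository's actual API) of why a simple implementation makes "plug in a different game" possible: the trainer only needs a small, game-agnostic interface, and swapping tic-tac-toe for another game means implementing a few methods like these.

```python
# A minimal two-player game interface a general self-play trainer could target.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TicTacToe:
    board: List[int] = field(default_factory=lambda: [0] * 9)  # 0 empty, +1/-1 players
    player: int = 1

    def legal_moves(self) -> List[int]:
        return [i for i, v in enumerate(self.board) if v == 0]

    def play(self, move: int) -> None:
        self.board[move] = self.player
        self.player = -self.player

    def winner(self) -> Optional[int]:
        lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
                 (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
        for a, b, c in lines:
            if self.board[a] != 0 and self.board[a] == self.board[b] == self.board[c]:
                return self.board[a]
        return 0 if not self.legal_moves() else None  # 0 = draw, None = still playing
```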
Distill is a venue that allows for explanatory science in the machine learning literature, and this is an example on differentiable image parameterizations. All of these are Colaboratory notebooks that communicate the ideas presented in the paper, and any of these examples can be independently run and interacted with in the cloud: Colaboratory allows people to run a GPU on Google's cloud, so they can run the notebook but also ask their own questions.

This is an example for a paper we discussed earlier, Attention Is All You Need, which I mentioned alongside work establishing that implementation improvements in NLP don't necessarily give consistent results. It's available online as a notebook: you can just go through it, see an independent implementation, and work through it on your own. And to go back to what I mentioned earlier about notebooks: a lot of you have used notebooks in the past.
Many of us use notebooks not necessarily to present science, but just to interact with a piece of data. In this example, someone is using IPython; they're trying to load the data and plot it. Often, when we're trying to think about a problem, we interact with it computationally: we just want to see what happens and begin to ask our own questions independently.

What I'd like you all to think about is that, through open implementations, really good implementations, we can begin to immediately build off of other people's code. This is an example of a research repository on Binder. In this example, they're doing some machine learning for explainability and interpretability, and they are presenting their analysis; on the left, this is their notebook. But because I have the code, and because I am able to run everything independently, and, more importantly, because they're using this example of creating simulated data with four classes, I'm now able to say: what if you did this experiment with five classes? How does it behave then? I can create my own simulated data and run that novel experiment, asking my own independent questions of their model and their method, without having to do any additional implementation work, because it's all there.
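A minimal sketch of that kind of "what if" follow-up question (my own illustration, assuming scikit-learn; the original repository's data-generation code may look quite different):

```python
# Re-run the same kind of experiment with five classes instead of four by
# generating new simulated data and fitting the same off-the-shelf model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=6,
                           n_classes=5, random_state=0)

model = RandomForestClassifier(random_state=0)
print("5-class accuracy:", cross_val_score(model, X, y, cv=5).mean())
```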
If this were another repo that was not as well maintained, and I didn't have this sort of setup ready to go, it would have taken me a lot more work just to get to the point of asking, well, what if you did the experiment this way? I'm sure many of you have gotten reviews asking: why did you do the experiment this way, why didn't you do it that way? How many have gotten that response?

More importantly, I think we need to think about what we really mean by the term "reproduce". That's part of the reason why I'm not particularly interested in the definition of reproducibility, because, ultimately, what I'm really interested in is the question of how we get research to beget better research. That's ultimately what the purpose of science is: to create knowledge such that other people are empowered to ask new questions based on that knowledge. Oftentimes we would call this an extension of previous work.
So fundamentally, that means I'm interested in the question of how we build extensible science: how do we build extensible computational pipelines, such that other people can immediately build off of what we do, and ensure that the community of researchers around a particular question grows?

This is the same data set that I discussed earlier, the NeurIPS 2017 data set; we were talking about citations and about research practices with regards to dependency setup. One of the things I wanted to point out here is that when we look at people who include dependency files, files that can actually identify what software was used in a machine-readable format, we find that the people who publish these details have more engagement on GitHub.

When you fork a repo, that probably means you're going to use it for your own purposes. You could be using it just to make a pull request, which is great, but you could also be using it to repurpose the repo for your own problem, and notably, more people are repurposing these repositories for their own problems, using these machine learning algorithms. That means people are applying these methods to their own problems.
Along the same lines, last year I presented, with some colleagues, an implementation of Binder that included GPUs. We took a bunch of repos of machine learning algorithms that were available on GitHub and ensured that they could be put on Binder and run with a GPU, so you could just log on and get a GPU to yourself to rerun the analysis. So it can be done.

If you have any more questions about how to work with repo2docker and GPUs, please stop and find me. Having a place where repositories can be run independently is, I think, particularly important and worth thinking about, and I would love to talk about this with you more.
I think one of the things I'm really interested in, and one of the things I was pointing at with this slide, is that one of the challenges of machine learning's focus on state of the art is that there is a less developed community for empirical research that tries to understand the behavior of machine learning models.

This is something that I am working on, and it's a really important question. I think drawing attention to it in the community, and trying to develop it as a field, is a really important problem, and it's one of the things I'm interested in working on, but it still hasn't been quite established. There is interesting work, though.

There are workshops that try to address this: there's an ICCV workshop that is interested in a pre-registration model, which is one example of something that might help, and there's a recent workshop I went to at ICML on understanding phenomena in deep learning. But it's still a young field, and bringing the community along, so that people know how to review these sorts of papers and how to reward this kind of work, is still an ongoing process. Does that answer your question?

To me, it goes back to the venues. Ultimately, the venues define what the unit of scholarship is, the venues define what the method of evaluation is, and the venues define what a good paper is and how to reward good work. So to me, fundamentally, we need to go to our communities and argue for research that values software. There is interesting work here; there are journals like the Journal of Open Source Software.

That is something we need to do, but fundamentally, we could also try to talk to our research communities and our departments and have those sorts of conversations about funding. To me, ultimately: if you tell me to submit a Docker image that spits out the PDF, rather than just handing you a PDF, the venue still gets its PDF either way.

If we can move towards that, it would be better, but it is a challenge. I've been to conferences that think about scientific scholarship, and I think the journals are still trying to figure this out, and they're also facing a lot of challenges. In my mind, if we're thinking about what people would pay for in terms of scholarship, I think people would pay for hosting and compute for someone else's research.

How do we bring that in? How do we honor that sort of alternate form of scientific communication and formalize it? A lot of you are spending a lot of time talking about science on Twitter, promoting your work and communicating it there, and if there's a way we can do that better, such that we can connect it to code, that would be really great. But it's hard.