From YouTube: DevoWorm ML: Week 4 (Input Data lecture)
Description
Fourth DevoWormML meeting, September 25. Attendees: Richard Gordon, Bradly Alicea, Jesse Parent, Aidan Rocke, Vinay Varma, and Abraham Kohrman
Sure, I think we can at least get started now. So welcome again to DevoWormML, and thanks again to Vinay for last week's presentation. It was very enjoyable, and I learned a lot about what he was doing. After his presentation, I thought I wanted to make sure that people had the basics down, so I don't want to give a super technical lecture.
So this topic is input data, and basically the idea here is that the data you put into a model, or algorithm, or whatever, is going to be reflected in what comes out of it. I gave it a little Bayesian notation; it's cute, but it kind of means that there is a conditional relationship there: if you put bad data in, or something that isn't very good, you're not going to get a very good result.
It does hold true, however, that if you put very good data in, you can still get a bad result, but if you want to talk about that some more in future weeks, we can do so. Our focus here today is this processing of the data, to have a good, clean set of input data that we can then make inferences on. And so this figure is actually something from a cybernetics article I read; I can't remember which article it was, but I think it's a nice graphic.
So I wanted to introduce people who haven't visited it already to the DevoZoo. I mentioned, I think in the first or second meeting, that DevoWorm has data that we've collected from different places. This is developmental data; it's largely embryogenesis, although there are other examples in this repository.
Here is a link to the repository. We've basically assembled a bunch of datasets from different publications or from different sources. They're labeled here, and you click through, find a dataset in a given location, download it, and use it for different things. You might use it for training a model; you might use it for analysis, whatever. And we've had a little bit of interest in it.
I've used it for Google Summer of Code, and I've used it for people who are interested in contributing to the OpenWorm Foundation; I've given them some datasets right from here. So when we talk about input data, we're talking about various datasets that come from a place like this. You take, say, image data, or sometimes tabular data, put them into a model, and then, for example, categorize the data.
EMBL, for example, is a good place to find molecular data. If you're interested in gene expression or sequence data, EMBL usually has data associated with a publication, so you can download it in tabular form. You have to do a little bit of research on what the variables mean sometimes, but you can usually get a nice dataset that suits your needs. ImageJ has public datasets: ImageJ is an open-source image-processing platform run by the NIH, and they actually have public datasets used to calibrate the program for its different image-processing algorithms. There is also Kaggle (kaggle.com), a machine learning company, which hosts a number of open datasets. I gave this example here of avian vocalizations; this is someone who collected these data just by sitting out with a microphone.
There are also model-organism public datasets, so there are a lot of datasets for C. elegans, or for zebrafish, or for Drosophila, with data in tabular form taken from cell tracking or from other types of experimental setups. So there are different types of open data, and I wanted to bring this up in case people were wondering how you get open data and use it for these models. But they all share one problem, and that is that there's a criterion, or a set of criteria, that we have to evaluate the data on.
So that's something you have to think about, and it takes a bit of research on the researcher's part to figure out how they want to represent their data, or whether it's something they can easily deal with. And then: is it enough data? So in some of the instances we'll talk about today, is there enough data to do what you want, to make sure that the algorithm is trained properly, and so forth?
So there are four terms, all starting with the letter V, that are important attributes of data. The first of these is volume, which means how much data is available. When you download one of these datasets, how much data is there for you to use, to train a model or whatever? Maybe you're interested in a lot of samples, or maybe you're just interested in quantity in terms of bytes.
There's a lot of very high-throughput data out there, so there are a lot of bytes, but in terms of useful samples it might be a little tough to get useful samples out of that, and those are trade-offs you have to think about. Another V word is velocity. You might say, what does velocity have to do with data? It has to do with how much change over time the dataset captures.
A
So,
if
you're
interested
in
like
movement
of
like
a
cheetah
and
you
have
images
of
a
cheetah
moving,
but
maybe
you
only
have
six
images
that
are
sort
of
sequential
you
know
of
a
cheetah
running
across
the
plane.
How
can
that
really
capture
the
Cheetahs
movement,
or
do
you
need
much
more
data
for
that?
Similarly, if you're looking at gene expression, and you want to know about fluctuations in gene expression over time, and you have three time points, is that enough data to really capture what you want? You have to keep that in mind when you get public datasets and work with them: are you getting the right number of samples per unit time?
So maybe there's a better dataset for you to use. And then two more V words: variety, which is how much natural variation is captured by your data, and we'll talk a little bit later about how you might be able to improve upon this; and then veracity, which is how reliable the data are before and after transformation. Veracity, of course, implies truth: can you ground-truth the data against the biology?
If you have some data, and it's just something that you've collected from some repository, does it actually match the measurements, or other such datasets, that are out there? So those are a lot of V words to remember, but I think it's a good framework for approaching the problem of trying to take data and then apply it to some machine learning model. So how do we know if our dataset is usable?
This paper that I have here in this slide lays out a sort of taxonomy. There's usability: how easy is it to figure out what's in the dataset? Then the context of the data, the availability, the reliability (is the data reliable?), and the presentation quality. Those are all factors, and again, these are things you really want to evaluate your dataset, your input, on.
So we want to have something called training data to put into the machine learning model, which is a mathematical model or some algorithm, and then eventually we want to make a prediction. We saw this in the Digital Bacillaria work, where we had some model, we took training data, which were microscopy images, we applied it to the model, and then we made predictions about the shape of cells and their locations.
So the training data is, of course, very important here, because if I want to train the model to make good predictions, the training data has to have easily identifiable features, or, in the case of the Digital Bacillaria data, individual cells that the machine can parse. So all the stuff I've mentioned just previous to this is going to be very important for training, and for getting a proper model that can make proper predictions.
People sometimes think of machine learning as magic, but it's really not magic: you're taking data, you're taking a good model, and again, which model you choose is something you have to do a lot of research on, but you just have to match these up, and then you can get a good prediction.
So, when I talk about training data, we could use open data, but sometimes, for different tasks, people have specially benchmarked and assembled training datasets. This is the MNIST database, a famous benchmark dataset for generic machine learning applications. You can see it's a series of digits, from zero to nine, and it's just people writing these numbers by hand. But notice the variation.
This one almost looks like a trident, but it's still a four, and so forth. A lot of the point in showing this is that the input data should have a lot of variation in it. Each of these columns is a sample, which is something like an individual case: okay, this is one case, and this is another case, in a biological context.
You might think of it as individual organisms, or something like that. So you're showing the variation in this training set: you're lining it up and giving the model all the data, over and over, but with different variations on it, so that the algorithm can pick out that this is a four, but this is also a four, and this is also a four. It would be hard, if you just gave the machine one column, for the machine then to say, oh yeah.
This is a four, even though it's been training on this other four. The same thing with sevens: sometimes people put a cross through their sevens and sometimes they don't, and again the machine wouldn't necessarily know that sevens exist in both forms unless you show them to it through this training set. And so this is a benchmark set, meaning that it's something people can use in publications and say, we trained it on the MNIST database; people know what that means, it's always the same, and it should always give a very similar result.
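As a concrete sketch of that benchmark workflow (this example is not from the slides): MNIST itself requires a download, so this minimal sketch uses scikit-learn's bundled 8x8 digits dataset, assuming scikit-learn is installed, which plays the same role at a smaller scale.

```python
# Minimal benchmark workflow: many handwriting variants per class go in,
# and performance is scored on held-out samples. The small bundled digits
# dataset stands in for MNIST here.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()  # 1797 handwritten digits, classes 0-9
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# Fit on the training variants, then evaluate on unseen samples
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Because the split is fixed by `random_state`, repeated runs give the same "very similar result" that a benchmark is supposed to provide.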
And so the thing to remember about this is that biology, in particular developmental biology, does not really have these types of datasets. I mean, we have some that are famous, but there isn't a benchmark dataset where people know how it's going to perform when they put it into a machine learning model.
But there are also other types of benchmark datasets for training. If you're not interested in handwritten letters and numbers, there are celebrity faces: there's a database where they just give the model a bunch of faces of celebrities and train it on those. In this case you have a label, which is the name of a celebrity paired with a face, and then the algorithm, knowing all of that, can distinguish and identify faces.
There's the Stanford Cars dataset, which is, again, different models of cars, and again it's all about the specific shape and features of the car: if you present enough variation to the model, the model can distinguish between different models of cars. It's pretty simple. And the same holds true for the Iris dataset, which is interesting because it's actually a biological dataset, created by R. A. Fisher about a hundred years ago. It's basically measurements of irises, the flower iris,
across many different versions of the flower. So you assemble this dataset of irises with different shapes and different variants, and you can identify different individuals. Even if you have something that looks odd, that doesn't really even look like an iris so much, it has distinguishing features that allow the algorithm to pick up those features and then correctly identify it as an iris.
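Fisher's measurements ship with scikit-learn, so the classification task just described can be sketched in a few lines (an illustration added here, assuming scikit-learn; it was not shown in the talk):

```python
# Classify iris species from Fisher's four flower measurements,
# using cross-validation to score across different variants.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()  # 150 flowers, 4 measurements each, 3 species
clf = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
mean_accuracy = scores.mean()
```

Even the odd-looking individuals are usually classified correctly, because the four measurements carry the distinguishing features.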
That's something I wasn't going to talk about in this lecture, but since you bring it up, I will mention it. The Iris dataset is good for identifying irises, and you have a lot of variation there, but when you start getting away from the canonical or typical iris, the average iris, then the performance starts to degrade.
So what people have done in the more recent past is come up with something called adversarial training. This is the idea that you train the machine on irises, but you also train it on things that aren't irises but might look similar, or might be related things that aren't irises. As you pointed out, you can have variation in the dataset, but then you might present it with something that maybe just looks like an iris.
It's really ambiguous to the machine whether or not it is one. So you build an adversarial training set that you can use: say, tulips, or different varieties of flower, where you have different shapes that are maybe similar to the iris but aren't irises. You can put a label on those things, and then eventually, if the machine walks onto something that maybe looks like an iris but isn't, it can deal with those boundary cases a lot better.
So yes, people have used adversarial training for that, and it is important to give the model other instances. And again, what you're going to end up looking at is the success rate and the error rate of the model. The success rate would be: how often does it correctly identify an iris? And the error rate is: how often did it identify something that's not an iris as an iris? That's a topic that is very intricate, but I'm just giving you a high-level view.
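Those two rates can be computed directly from predictions. A tiny sketch with made-up labels (the values are hypothetical, added only for illustration):

```python
import numpy as np

# Hypothetical binary labels: 1 = "iris", 0 = "not an iris"
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 0, 1, 0])

# Success rate: fraction of true irises correctly identified
success_rate = np.sum((y_pred == 1) & (y_true == 1)) / np.sum(y_true == 1)

# Error rate: fraction of non-irises wrongly called irises
error_rate = np.sum((y_pred == 1) & (y_true == 0)) / np.sum(y_true == 0)
```

In the machine learning literature these correspond to the true-positive rate and the false-positive rate.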
Yeah, and I think they do call them that in machine learning; I just wanted to give a more colloquial example for people. But that's the idea.
We want to be able to identify things that are cars, or irises, or celebrity faces, and we want to reject things that are not. And the same holds for developmental biology: if we want to identify, say, a cell boundary, we want to find the cell boundary. We don't necessarily want to lock onto things that merely look like a boundary.
We want to weight, or search for, features that are likely to be a cell boundary. So now we get into something called pre-trained models. We have training data, which is input data, but we also have pre-trained models that we can use. I mentioned this during the talk on Digital Bacillaria; that's what we were using as a pre-trained model. The definition of this is an architecture that is validated by a benchmark dataset,
like the ones I just showed you, to solve a similar problem. Pre-trained models are common in deep learning and in natural language processing, which is a method usually used for linguistics research or text identification, but that's not something we're going to talk about today. Pre-trained models are reliant on that benchmark dataset, and the architecture, say a neural network architecture, is pre-weighted: the weights have already been found, and then you apply the model to something that's similar.
The question, of course, is what counts as a similar problem. A similar problem might be: if you train a model on identifying cars, it might also work for irises, because you're looking at things with features. Or you could create a pre-trained model just for some domain, like handwriting, and it may be very good in that domain. But then the question is, what does similar mean?
Can you transfer it to other problems? That's sort of the downside of pre-trained models: you don't know. And in some cases the model may be generalizable to a host of problems. So there are some pre-trained models, such as Mask R-CNN and various deep nets, that exist in the deep learning literature.
Those models are pretty well generalizable; people have used them on a wide range of problems and gotten pretty good results. But it's worth noting that a given pre-trained model may not be good for your problem domain. Especially in biology, where we have a lot of variation, and a lot of things that are moving, or have unclear boundaries, or whatever, that's going to be an issue to think about very hard before you assume a pre-trained model will solve everything.
And in deep learning, again, they have this wide range of pre-trained models: ResNet, Inception, VGG, trained on ImageNet; there's a huge list, and the acronyms aren't important. What's important is to recognize that a lot of these models exist, but not all of them are made for specific areas of interest or fields of research. And so again, here I have two examples of publications that mention pre-trained models. One is an arXiv preprint, "An Analysis of Deep Neural Network Models for Practical Applications".
They talk a lot about pre-trained models. And then there's "Opportunities and Obstacles for Deep Learning in Biology and Medicine", which actually gives examples of pre-trained models for biology and medicine, though they don't really propose any specific pre-trained models; they just go over the idea. What I'm saying here is that there's a lot of opportunity to perhaps come up with pre-trained models for specific types of data.
Pre-trained models also allow for clearly defined features and classes to be built from data of a specific type. You could have scenes, faces, curves, or even tomatoes; there's a lot of variation, but each of those types of things is very different in the way the variation is distributed. And here's an article, a blog post, highlighting ten pre-trained model types and the datasets that come with them; it's mostly non-biological stuff. And then, overall, pre-trained models allow for faster training.
But that's not always the case, so you might want to approach these models with caution, and here's a Medium article discussing why. Again, the reason is that whereas a pre-trained model might be very good for identifying cars, it might not be good for identifying cells. Still, pre-trained models are an option. So the next thing I want to talk about, beyond pre-trained models, is augmentation. So we have our input data.
Of course, we can't capture every instance of variation, and we know that the machine doesn't know how to automatically make mental transformations; it's not like the human brain in that way. It just knows what it sees. And so one of the things people do is augment their dataset so that they can give the model a wider range of examples of what these things may look like in the real world.
So, for example, you might have pictures of a dog at one angle, straight on, but also pictures where the dog is at an angle for whatever reason, or there's some skew in the way the dog sits; then the model can still identify the dog properly. Dogs are kind of a toy example, but you can imagine other things, like cars: if you took a picture of a car and the picture was at an angle, or the car was going up a hill,
you might not be able to identify it as a car. But if you give the model examples where the car is rotated in different ways, it can pick up the variation in how those attributes appear. Another example is this picture of a giraffe, where we have just photographs of a giraffe taken in nature. The thing is, it's not just important to have the information about the giraffe.
It's also important to have information about the shape, and maybe the shape of the background. So what we can do is create masks of each base image, and we can train the algorithm on the masks as well. This way the algorithm has shape information for both the object and the background, so it can identify the background elements and the foreground elements and then make a correct prediction.
These are just the citations for these images, but using data augmentation in this way allows you to make up for some of the variation you don't have in the input dataset. So if there's not very much variation in your input dataset, you might consider data augmentation to remedy that. I think Vinay actually used some data augmentation strategies in his project this summer, so it's important for biological work.
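The simplest augmentations, flips and rotations like the ones just described, can be sketched with plain numpy (an illustration added here, not code from the project):

```python
import numpy as np

def augment(image):
    """Return simple augmented variants of one image: flips and 90-degree
    rotations. Real pipelines also add small rotations, crops, noise, etc."""
    return [
        image,
        np.fliplr(image),      # mirror left-right
        np.flipud(image),      # mirror top-bottom
        np.rot90(image, k=1),  # rotate 90 degrees
        np.rot90(image, k=2),  # rotate 180 degrees
    ]

img = np.arange(16).reshape(4, 4)  # stand-in for a small grayscale image
variants = augment(img)
```

Each variant keeps the same label as the original, so one labeled sample becomes several training examples.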
So in a biological dataset, you might have animals with different striping patterns, or you might have cells of different shapes that are in the same class or type of cell. These are classes that you can label before you train the model, but they are also natural categories that the model will sort of lock onto. But your data aren't always evenly distributed in terms of these categories.
You might be looking at fibroblasts, which are a type of cell, with the different shapes that they take, and of course some shapes are going to be over-represented in your input dataset and some are going to be under-represented. What that means is that, to give the machine a little bit of help in identifying each class, you can artificially adjust the number of samples in a given class.
Up-sampling means that you have a number of classes, and of some classes you have very few examples, while of others you have a lot. So you would up-sample the ones you have very few examples of, by doing things like data augmentation, or maybe just going and finding more samples from that class, to balance out the input dataset so that you have a similar number of samples from each class. And down-sampling is the reverse of that.
Maybe you have a lot of samples of a certain class because it's a very common cell type, but you don't really need that many samples to train the model on what it looks like. So you might just get rid of a number of those samples and even out your input dataset so that it represents each class evenly. So this is a schematic here where the original dataset has maybe six instances of one class and sixty of another.
Well, you want to do something so that you can make this equal in terms of representation in the sample. You might combine these data in different ways so that you get an even number of samples per class. And there's a practical reason for this, of course, and that is that the machine maybe won't identify some of these smaller classes.
If you just leave it unbalanced like that, the algorithm is going to pick up on what it sees, and if it mostly sees things in the class where you have a lot of examples, it may just identify the ones in the rare red category as the same thing, because it doesn't know any better: it hasn't seen enough instances of the red class. So you want to even out the number of samples in each class that you give the machine to train on. And again, this is another figure from Google.
They explain it in terms of weighting each class by the number of samples, so you can approach this sort of systematically: you might calculate how many classes you have and how many samples of each class, and then figure out how much you need to up-weight or down-weight each one so that you have a uniform dataset. And again, these are the links for the images I took, and they also contain some more information about these kinds of strategies.
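Up-sampling, down-sampling, and class weighting can all be sketched with numpy, using the 60-versus-6 schematic just discussed (the numbers and features here are toy values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unbalanced labels mirroring the 60-vs-6 schematic
y = np.array([0] * 60 + [1] * 6)
X = rng.normal(size=(66, 2))  # hypothetical features

# Up-sampling: resample the rare class with replacement until balanced
rare = np.flatnonzero(y == 1)
extra = rng.choice(rare, size=60 - rare.size, replace=True)
X_up = np.vstack([X, X[extra]])
y_up = np.concatenate([y, y[extra]])

# Down-sampling: keep only as many common-class samples as rare ones
keep = rng.choice(np.flatnonzero(y == 0), size=rare.size, replace=False)
X_down = np.vstack([X[keep], X[rare]])
y_down = np.concatenate([y[keep], y[rare]])

# Alternative: keep the data, but weight each class as
# n_samples / (n_classes * class_count), as in the Google guide
counts = np.bincount(y)
class_weights = y.size / (counts.size * counts)
```

After up-sampling both classes have 60 samples; after down-sampling both have 6; the weights make the rare class count ten times as much per sample.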
Finally, I'd like to conclude by talking about synthetic and pseudo data. So we've talked about augmentation and balancing your dataset; now I'd like to talk about synthetic and pseudo data. This is something people use to provide training data for their model that they may not otherwise be able to get. The working definition is: a model of the data that generates something not found in the original measurement, or something that is not directly measurable.
So we have examples such as interpolation between samples, or dynamical modeling of a hypothetical regulatory process. Suppose we had two different types of images, and we trained the model on them; of course it can identify the two different types of images pretty well. But then we give it an input image that is in between the two, like maybe a hybrid organism. Will it be able to identify that sample as its own thing, or is it going to throw it into one category or another?
You would need to create some sort of synthetic data to give it information about that intermediate case. Again, even with regulatory processes, you're not necessarily going to get all the data you need. So if you want to look at some molecular pathway where you know the components, but you don't know all the data for each of those components, you might create a dataset that simulates that missing piece. And there are at least three approaches to this.
Some of this you may have encountered in statistics class: resampling of the data, using methods such as jackknifing and bootstrapping; small-sample inference; and then data-dependent priors, which is specifically a Bayesian method. Those are things you can look up on your own, but those are basically the three approaches that are in the literature.
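Bootstrapping, the first of those approaches, is easy to sketch: resample a small dataset with replacement many times and look at how the statistic of interest varies (a generic illustration with simulated values, not data from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(10.0, 2.0, size=30)  # a small "real" measurement set

# Bootstrap: resample with replacement many times and recompute the
# statistic (here, the mean) to estimate its variability.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
```

The percentile interval gives a sense of how much the mean could move around, even though only 30 real samples exist.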
So how might you make a pseudo dataset? You create labels by some protocol, so you would label these fake variables or fake instances. You might then use a distribution, like a Gaussian distribution or something, to create a series of plausible values. So you make an estimate, and you say: my process is Gaussian, or normally distributed, it's random, so we'll generate these values, and this is what they might look like in the real world, and that's how I'm going to model it. And then you might sample these fake data in ways that test hypotheses or allow for variation.
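That generation step is only a couple of lines; everything here (the mean, the spread, the Gaussian assumption itself) is an assumption you are making about the unmeasured process:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assume the unmeasured quantity is roughly Gaussian (an assumption,
# not a measurement) and generate plausible values around an estimate.
est_mean, est_sd = 5.0, 1.2
pseudo_values = rng.normal(est_mean, est_sd, size=1000)

# Label the fake instances so they stay distinguishable from real data
pseudo_labels = np.full(pseudo_values.shape, "synthetic")
```

Keeping the "synthetic" label attached matters later, when you evaluate whether conclusions depend on the fake data.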
So you might say: well, we know it generates these values, but some values are more likely than others, so we can make some other assumptions. And from that you have a nice synthetic dataset that you can then use to train the model on. And then, of course, you can also use labeling in the same way. There's something called pseudo-labeling, which is used a lot in semi-supervised learning. Again, this is where you create labels from unlabeled data: you guess what the labels should be.
But then you can also use the real information to filter out things that are obviously wrong.
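A minimal pseudo-labeling sketch, assuming scikit-learn (the 1-D data here is hypothetical): train on the few labeled points, guess labels for the unlabeled pool, keep only confident guesses, and retrain.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Small labeled set plus a large unlabeled pool (toy 1-D data)
X_lab = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y_lab = np.array([0, 0, 0, 1, 1, 1])
X_unlab = rng.uniform(-3, 3, size=(200, 1))

model = LogisticRegression().fit(X_lab, y_lab)

# Guess labels for the pool, then filter out low-confidence guesses,
# which are the ones most likely to be "obviously wrong"
proba = model.predict_proba(X_unlab)
confident = proba.max(axis=1) > 0.9
X_pseudo = X_unlab[confident]
y_pseudo = model.predict(X_unlab)[confident]

# Retrain on the labeled data plus the confident pseudo-labels
model2 = LogisticRegression().fit(
    np.vstack([X_lab, X_pseudo]), np.concatenate([y_lab, y_pseudo]))
```

The confidence threshold is the filter: ambiguous samples near the decision boundary never enter the training set.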
So consider the following two systems: here's an animation of a C. elegans embryo in early embryogenesis, and this is a Drosophila embryo. Of course, they're very different in terms of the processes going on here, but if we just threw these into a machine learning algorithm, we may or may not be able to really describe them or make predictions. So this is a nice example, because things are moving around.
They don't have very clear boundaries a lot of the time, and there are no labels, at least not in these images. So how do you deal with those using the strategies I just showed? It's just something to think about. I mostly came up with some questions for application to developmental biology, and I think we've covered a lot of these, but just keep in mind that we don't have good training sets for developmental biology. So what would a good training set look like?
What properties should it have? I'll just leave you with that thought, because I think that's probably the most important thing to remember about this: on the one hand, we have very good methods for machine learning that are emerging, but they're not really applied to developmental biology so much. That's why I kind of created this group, so we can think through these issues.
I think, as in the data augmentation example, what people are using these for are, of course, distinct classes. So people are thinking of their data as having discrete states, and that's an assumption that may or may not be true, but you have these distinct things that you want to identify, and the idea is that you don't want to miss things.
They may be rare, but the idea is that the machine doesn't know whether something is rare or not; it just knows whether it sees it, and it needs to know how to classify it. So if something is rare and it comes up in the data, the machine is of course going to misidentify it, because it doesn't really know what the correct label should be. So in a case like that, you might want to have more instances of outliers.
You might want to give the model a lot of instances of outliers so it can identify those. But the idea is that these things exist in discrete states, and that's kind of the problem too: you have a lot of things, maybe, in biology that are intermediate states, like when things are changing, and that's a problem.
I mean, this of course applies to mathematical modeling as well, right? If you come up with a model and you say, well, this is the way we think it works, sometimes it's just not a very good model. So I guess it's hard to say. Basically, with regulatory stuff, people have been doing work with high-throughput data, which is just gobs of data, and they've thrown it at a model and gotten outcomes.
Yeah, I mean, I don't know; there are no established methods, so no one's really published a paper saying this is how you would verify this model. I guess one way would just be to try different models, maybe try something where you know that something is doing some sort of regulation. Okay, yeah, so, for example:
there was a paper on the French flag approach as a model of developmental biology, and they misquoted some of the work. The way we're resolving this is that we're writing a joint paper with them, at the invitation of the editor of the journal, discussing the French flag model versus differentiation waves. Now, this paper is of course not going to be based on any kind of detailed machine learning, but it's an example of trying to discuss two entirely different models for the same thing.
One comment: if we're talking about predicting temporal events, we might need deep sequential models. That's another technique that wasn't covered in this talk. And we need machine learning models that can also do causal inference, exactly. So there are a lot of options for doing things like that. We talked about just putting data into the model; now we're talking about what you can do with the models themselves, because there's a back end to it.
Anyway, we can discuss that later and set it up. So again, if anyone wants to give a talk about something, or even share something that they've seen in the research community, please bring it to the meeting; we would welcome it. Let me know in advance and we'll put it on the agenda, and we can talk about it. So thanks for showing up, and I'm glad everyone enjoyed the discussion. If you have any questions, over.