From YouTube: Keynote: Overcoming Dataset Bias in Machine Learning, Kate Saenko (Boston University / MIT-IBM Watson AI Lab)
Description
Keynote: Overcoming Dataset Bias in Machine Learning
Kate Saenko, Boston University / MIT-IBM Watson AI Lab
OpenShift Commons Gathering on Data Science
January 28, 2021
https://commons.openshift.org/gatherings/OpenShift_Commons_Gathering_on_Data_Science.html
To find out more about OpenShift Commons, please visit: https://commons.openshift.org
So I'm going to start by talking about the success of AI and computer vision. Computer vision is an AI technology that can analyze visual scenes. You can see here an example of it applied to detecting cars, buses, and pedestrians in images, and it's quite good and getting better. Here's an example of computer vision for object detection in a different scene. We can also train computer vision models to classify other objects, maybe even cartoon characters, and we have quite accurate models for face recognition and emotion recognition. A lot of this is becoming a product; we're seeing computer vision being used as a product.
So what do I mean by dataset bias? Well, suppose you're training a model to recognize pedestrians and you collect a dataset that looks something like this. You train your neural network and it seems to work really well on held-out test data from the same kind of data that you collected. Now you deploy your model as a product on a car, but this car is in New England, whereas your training data was collected in California.
So immediately you see a very different visual domain, with different weather conditions, like snow, that you didn't have in your training data, because there isn't much snow in California. Pedestrians also look different, because they're wearing heavy coats and so on. All of a sudden, the model that worked really well on the source data you trained it on doesn't work so well anymore. We call this problem dataset bias.
We also call it domain shift. The problem of dataset bias is essentially that the training data looks different from the test data you're actually faced with; it's different in terms of the distribution of the data.
That's the more general way of putting it, but you might qualify it, for example, as the difference between the city you trained on and the new city you're testing in. Or the dataset bias could be that you trained on images collected from the web, whereas at test time you're getting images from a robot, which look different: different backgrounds, different lighting, and different poses.
This can also happen across cultures. Let's say you're classifying weddings and you trained on Western weddings, from Western cultures. If at test time you get an image of a wedding from a different culture, your classifier will not generalize very well; it won't be able to recognize it. So there are lots of different ways that dataset bias can arise.
That's my point. Now let's look at what this actually means in terms of the accuracy of a machine learning model. Here is a very simple example on the very famous dataset called MNIST. Everyone knows what MNIST is: it's just ten handwritten digits. If we train on this dataset, we know that with modern deep learning we can get very high accuracy, more than 99%.
However, suppose we train on the same ten digit classes but our training data looks like this; this is the Street View House Numbers (SVHN) dataset. Now this model, tested on the MNIST dataset, achieves much lower performance, 67% accuracy, which is really bad for this problem. And by the way, this happens even when the dataset bias is not as extreme.
For example, if we train on the USPS digits, which to the human eye look quite similar to MNIST, the bias in the data still leads to a similar drop in performance. And if you're curious, if we swap and train on MNIST and test on USPS, we get similarly poor performance. So that's just an example of how dataset bias can affect accuracy, even in a simple case like digit classification.
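To make that kind of experiment concrete, here is a minimal sketch of a cross-dataset evaluation in PyTorch. The small CNN, the single training epoch, and the preprocessing are illustrative assumptions, not the exact setup behind the numbers quoted above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Put both datasets into a common 32x32 grayscale format (an assumption for this sketch).
tf = transforms.Compose([transforms.Grayscale(), transforms.Resize((32, 32)), transforms.ToTensor()])
svhn_train = DataLoader(datasets.SVHN("data", split="train", download=True, transform=tf),
                        batch_size=128, shuffle=True)
mnist_test = DataLoader(datasets.MNIST("data", train=False, download=True, transform=tf),
                        batch_size=256)

# A small digit classifier (placeholder architecture).
model = nn.Sequential(
    nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(64 * 5 * 5, 10),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Train on the source domain (SVHN) only.
for x, y in svhn_train:
    opt.zero_grad()
    nn.functional.cross_entropy(model(x), y).backward()
    opt.step()

# Evaluate on the target domain (MNIST): accuracy falls well below the same-domain number.
correct = total = 0
with torch.no_grad():
    for x, y in mnist_test:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
print(f"SVHN -> MNIST accuracy: {correct / total:.2%}")
```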
Okay, now what about real-world implications of dataset bias? Have we seen this in the real world? Well, yes, I believe we have. One example that's quite famous now is face recognition, or gender classification, where some researchers have actually evaluated existing commercial systems, from Amazon, from IBM, from other companies: how well do they work, and what accuracy do they achieve on different demographics?
If we don't have pedestrians outside of crosswalks, why not just collect more data like that? Well, there are a few problems with that. The first is that some types of events just might be rare, like jaywalking pedestrians; they might be very rare events, and we don't necessarily want to force people to jaywalk so that we can collect more data.
That's one problem, but another really big problem is the cost of data collection. Imagine that we wanted to label images from cars; that's the example you see here, from the Berkeley BDD dataset. Labeling 1,000 pedestrians with the per-pixel segmentation labels you see here, where the labeler has to identify each pixel that belongs to a pedestrian, is quite expensive: it costs roughly a thousand dollars per 1,000 pedestrians.
And now, if you imagine the sheer variety of visual data that we would want to cover in our dataset, we want multiple poses, multiple genders, ages, races, clothing styles, and so on, and somewhere in there we want people riding bicycles, or not riding bicycles, or maybe riding tricycles. If you think about how many different factors of variation we would have to cover, this very quickly becomes untenable; it's just too expensive to collect labeled data that is balanced across all of these factors of variation.
So what actually causes the poor performance? You might be wondering about that as well: can't my deep learning algorithm just get better? Maybe I just need a better algorithm that will generalize and do better on test data. Well, there are a couple of problems caused by dataset bias that current models cannot handle.
The blue points and the red points are from these two digit domains, and you can see what happens when we visualize this data. We do this by extracting features from these images using the deep learning model that we trained, and then plotting them in a t-SNE visualization. This is what we get: you can see that, clearly, the distribution of the training (blue) points is very different from the distribution of the test points, and this is a theoretical problem.
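Here is a minimal sketch of the kind of visualization being described. It reuses the `model`, `svhn_train`, and `mnist_test` names from the earlier sketch, so all of those (and the t-SNE settings) are assumptions for illustration.

```python
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Everything except the final linear layer serves as the feature extractor.
feature_extractor = nn.Sequential(*list(model.children())[:-1])

def extract_features(loader, max_batches=10):
    """Collect penultimate-layer features for a subset of a dataset."""
    feats = []
    feature_extractor.eval()
    with torch.no_grad():
        for i, (x, _) in enumerate(loader):
            if i >= max_batches:
                break
            feats.append(feature_extractor(x).numpy())
    return np.concatenate(feats)

src = extract_features(svhn_train)   # source features (the "blue points")
tgt = extract_features(mnist_test)   # target features (the "red points")

# Embed both sets jointly in 2-D and plot them to see the domain gap.
emb = TSNE(n_components=2, init="pca").fit_transform(np.vstack([src, tgt]))
plt.scatter(emb[:len(src), 0], emb[:len(src), 1], s=4, c="tab:blue", label="source (train)")
plt.scatter(emb[len(src):, 0], emb[len(src):, 1], s=4, c="tab:red", label="target (test)")
plt.legend()
plt.title("Feature distributions before adaptation")
plt.show()
```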
In fact, when these distributions are different, we can show that there is a theoretical bound on how well our model will generalize. Another problem is that a model trained on the blue points is not as discriminative: the features it learned are not as discriminative for the target (red) domain. You can see that because the blue points are much better clustered into different categories than the red points, so you may simply not be learning good features for these test points, the target domain. Fortunately, there are quite a few techniques that we can use to alleviate this.
I've listed a bunch here. What I want to talk about today is the technique of domain adaptation, but there's always data augmentation, and there's always using something like batch normalization; some of these techniques can help in the case of dataset bias. But let's talk about domain adaptation. In domain adaptation, we design a new machine learning approach that tries to adapt the knowledge from the labeled source data to the unlabeled target domain. So our goal here is to learn a classifier that achieves a low expected loss under the target distribution.
And importantly, here we assume that we have a lot of labeled data in the source domain, but we also get to see unlabeled data from our target domain. We just don't get to see the labels, because labels are expensive to collect. So we assume that we at least get to see some unlabeled data from the target domain. So what can we do?
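Written out, the unsupervised domain adaptation setup described here looks like the following (a standard formulation of the problem, not tied to any particular paper):

```latex
% Labeled source sample, unlabeled target sample, and the target risk we want to be low.
\[
  S = \{(x_i^{s},\, y_i^{s})\}_{i=1}^{n_s} \sim \mathcal{D}_S, \qquad
  T = \{x_j^{t}\}_{j=1}^{n_t} \sim \mathcal{D}_T \ \text{(no labels)},
\]
\[
  \min_{\theta}\; \epsilon_T(f_\theta)
  \;=\; \min_{\theta}\; \mathbb{E}_{(x,y)\sim\mathcal{D}_T}\big[\ell(f_\theta(x),\, y)\big].
\]
```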
We always use convolutional networks, and we have some training data with labels. If we train this using the regular classifier loss, we can generate features from our encoder CNN (here I'm just showing it for two classes, for clarity), and the last layer will be our classifier layer, so we can visualize the decision boundary that it learns between one class and the other class.
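As a sketch of that setup, with an assumed architecture and assumed names: the network is split into an encoder CNN that produces features and a final classifier layer, trained with the ordinary cross-entropy loss on labeled source batches.

```python
import torch
import torch.nn as nn

# Encoder CNN: maps an image to a feature vector (placeholder architecture for 32x32 input).
encoder = nn.Sequential(
    nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
)
# Last layer: the classifier whose decision boundary separates the classes.
classifier = nn.Linear(64 * 5 * 5, 10)

opt = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)

def source_step(x_s, y_s):
    """One supervised training step on a labeled source batch."""
    opt.zero_grad()
    loss = nn.functional.cross_entropy(classifier(encoder(x_s)), y_s)
    loss.backward()
    opt.step()
    return loss.item()
```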
Now suppose we also get to see some unlabeled data from our target domain. Let's say we put a camera on the robot and it can explore its environment and snap some photos; now it has some data, it's just not labeled. If we apply the encoder CNN trained on the source directly to this data, we already know that we'll see a dataset shift like this: the distribution of the target points will be shifted with respect to the distribution of the source (blue) points. And so we turn to adversarial domain alignment.
Essentially, we then iterate between the domain discriminator trying to separate the two distributions and, in the next step, updating the encoder in such a way that it fools the discriminator, so the discriminator's accuracy goes down. In the process, the encoder learns to align the two distributions, so that if everything goes well, the discriminator can no longer tell the difference between the domains and the features have become essentially domain-invariant. So that's adversarial alignment, and here's an example of it working on the two digit domains that I showed you earlier.
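Here is a minimal sketch of that alternating game, roughly in the spirit of adversarial methods such as DANN or ADDA rather than any one specific paper. It reuses the `encoder`, `classifier`, and `opt` from the previous sketch, and the discriminator architecture and loss weights are assumptions.

```python
import torch
import torch.nn as nn

# Domain discriminator: predicts whether a feature vector came from the source (1) or target (0).
discriminator = nn.Sequential(nn.Linear(64 * 5 * 5, 256), nn.ReLU(), nn.Linear(256, 1))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.functional.binary_cross_entropy_with_logits

def adaptation_step(x_s, y_s, x_t):
    # Step 1: train the discriminator to separate source features from target features.
    with torch.no_grad():
        f_s, f_t = encoder(x_s), encoder(x_t)
    d_loss = bce(discriminator(f_s), torch.ones(len(f_s), 1)) + \
             bce(discriminator(f_t), torch.zeros(len(f_t), 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Step 2: update the encoder (and classifier) to stay accurate on labeled source data
    # while fooling the discriminator on target features (labeling them as "source").
    # Only `opt` steps here, so the discriminator's own weights are not changed in this step.
    cls_loss = nn.functional.cross_entropy(classifier(encoder(x_s)), y_s)
    fool_loss = bce(discriminator(encoder(x_t)), torch.ones(len(x_t), 1))
    opt.zero_grad()
    (cls_loss + fool_loss).backward()
    opt.step()
    return d_loss.item(), cls_loss.item()
```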
You can see that, in fact, after adaptation with adversarial alignment, the two distributions of the red and the blue points have now been aligned almost perfectly. And classification accuracy also goes up considerably, so it's not just that the distributions are aligned; it actually does improve classification accuracy.
So the advantage is that once we've managed to train this generative adversarial network that can translate from the source to the target domain, we now have data that looks like it came from the target domain but has labels, because the original data is from the source; it's labeled with the categories that we need for training. And by the way, we can still add feature-space alignment to this overall architecture, and in fact we have experimented with that in our paper, which is cited at the bottom.
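As a small usage sketch of that idea: assuming a source-to-target image translator `G_s2t` has already been trained (the name is hypothetical, standing in for whatever pixel-level GAN is used), the translated images can be fed to the same task model from the earlier sketches with their original source labels.

```python
def translated_source_step(x_s, y_s):
    """Train on source images rendered in the target-domain style, keeping the source labels."""
    with torch.no_grad():
        x_fake_t = G_s2t(x_s)  # hypothetical generator: source image -> target-style image
    opt.zero_grad()
    loss = nn.functional.cross_entropy(classifier(encoder(x_fake_t)), y_s)
    loss.backward()
    opt.step()
    return loss.item()
```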
Okay, well, that's great; this pixel-space alignment seems pretty neat. But so far we've been assuming that we have unlabeled target data. In fact, what I didn't tell you is that, in order for that method to work, it needed to see quite a lot of unlabeled data from the target domain. But what if we only get one image, or a couple of images, from our target domain?
So instead, what we need to do is take a source-domain image, which is essentially our content, and translate it to a new visual domain for which we only have, let's say, one example. In this example, our content is a dog: we want to preserve the pose of the dog, but we want to change the style, or the domain, of the dog into this other breed. Unfortunately, I don't know what breed of dog it is.
We can take a look and see that we're able to do this using just a few, sometimes just one (we've tried one or a couple) images of the target domain, where here the domain is the breed of the animal. So we can change it such that the pose stays the same as in the content image, but the breed is taken from the style image.
So you can see that this is working quite well, and if you're curious, compared to the previous approach that we're building on, which is called UNIT, we're actually improving on it quite a bit. As you can see, UNIT is not able to translate images using just a single style image; it generates fairly poor results in this case. And on average, when we evaluate on a large dataset, we also see a significant gain using our COCO approach.
One other example that I want to show you really quickly is using this idea of adaptation in robotics. Here we have a robot that's trying to insert an object into another object, let's say a peg into a hole, or, more generally, we can apply this to other manipulation tasks. Our input data is coming from a depth sensor, so it looks like this: there's an RGB image, but what we're actually using is the depth image, which you can see in the middle here. But to train...
You can see here an example: a real depth image, then a similar simulated image, and then the last one, where we take the real image and translate it into the simulated domain. You can see that it now looks a lot more like the simulated data, so we're closing this domain gap. Okay, great, so I'm going to wrap up here. Just to recap what I talked about: dataset bias is a pretty major problem for machine learning in general, but specifically for computer vision, which is mostly what I work on.
I also think, and we could discuss this after if we have time, that there are even more general ethical issues related to datasets. For example, recently there was a paper, which is generating quite a lot of interest, that looks at the dangers of large language models and points out that language models are being trained on progressively larger and larger datasets.
So it's almost the opposite of the problem that I talked about, where we have a huge dataset that we're training on, and the problem they point out is that this dataset might contain all kinds of bad data, like offensive data or even private data, and by training the model on it, we don't know what kinds of biases or undesirable things it's learning.
So that's a related but different ethical issue. This paper, by the way: one of the co-authors is Timnit Gebru, whom you might have heard of; she was actually forced to leave Google over this specific paper.
So there are quite a few ethical issues, and I'm happy to discuss those, or anything related to what I talked about. Thank you very much for your attention.