From YouTube: DevoWorm ML: Week 12 (Reinforcement Learning)
Description
Twelfth DevoWormML meeting, November 20. Attendees: Richard Gordon, Ujjwal Singh, Vinay Varma, Bradly Alicea and Jesse Parent
So we had a request that we move the meeting time to Mondays. At the same time, we moved it up one hour for Vinay, but [inaudible] schedule doesn't allow him to meet an hour earlier. So now I'm asking if people would be okay if we moved the meeting to Mondays at the same time, so it would be just a change in day.
Yes, of course. [inaudible] they'll choose an organization, and then it's much the same process. First they'll choose an organization they're interested in, then they'll contact members of the organization, the administrators, and then they'll have to get tasks assigned to themselves.
Well, that sounds pretty good, and I think that's a good opportunity for you. You know, it's always good to learn how to teach people things. That's always a good skill, because you can impart knowledge to people who otherwise wouldn't have the opportunity to get exposure. I mean, you can read things on the internet, but it's always better to have someone help you along with things and give you advice, so I think that's good. Keep us posted on that; that sounds interesting.
So that's good, thanks for the updates. If there's nothing else that anyone wanted to talk about, we can go into the presentation I've prepared here on reinforcement learning. I was thinking about last week: I did a presentation on game theory, and I think Jesse was here for it, and I recorded it and it's on the YouTube channel. It kind of leads into this, because we talked about games and competition and how that's being used in machine learning.
It's actually a very interesting area, especially when you get down to how they're using game theory, basically, in lieu of an optimization function. A lot of algorithms use loss functions, and so they're using game theory as a sort of stand-in for that. The reason they do that is because you have a lot of non-convex spaces, meaning they're not smooth curves that you can hill-climb or optimize easily. So you have a lot of problems with what they call a non-convex space, which is very irregular and not easy to optimize, and so they play games between agents and look for what they call Nash equilibria. That's actually very hard to do in mathematical terms, so yeah, it's sort of like a rough landscape.
It's analogous to that, so they're using Nash equilibria to find these points that are, you know, solutions that will correspond with minimum loss. But the thing is, it's very hard to even do that mathematical analysis, so we're talking about something that isn't really mathematically tractable a lot of the time. So people are applying these models to get a good [inaudible].
So this is reinforcement learning, and along the way we're going to kind of drift into biology and psychobiology, so if you see references to that, it's related. This is what I mean: reinforcement learning is basically learning a process with a reward, and we most often associate this type of thing with animal behavior, but human behavior follows this as well. It's been demonstrated in animals like dogs and in rodents,
where you have two kinds of conditioning. You have classical conditioning, where you associate an involuntary response with a stimulus: there's an idea of pairing these two stimuli and then taking one away, and you're motivated by the stimulus that's been paired. So, for example, in the case of Pavlov's dog, you blow a whistle and present food to the dog, and the dog sees the food and hears the whistle and associates the two. So the dog starts to salivate at the food, but also at the whistle.
Then, after you train the dog on that paired stimulus for a while, you take the food away and just blow the whistle, and in that case the dog will still salivate, expecting the food reward as well. So you can assert that the organism associates the two stimuli, and even when you take one away, it still maintains that association. Operant conditioning, which is related, is associating some voluntary behavior with a consequence. So in this case you have a rat that pushes a lever, and this machine here dispenses food.
The rat will then learn that if it presses the lever, food comes out. Now, if food doesn't come out and they press the lever, they'll still associate it with food: the rat might go up to a machine that's empty and press the lever expecting a food reward, and even if food doesn't come out, they'll still press the lever, because they're associating those two things. This happens over time, so this is a time-dependent process. This is a diagram of what they call trials; this is when they do this training.
It's an experimental context where they present these things in successive trials. So you present these two stimuli in one trial, two trials, three trials, in this case up to 20 trials. In the first trial we have a bell, and then we have food and salivation, and by the 20th trial salivation has gone from an unconditioned response to a conditioned response. So you're actually conditioning salivation on the ringing of this bell rather than the presentation of food, and so this has a long history in psychology.
This goes back to the 1890s with Pavlov, and he did these experiments with dogs where he associated the ringing of a bell with the presentation of food and was able to condition the dog's brain on these items. Then there was Thorndike, who came up with the law of effect, and you can read more on this; this is just to give you an idea of the history of this in psychology, and there are some famous experiments in here. The next person to really do some groundbreaking work in this area was Skinner, and you've probably heard of B.F. Skinner.
He was the one who demonstrated operant conditioning, so this is pressing the lever and getting food. Then there's Albert Bandura, who was more recent, who came up with something called social learning theory, which is based on reinforcement learning but also in a social context. So this has a long history in psychology, and this person, Richard Sutton, is actually known for establishing this in artificial intelligence and machine learning.
Now, Richard Sutton actually has a background in psychology, so he came to computer science and wrote a dissertation called Temporal Credit Assignment in Reinforcement Learning, back in 1984. Richard Sutton and Andrew Barto, his doctoral adviser, have written this textbook called Reinforcement Learning, and this is sort of the landmark book of the field. It's actually available online for free through this link here; it's like a continually revised rough draft of the book, but it's free and you can check it out online. They've been writing this book for a while;
I think the first edition was in 1998, and the edition that you can download is from this year, so they're updating it constantly with new examples. Wikipedia defines this type of reinforcement learning as how computational agents take actions in an environment so as to maximize some notion of reward. So this is a little bit different from some of the things that people do in behavioral psychology. You have to set up a reward system, and you have to set up
how you're going to present that reward over time and how it optimizes learning by the algorithm. So this is the basic setup of a reinforcement learning algorithm: you have an agent up here, and it takes an action, a sub t, and it interacts with its environment. The agent is embedded in an environment, it takes an action in it, and then it displays a behavioral state, s sub t, and then there's also a reward, r sub t.
So every action you take results in some reward, and that reward-action coupling is a state. To change the state, you would take maybe a different action, but to take a different action you need a reward structure that enables you to do that. So the reward is sort of like feedback for previous interactions: your state is your current state, and your next action, then, is dependent on the reward that you got for the previous action.
So if you're doing X, and you have a reward structure that you kind of learn through doing these different actions, your state might move from X to Y, because your actions change based on the reward. So this is definitely a sort of feedback system: you have a current state, you have a current action, you get rewarded, and then maybe you move to a new state based on that reward structure.
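As a minimal sketch of that state-action-reward loop in code (the environment and policy here are hypothetical stand-ins, not from the talk or any particular library):

```python
import random

# Hypothetical two-state environment: taking the "matching" action in each
# state yields a positive reward. Purely illustrative, not a real benchmark.
def step(state, action):
    reward = 1.0 if action == state % 2 else -1.0
    next_state = (state + 1) % 2
    return next_state, reward

def random_policy(state):
    return random.choice([0, 1])      # two possible actions

state = 0
total_reward = 0.0
for t in range(10):                   # one short episode
    action = random_policy(state)         # a_t
    state, reward = step(state, action)   # s_{t+1}, r_t
    total_reward += reward
print("return:", total_reward)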
This means that you have something called a policy, which maps between the agent's states and its actions. So, for example, you design a policy to say: okay, there's this desirable state that we want. The agents all start in some initial state, which isn't really biased towards anything; we want to give them a range of actions that they can take, but also rewards that will guide them to the right, or desired, state. And so the value, then, is the future reward for potential actions.
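In the standard Sutton-and-Barto notation, which the talk is paraphrasing, the policy and the value of a state are usually written as:

```latex
\pi(a \mid s) = \Pr(A_t = a \mid S_t = s),
\qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right]
```

That is, the policy gives the probability of taking action a in state s, and the value is the expected discounted future reward starting from s under that policy, with discount factor gamma.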
There's supervised learning, where you present the algorithm with identifiers that might be useful in classifying things. Then there's unsupervised learning, which is where you present the algorithm with data blindly, with no identifiers or anything, and expect the algorithm to sort it out and put it into categories. So that would be like cluster analysis, where you're just saying: produce some clusters, and we'll see if they're accurate or not. Whereas in the supervised case, you might create categories and say, put these in the right category, and then check later to see
if that happens. Reinforcement learning is sort of its own beast, and it involves learning from mistakes: to get a correct classification out of reinforcement learning, you have to train the algorithm on a bunch of mistakes, and then it learns from those mistakes and arrives at the right classification in time. I took this image from an article, Reinforcement Learning 101, from Towards Data Science, and they have more information about it in that blog post. I'll make these slides available online, as always, so
you can get the links. There are different varieties of reinforcement learning as well. We have classical reinforcement learning, where we just have the reward-state-action structure and there's maybe interaction between you and the computer. Then there's deep reinforcement learning, where the agent, instead of being, say, an avatar driven by a person, is a deep neural network or something like that.
So in deep RL they're using a deep neural network as the agent, and they're training the agent on something. This deep neural network is embedded in an environment, and there's of course this reward, and in this case observations, where the world is observed; there's an action, and then there's a reward. And so you can use deep learning.
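To make "the agent is a deep neural network" concrete, here is a minimal value network sketched in PyTorch; the layer sizes and architecture are illustrative assumptions, not the network from any paper discussed here:

```python
import torch
import torch.nn as nn

# Minimal sketch: a small network mapping an observation vector to one
# estimated action value per possible action (generic, illustrative only).
class QNetwork(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

q = QNetwork(obs_dim=8, n_actions=4)
obs = torch.randn(1, 8)               # one observation from the environment
action = q(obs).argmax(dim=1).item()  # greedy action under current estimates
```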
You can also use something called model-free RL, which involves Q-learning, which is an algorithm that I'm not going to get into today. But this is a way of using different update strategies for the optimization process. Q-learning is basically that same structure of iterative learning, but you have a certain set of weights that you're using to weight evidence that occurs at a certain time, or further away in time. It doesn't imply any sort of a priori model; it's something that you can do sort of dynamically, applying the Q-learning algorithm to the data that way.
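For reference, the standard tabular Q-learning update (with the usual learning rate alpha and discount gamma; the values below are illustrative choices) looks like this:

```python
from collections import defaultdict

# Tabular Q-learning: nudge Q(s, a) toward the observed reward plus the
# discounted best estimate for the next state.
alpha, gamma = 0.1, 0.9
Q = defaultdict(float)   # maps (state, action) -> estimated value

def q_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Example: one transition where action 1 in state 0 earned reward 1.0
q_update(s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
```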
So one example of reinforcement learning, which is exciting, is where they're using it to look at video games like this. There are a couple of examples I put in this talk. The first one is this paper, Playing Atari with Deep Reinforcement Learning. This is an example where they've taken some of the old Atari games, and they use these games, of course, because their complexity is fairly low in terms of the graphics and the number of moves you can make, but they're still challenging for the algorithm.
So in this case, you have this screenshot here; the name of this game is Seaquest, where you're moving a submarine around and you're trying to avoid being shot at by other submarines, and there's a reward structure in this game to maximize your point total. This algorithm has been applied to this game. Now, normally humans play this game: they're evading obstacles and making their targets and trying to maximize their score.
Well, the algorithm does the same thing. In this paper they actually tested the algorithm on Seaquest and another game called Breakout, and I think I have a screenshot of Breakout later in the talk. Basically, Breakout is also a game of this complexity, where you're bouncing a ball off a paddle and you're trying to remove bricks from a ceiling that opens up, and when you win the game you break out of this cave or whatever it is you're in. So it's pretty simple in
the number of moves and the number of states that you can be in, and that's why they use these games. And they show you how this works: we have training epochs here, so they've trained it over a hundred epochs, and then they measure the average reward per episode. These epochs, I assume, are game plays, so they've
been training it over and over again, using the same algorithm but presenting it with the game over and over, and the algorithm keeps playing the game. You can see that the reward, probably measured through the score or something like that, goes up over time: between zero and fifty epochs it really starts to maximize its reward, and then after about 50 epochs it starts to plateau, and the reward, for Breakout at least, hits an asymptote there. It's kind of a logarithmic learning curve. On Seaquest,
you see it goes up to maybe 50, 60, or 70 epochs, where it's learning and maximizing its reward, and then you start to see some instability in the algorithm: if you keep training it longer and longer, it starts exploring new strategies, perhaps, and it's a little bit more uneven in its performance. But you can also look at this in terms of Q-learning.
So this is the Q score, which is the value in this Q-learning algorithm, and in this case the results are a little bit stronger: you have the same logarithmic pattern of learning, but it doesn't show this sort of variation in terms of the reward structure, so the Q value actually is maximized all the way up to a hundred epochs. But it's logarithmic,
so you spend probably epochs zero to fifty really training the algorithm, and after 50 it's still increasing, but only nominally, as it's really kind of learned how to play this game. So this was a big advance when it was announced in 2013.
This is using deep reinforcement learning, so they're using deep learning to do this sort of thing, and this is totally autonomous: the algorithm is getting no input from the human, it's just playing the game on its own. So this isn't unprecedented; they've been trying to get artificial intelligence to play games for a long time. In 1997, IBM's Deep Blue beat Garry Kasparov,
who was the world champion at chess at the time, and he got so angry that he decided he was going to create his own type of chess, a hybrid chess where he used human experts and algorithms together to develop optimal chess styles. I don't want to talk too much about that in this talk, but suffice it to say that this is a classic sort of problem that people have used to try to benchmark algorithms. And so, in this case, they used AlphaGo to play the game Go against
the player who was the world champion at the time. Go is like chess, but it's sort of a variant of chess; they play it a lot in China, and it's sort of the computational analogue of chess, but it's a little bit different. In this case, this group published a paper in Nature where they were able to use a combination of deep neural networks and tree search to beat the world champion at Go, and so this was another landmark. That just shows you what these algorithms are capable of.
And finally, we have this new paper, Human-Level Performance in 3D Multiplayer Games with Population-Based Reinforcement Learning. So this is reinforcement learning using a number of different strategies: using a population of agents and presenting the algorithm with information from a game. I think it's some sort of first-person game; it's much more complex than the Atari games.
If someone wanted to read it and comment on it, that would be great, but anyway, this is yet another type of approach to reinforcement learning, and this was published in Science recently.
So there are different ways that this is done. Let's look at how reinforcement learning is implemented in machine learning versus classical conditioning, and then even synaptic plasticity, which is in the brain itself. Classical conditioning happens in sort of a behavioral state, but it also involves a number of brain regions.
So it involves things like Hebbian learning, and it involves neuronal reward systems like the basal ganglia, which is a part of the brain that responds to rewards. This is a very complicated diagram; we took it from the Scholarpedia site, but it sort of shows the different things that are involved in, say, the machine learning implementation of reinforcement learning versus the classical conditioning form of it, which is a combination of, you know, updating behavioral states
(though of course the brain is doing a lot of work in that), and then synaptic plasticity, which is really kind of separate from the machine instance, because it involves a lot of signaling between different neurochemicals, producing states like long-term potentiation and short-term potentiation and things like that. So that's how it's all related in terms of the biology versus the machine learning. But of course we expect
reinforcement learning to behave like a human brain and make decisions like a human, or even like some sort of animal; that's the whole logic behind it. And so, on to related issues and topics. Now I want to move on to a little bit harder detail on these models. The first thing I talked about before was the policy gradient, and that's sort of the core of this type of algorithm, so let's walk through it a little bit.
I took this from a Medium post here, so you can look that up in more detail if you want. But let's walk through it. Consider what they call a policy, which is denoted by the symbol pi and described by an action given an initial state. So we have our notation here: an action, of course, is just something that the algorithm does, and it's given an initial state that it starts out at. The objective, then, is to find policy parameters theta that create a trajectory which yields maximal expected rewards.
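Written out in the standard policy-gradient notation (this formulation is standard, not quoted from the slide), the objective and its REINFORCE-style gradient are:

```latex
J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t=0}^{T} r_t\right],
\qquad
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, G_t\right]
```

Here tau is a trajectory sampled from the policy pi_theta, T is the time horizon, and G_t is the return following time t; adjusting theta along this gradient makes high-return trajectories more probable.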
You could make it random, but generally there are behaviors that are more likely and less likely, behaviors that are rewarded or not rewarded; there are more accessible states and less accessible states, and then, depending on the rewards, you might end up finding the more successful states. I'm trying to think of a good example from human behavior.
But let's say you had a program where there is some behavior that's really hard for people to attain, like stopping smoking: it isn't really easy for someone who's smoking to just stop smoking.
So you give the person a reward if they don't smoke for a week, and you use these sorts of rewards, positive and negative, to kind of shape the behavior towards a more desirable state, one that maybe is less accessible at the beginning, but that's what you would use to attain that behavior. So it's a combination of the trajectory of behaviors towards the desired state and the reward structure. The trajectories extend to some time horizon, so we don't want to project it infinitely out into time. We generally think:
okay, we want to achieve this overall change, this behavioral optimization, within a certain time horizon. You saw with one of the game examples that they measured it for a hundred epochs, so that's a finite time horizon, and, as you saw, it wasn't all really needed: about 50 to 70 epochs really maximized the reward in training on this trajectory. The state itself can consist of either specific or generalized features,
maybe cells, for example. Now, the machine has no understanding of what cells are initially, but you would train the machine: okay, if it picks something that's ovoid or circular, then that's a reward for the algorithm; if it picks something that's lobed, then, depending on the cell type, that might be rewarded or given a negative reward as well. And so there are ways you can optimize this so that the reward structure sort of
A
You
know,
and
it
rolls
down
to
the
features
that
you
want
and
their
properties
and
the
policy
objective
then
can
be
either
stochastic
or
a
series
of
directed
actions.
Like
I
said
you
can
start
off
with
a
stochastic
action,
but
just
picking
an
action
out
of
hat
or
you
can
direct
the
algorithm
to
certain
actions,
a
certain
subset
of
actions
and
train
it.
That
way
really
depends
on
how
the
algorithm
is
implemented,
and
that
also
determines
the
effectiveness
of
the
policy.
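As a toy illustration of that kind of feature-based reward structure (everything here is hypothetical: the feature names, values, and reward magnitudes are invented for this sketch, not taken from any DevoWorm pipeline):

```python
# Hypothetical reward shaping for a cell-identification agent: reward picks
# of round(ish) objects, penalize picks that don't match the target cell
# type. Feature names and values are made up for illustration.
def reward(picked):
    if picked["shape"] in ("ovoid", "circular"):
        return 1.0
    if picked["shape"] == "lobed" and picked["cell_type"] == "target":
        return 0.5        # partially rewarded, depending on the cell type
    return -1.0           # negative reward: steer the policy away

print(reward({"shape": "ovoid", "cell_type": "target"}))   # 1.0
print(reward({"shape": "lobed", "cell_type": "other"}))    # -1.0
```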
So, given a sequence of behavioral states and rewards: this is the signal, this is the state actually, and then this is the reward, and you keep doing this iteratively. You have a state, you have a reward; the state might change as you get rewarded positively or negatively, and you end up in a final state where you have a bunch of rewards that have shaped you to that point.
So the sequence of behavioral states and rewards produces an action policy, and this is termed the state value function, which is the expected return of a strategy. So your strategy should have some sort of return to it, and of course the return should be optimal, but not all policies are optimal. Some policies are pretty bad, because you might actually get the policy from, say, observing the data set and figuring out what certain rules might be.
You might see an unsupervised or unlabeled data set and try to extract some statistical features from it, and use those statistical features to shape the reward structure of your policy. But you might have a number of competing policies where you don't really know what the actual payoff is: you think they're all optimal, but maybe only some are. So you want to try different policies and you want to evaluate them, and temporal difference learning allows you to do this,
where you have this iterative structure: you have the state and you have the reward, and this structure is time-dependent, and then you have this parameter, which is a discount factor. The discounts are negative weights, so you can negatively weight things that are further into the future. That way you can have some sort of open-loop control where you reward things early on, and then the reward becomes less salient as you move on, as you approach the desired state.
This is to sort of avoid overfitting, where you don't want to over-reward the algorithm for seeking out new states. Or maybe you have a target state that you want to achieve, but you reward it too much, and so it starts to explore different states again. I mean, there are different ways this can play out. You can do this
with open-loop control, which is using this sort of weighting scheme, and then you can also have a sort of what they call the neuronal version, which is how it sort of happens in the brain. You can actually apply this to machines as well, where you have a neuron called V that can predict a reward R, and this updates adaptively until the algorithm, or the behavior, converges. So this is a closed-loop system.
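A minimal sketch of that converging value prediction, written as a TD(0) update in Python (the learning rate alpha, discount gamma, and the transition are illustrative values):

```python
# TD(0): the value estimate V(s) is nudged toward the observed reward plus
# the discounted estimate for the next state, until predictions converge.
alpha, gamma = 0.1, 0.9
V = {0: 0.0, 1: 0.0}            # value estimates for two states

def td_update(s, r, s_next):
    td_error = r + gamma * V[s_next] - V[s]   # prediction-error signal
    V[s] += alpha * td_error

# Repeatedly observing the same transition drives V[0] toward r + gamma*V[1]
for _ in range(100):
    td_update(s=0, r=1.0, s_next=1)
print(V[0])   # approaches 1.0 here, since V[1] stays 0
```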
So this is kind of related to what I was talking about with overfitting of your model, or underfitting for that matter. You have this idea that the algorithm can exploit different areas of the state space: if you want to change behavior, you have to have a state space with the different behavioral states that are possible, but you don't want to explore every state.
You want to be able to find an optimal state, but you also want to explore enough states so that you find the optimal state that you desire. On the left is an example of this trade-off in terms of the amount of information versus the return, and in this case it's figuring out whether the sky is blue or not by asking people. So let's suppose you're blind or you have a blindfold on and you don't know the sky is blue, so you start asking people whether the sky is blue.
If you ask one person, which is a very small amount of information, whether the sky is blue and they say yes, that's a pretty high return on your investment, but of course you don't know if they're lying to you, or if they can't see either. So you might ask ten people whether the sky is blue.
Now, there's less information per answer in asking ten people, and the return is also a little bit less as well, because when you ask ten people you'll get a lot of redundancy; but, on the other hand, you'll get an average, and you can tell that way. If 50 people tell you the sky is blue, that reduces your amount of information per answer, and it also reduces your return on investment, since you're asking more people. And if 2,000 people told you the sky is blue,
your amount of information is very low relative to your return, and that's assuming that the number of people lying to you, or who don't truly know, is very low. So that's the idea behind this: you don't need to explore everything, but you do need to have some information, and the amount of new information decreases as you explore.
So another way to think of this is the one-armed, or the n-armed, bandit problem.
You're playing a slot machine, and you know your odds of winning are very low when pulling the lever, but you keep playing because you might win; it's the classic gambling problem. So the idea is that you're playing this game against nature and you're trying to get some sort of payoff from that.
But the idea is that you play one slot machine, and that's exploration: you're exploring that state space to see if you can get to a certain state, and since you're doing it randomly, it will take a long time to get there. But if you add a bunch of machines in parallel, which is the n-armed bandit, you can keep exploring by pulling a bunch of levers at the same time, and I think if you've ever visited a casino you'll see this.
In this case, it would allow you to explore a vast space in a very short amount of time, but the trade-off still exists: you can use as many bandits as you want, but just because you're playing an infinite number of bandits doesn't mean you have a better chance of getting to that point. So the n-armed bandit problem is something you should be aware of.
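A common minimal treatment of that explore-exploit trade-off is an epsilon-greedy bandit, sketched below (the arm payout probabilities and the value of epsilon are made-up values for illustration):

```python
import random

# Epsilon-greedy n-armed bandit: with probability epsilon explore a random
# arm, otherwise exploit the arm with the best estimated payoff so far.
payout_prob = [0.2, 0.5, 0.8]        # true (hidden) win rate for each arm
estimates = [0.0] * 3
counts = [0] * 3
epsilon = 0.1

for t in range(1000):
    if random.random() < epsilon:                       # explore
        arm = random.randrange(3)
    else:                                               # exploit
        arm = max(range(3), key=lambda a: estimates[a])
    reward = 1.0 if random.random() < payout_prob[arm] else 0.0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

print("estimated win rates:", [round(e, 2) for e in estimates])
```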
And then finally, you have multi-level optimization, where you can break the problem up into different levels, with different agents in those levels exploiting and exploring things differentially. So you can break the problem up into modules and explore the problem that way using your reinforcement learning algorithm, and each of these agents would be a single agent in an area. In a reinforcement learning context, each of these agents would have the ability to learn from rewards and then eventually get to the answer. And so those are all the slides I have.
For those of you who are interested in genetic algorithms, there is a link, of course, between this and genetic algorithms; I highlighted it in the last slide. There are definitely a lot of commonalities, both in terms of using a biological process to look at data and at problems related to search, and in some of the concepts that are used. So let's see what our chat window looks like here.
Well, I don't know if they've measured that too much. I know from some of the literature on training in the psychological literature that there are curves for human learning in games and in expertise. You see that same pattern, where you have this initial burst of learning, where you're learning the parameters of the game, and then there's a plateau where you're kind of learning the particulars of the game, but you've already kind of mastered the basic aspects of it.
And then Richard also asked a question about an application: the use of X-ray photons in computed tomography may cause cancer, so how can we keep the number of photons used to a minimum? I think that would be a good application. I don't know if people have done that; it sounds like they may have. People like to explore these types of problems.
[crosstalk] The question is: what is the minimum number of photons you need to achieve a sufficient image to do a diagnosis? There has been some progress there, but no one's looking at that question seriously. No, yeah, especially because most of the current algorithms are [inaudible] based, so you have to change the community.
Well, let me make just one point about that: computed tomography now produces the lion's share, by far the majority, of the dose that people get from medical X-rays. So yes, that's actually a serious problem, and there are many papers, at least, contending that multiple CT scans can cause more cancers than the first.
Okay, Vinay, no questions? I think this talk gave a very nice overview of reinforcement learning. Okay, yeah, I think it's definitely an interesting technique. I wanted to cover it because we've talked about machine learning a lot, but there are different sorts of versions of it, and it's actually become pretty popular, especially for applying it to games, but there are probably a lot of other techniques that can be
applied to deep reinforcement learning. You'll see it's very hot in the literature right now, with groups like OpenAI and DeepMind. Well, I just put that together; I was doing some reading on it, so I've actually become pretty adept at putting together talks and slide shows like this. So, okay, well, I guess we're near the top of the hour. Okay, let's see, otherwise, any comments, Jesse?
Yeah, okay, yeah. So he's at IIT Delhi, is that correct? And they're working on the self-driving car project. Self-driving cars are interesting, because they can figure out some of the aspects of self-driving, but they still can't always figure out whether there are people in the crosswalk at certain times. So that's interesting; I hope you guys find some success. And then Jesse says: "I have things, but I'll say them on Slack, harder to say here: interesting influences of trajectories and structures that support capacities for development."
Good talk, thanks. Yes, like I said, there are a lot of connections to things like feedback and genetic algorithms and other things that aren't immediately apparent in reinforcement learning, but we can talk about those as well. So yeah, thanks. Okay, well, we're at the top of the hour, and so when we... okay, what about an algorithm for avoiding self-driving cars? Oh yeah, you dropped out when we were talking about that.
So I said that they've made some pretty good advances in self-driving cars, but sometimes the self-driving cars can't identify whether there's a pedestrian in the crosswalk. So there's a problem with that, and it's something that needs to be solved, obviously, before we can really put self-driving cars on the road in large numbers. Apparently in San Francisco they have a lot of self-driving cars, and you get reports of people getting into accidents and things like that. So it's a problem, but it's something
A
Of
course
it
needs
to
be
solved
before
widespread
adoption.
So
all
right
so
well
thanks
for
attending
everyone,
and
if
you
need
to
contact
me
over
the
course
of
the
week,
send
me
a
slack
message
or
email
and
next
week
we'll
try
to
move
the
meeting
to
Monday
morning
instead
of
Wednesday
and
I'll,
send
an
email
out
about
that
all
right.
Well,
thank
you.
Everyone
have
a
good
week,
see
you
later.