From YouTube: Re-Work Deep RL / Applied AI Summit Day 1 Recap
Description
https://www.re-work.co/events/ -- Watch live at https://www.twitch.tv/rhyolight_
...she talked about three different ways you might attack the system. One being integrity, which is trying to prevent the system from doing what it's supposed to be doing — like interacting with it, doing certain things with it, that exploit it and keep it from doing what it should be doing. And then the other is getting it to do something the attacker wants. Jeremy, nice to see you. So...
He's walking through the environment and he gets caught in a hole, and he's just stuck, right? You're just stuck in the hole — and what do you do? You reset. You have to basically abandon your state and reset to an initial state so you can continue. And when you reset like this, it's not continuous learning, right? Oh, I'm sorry — thanks, I can fix that. I can fix that. How's that?
Is that better? All right, I fixed it before you guys even noticed. Okay, cool. So, is this reset-free learning the same as continuous learning? It didn't seem like continuous learning to me — at least not what I would call online learning. Some people call continuous learning the same as online learning. But none of these reinforcement learning systems were necessarily online.
They weren't. The models had to be retrained over and over: new environment, new features of the environment, retrain, retrain again. And it's a massive amount of training to get to a point where you have a model that can run through a test environment again. But there was some effort to try and mitigate these resets, and this was mentioned in a couple of the presentations. Exploration was defined as a robot moving objects and getting some updated image caption.
That's not the way I define exploration, because when I talk about exploration, I think about online learning. When they talk about exploration, they're talking about training a system to explore — not inherently exploring and updating its model immediately. None of these things update their model immediately when they explore. It's all a policy: the policy has exploration baked into it because it was trained to do so. I didn't quite understand this.
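To make "exploration baked into a policy" concrete, here's a minimal sketch of my own — not from the talk — of the most common version, epsilon-greedy action selection, where exploration is just a fixed chance of acting randomly instead of exploiting the value estimates:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon take a random action (explore),
    otherwise take the action with the highest estimated value (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

rng = np.random.default_rng(0)
action = epsilon_greedy_action(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng)
```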
There's a paper about this Leave No Trace thing I'm trying to get to. "You think of online learning as learning while also outputting?" Yeah — so they're online, I guess, yeah. "While continuous learning could be an offline system that's only learning but not outputting anything yet?" That could be. But all of these reinforcement learning systems are outputting: they're taking actions, they're moving through an environment.
I would say that — well, I went to all these deep reinforcement learning talks because I'm interested in the movement aspect of them, because they have this loop where you take an action and then the state changes, right? You take an action and the environment changes. That's familiar to me, because the brain does the same thing: it takes an
action, and then the sensory input is updated. When you move, you see yourself move; you see the environment change. So I like that similarity to HTM. Okay, so the Leave No Trace thing I didn't quite get. But if you know what a Q function is in reinforcement learning — like in Q-learning — it's like this huge Q, as in a big table or buffer of all the previous states and actions you've had, so you can sort of look up the best thing to do.
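As a rough sketch of what that lookup picture actually is (my own illustration, not from the presentations): tabular Q-learning keeps a table Q[state, action] of value estimates and nudges each entry toward the reward it just saw plus the best value it expects from the next state:

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))   # the "big table" of value estimates
alpha, gamma = 0.1, 0.99              # learning rate and discount factor

def q_update(state, action, reward, next_state):
    """Classic one-step Q-learning: nudge Q[s, a] toward
    (reward now) + (discounted best value we think s' has)."""
    td_target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (td_target - Q[state, action])

# acting is then just a lookup: pick the action with the best estimate
best_action = int(Q[3].argmax())
```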
It's a cheat, honestly — it feels like a hack to me, the whole Q function thing. But they're using the Q function to learn the probability of a bad state and then go back to the initial state. I think that was a way to avoid these resets, right? So before you get to a place where you need to reset — and resetting would be expensive, in a way, because you'd have to cut off your progress and go back to an initial state — it's recognizing a bad state and then backtracking.
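If I understood the Leave No Trace idea, the mechanism looks roughly like this sketch — my guess at it, with hypothetical names: alongside the task policy you learn a value estimate of "can I still get back to the initial state from here?", and when that confidence drops you backtrack instead of pressing on:

```python
RESET_CONFIDENCE = 0.8  # hypothetical threshold; the real method tunes this

def step_with_early_abort(state, task_policy, reset_policy, reset_value):
    """Take a task action only while we're still confident we can reset.

    reset_value(state) is a learned estimate (e.g. from a Q function) of the
    probability that reset_policy can return to the initial state from here."""
    if reset_value(state) < RESET_CONFIDENCE:
        return reset_policy(state)   # backtrack before we get truly stuck
    return task_policy(state)        # safe to keep pursuing the task
```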
Somehow. I did not quite understand that, honestly. But I did take away that it's difficult — this idea of exploration is difficult. I think they talked about this word "empowerment," which I think was just a term they used in one of their papers. It's the ability to predictably change the future state of the world: take an action that changes the state of the world, and know that when you do something, you're going to change the world.
This seems like some aspect of a policy that adds exploration to it — if you're empowered, it's sort of a way to explore the state space.
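For what it's worth, in the literature empowerment is usually formalized as the maximum mutual information between a sequence of actions and the resulting future state — this is my reading of the term, not a formula from the talk:

```latex
% Empowerment of a state s: how much an agent's next n actions can
% predictably influence the state it ends up in -- the channel capacity
% from the action sequence A^n to the future state S'.
\mathcal{E}(s) \;=\; \max_{p(a^n)} I\!\left(A^n ; S' \,\middle|\, s\right)
```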
The summary talked about rewards, resets, and exploration.
Okay, next talk. So this was a deep dive — well, an introduction to deep reinforcement learning, which I thought I could use. This is sort of just the basics of reinforcement learning, by Joshua — I don't know how to say his last name — at OpenAI.
This is probably some basic stuff that I'll go over, because I'm still learning about this stuff too. Reinforcement learning is useful when evaluating behaviors is easier than generating them. With reinforcement learning, it seems like the trend is: you have an action space, right? You have to define that action space ahead of time, for whatever task you want your agent to do, in whatever environment it is. You have to define this action space somehow; it doesn't automatically generate new actions.
None of this stuff is generating actions. You're not creating new ways to interact with the environment; you have a library of actions that you're selecting from, right? With a policy — a reinforcement learning policy, which they denote as π in all of the equations.
There's also a term they used called "trajectory," which is a sequence of states and actions — which is, I think, what gets stored in the Q in Q-learning. They call this thing a trajectory, which is just a sequence of states, in order, and they also call it an episode or a rollout. These are all new terms for me. The reward function I generally knew: basically, it's a function you run to tell you how good your state is.
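Putting those terms together, here's a minimal sketch (my own toy example, assuming a gym-style environment with `reset()` and `step()`) of collecting one trajectory — a.k.a. episode or rollout — under a policy π, and scoring it with the summed reward:

```python
def collect_rollout(env, policy, max_steps=1000):
    """Run one episode of policy pi in env.

    Returns the trajectory -- the ordered (state, action, reward) sequence --
    and the total reward, which is what the reward function scores."""
    trajectory, total_reward = [], 0.0
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)                           # a = pi(s)
        next_state, reward, done, _info = env.step(action)
        trajectory.append((state, action, reward))
        total_reward += reward
        state = next_state
        if done:                                         # episode over
            break
    return trajectory, total_reward
```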
So: how valuable was that? I didn't totally understand that. I liked this graph here — I know it's not super easy to see, but there we go — breaking up the reinforcement learning algorithms into model-free and model-based reinforcement learning. Here's Q-learning, which I talked about. A lot of the interesting things, it seems, are happening in these deep Q networks — DQNs, deep Q-learning networks. And then there's all these other things; a lot of them I think they called PPOs. I don't really know — it stands for something about policy optimization, I'm sure: Proximal Policy Optimization. And then the model-based stuff they didn't talk about nearly as much. I think that's harder — it seemed to me like the model-based stuff was harder to figure out, so most of the work, it seems...
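For the DQN part, the "deep" bit just replaces the Q table with a neural network; the training target is the same Bellman bootstrap as in tabular Q-learning. A one-transition sketch (my illustration, with the network treated as a black box):

```python
def dqn_td_target(reward, next_state, done, q_network, gamma=0.99):
    """Bellman target for a DQN: r + gamma * max_a' Q(s', a'),
    or just r if the episode ended.
    q_network(s) returns one estimated value per action."""
    if done:
        return reward
    return reward + gamma * max(q_network(next_state))

# training then minimizes, e.g., (q_network(state)[action] - target) ** 2
```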
So this is a direct quote from Joshua: "Learning models is a really hard problem." And I like to point that out, because that's where I think the big opportunity is for HTM, in association with reinforcement learning: how can we use HTM models along with reinforcement learning? Because I think we have a good mechanism for creating models as sensorimotor integration evolves, and we have a way to generate movements.
Or at least to have some causality in the movement space. And this is one of the problems, because reinforcement learning isn't good at generating movements — you have to have this action space. So I don't know how to resolve this, but that's one takeaway; I didn't know that was the case.
Maybe that's an opportunity for us. Also, reward function design is apparently very hard. And a big takeaway is that this stuff is still really new. I mean, this guy's from OpenAI, and he's saying that most of these deep reinforcement learning implementations — first of all, there's not very many of them, and they're all tuned for research. So if you're going to try and deploy any of this stuff in production, you're basically...
Okay, this next one is from Google Brain — or Google research, Google Brain; there's a DeepMind one later. This was "Learning Abstractions with Hierarchical Reinforcement Learning." This is sort of interesting. Now, what they mean by hierarchy is: say you have an agent, and it's got legs. So, first of all, the reinforcement learning action space for that agent's
locomotion involves things like "move this joint 30 degrees out" or "move this other joint 20 degrees in." That's the sort of action-space granularity we're talking about at the lower level of the hierarchy. Now they're talking about another hierarchical level that's more concerned with navigation through a bigger space. So say you have a maze, and you need to get from point A, around some obstacles, to point B.
So that's what they're talking about when they talk about hierarchy — high level versus low level. Some of the things I noted were that this high-level part of it operates at a lower frequency, so it's easier to learn in that space, and exploration is easier, because you don't have to worry about the details of locomotion — you put that off to the lower level of the hierarchy. Then Intrepid Fox says:
"I understand that in reinforcement learning, predefined rewards and action spaces have always been the biggest problem. The space has become infinitely large and we're trying to hard-code this stuff, essentially trying to account for anything and everything that could happen ahead of time." Yeah — and you can't do that, right? I mean, your environment is going to change in ways that you will not anticipate, and I think we understand that. Okay, so one of the things he said that I didn't agree with was this: he said we humans don't explore by just flailing.
He was talking about having a sort of high-level policy that does navigation versus the low-level policy. But that's not quite true — we learn how to explore by flailing. Babies flail, and that's how they learn how their limbs locomote, you know, move through space. So there were a few quotes about the brain, and analogies to human development and stuff, that were a little bit off, I would say, versus how I think about the brain.
"High-level abstractions must be created" — so this stuff is hard-coded; it doesn't just learn high-level abstractions. Jeremy says, "You should hear me trying to learn a Bach prelude — definitely some flailing." And it's true: you do flail. When I'm learning the guitar, one of the things they tell you to do is meander, and meandering is just exploring. It's just, "What happens if I do this?" — because you learn when you do that.
Okay, so the big decision in designing this kind of system, with a high-level abstraction, is: what are the abstractions? If you're talking about navigation through a maze, you might decide on "move right," "move left," or north, east, south, turn — stuff like that could be the abstractions. And then the low level is just the locomotion required to execute that movement.
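The overall shape of that, as I picture it (my own sketch, names hypothetical): a high-level policy that picks an abstract goal at a lower frequency, and a low-level policy that turns the current goal into raw locomotion commands on every step:

```python
HIGH_LEVEL_PERIOD = 10   # the high level acts every 10 steps (lower frequency)

def hierarchical_step(t, state, high_policy, low_policy, current_goal):
    """One tick of a two-level controller.

    high_policy: state -> abstract goal ("move left", a waypoint, ...)
    low_policy:  (state, goal) -> raw action (joint angles, torques, ...)
    """
    if t % HIGH_LEVEL_PERIOD == 0:
        current_goal = high_policy(state)        # e.g. "head northeast"
    action = low_policy(state, current_goal)     # the locomotion details
    return action, current_goal
```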
You know, the high-level movement. But every system is going to be different. If you're modeling a hand, or something with a completely different navigation system, you might have totally different high-level abstractions — for example, open and close your grip. For a hand, there's a huge array of different grips that you can do. You can look it up — Google "hand grips"; there's a ton of them — and that's just from inspecting humans and how they operate with different tools.
There are a lot of different grips you might use. So all those high-level abstractions have to be hand-coded, essentially, for whatever system you're creating. "Goal-conditioned hierarchical reinforcement learning" — I don't remember writing this: "low-level goals just need to figure out how to accomplish high-level steps." Yeah, I don't remember why I wrote that. So I think some of the open questions are: how do we learn efficiently?
They're saying that these high-level policies must be modular. Like "move left" — it must be something you can apply wherever you're at in the space; it can't be dependent on some particular state, I think. And the high-level training depends on the ability of the low-level policy: you can't have high-level goals that the low-level policies can't achieve. For example, if you want to move up some stairs, there has to be a low-level policy that knows how to climb stairs, right?
High-level training must be on-policy. I'm still trying to understand this idea of on-policy versus off-policy — I wrote down a definition of it later, so we'll get to that; I'm not going to go into it right now. But you can't go off-policy for the high-level training, because it's too inefficient and it's unfeasible in the real world. They did say a little bit about off-policy corrections, but it just seemed really hacky to me, so I'd just say it was a hack.
Once again — he calls these quality diversity, QD, algorithms, and these really have an evolutionary flavor to me. It seems like what it is, is a way of jumping around some space of solutions: not just looking at one specific area, but being able to jump from one location in the search space of solutions to another location. This did not seem very brain-based to me; it was more about the evolution of organisms, or of cultures.
Even so — he used the term "adaptive radiations." For example, there are different types of fish in different ponds in Africa, and they have all adapted specifically for their own environments, but they all came from a common ancestor. They've all become very efficient in different areas. The computer was another example he used of adaptive radiation: we started with one very specific type of computer, but now there are all different types of computers doing specific things.
These are all open-ended algorithms, meaning that they will continue to improve as long as they have more things to train on. When we talk about AlphaStar, we'll talk about open-ended algorithms as well. So, this POET thing — what did it stand for? I forgot — Paired Open-Ended Trailblazer. This is some framework that he and some colleagues created that periodically generates new environments: it optimizes in one environment, and then it will systematically generate new environments.
I don't think these are completely generated from scratch — there's some hard-coding in there. And then it will transfer its learning: it actually transfers weights from what it learned in one environment to the next environment. Here's an example: you've got a little agent, and he's learned to walk across a flat space. Okay — so let's take that and transfer it to a new environment. The new environment's got flat space, but it's also got these little stumps.
So the agent now knows how to navigate through flat space, but now it's got to learn more about the stumps. It's sort of a way to separate learning about different aspects of different environments. You might have another environment that's got rocky terrain, another one that's got pitfalls — things you don't want to fall into. So you learn all about one environment — that's sort of the idea behind this — you learn all about one environment, and you transfer it to another.
You learn about that one, then you transfer to another, and then you can jump back and forth and try different environments, carrying this knowledge transfer along — and you do this all in parallel — and try to find the sweet spots in the search space for your agent. So that was interesting. He had a lot of good graphics; you should look it up. Look up Jeff Clune, MAP-Elites, or POET — he's got talks online about this, like, long talks.
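The POET loop, as I understood it, goes roughly like this sketch — heavily simplified, and all the function names here are mine, not theirs:

```python
def poet_like_loop(seed_env, seed_agent, mutate_env, optimize, score, n_iters=100):
    """Very rough POET-style loop: keep a list of (environment, agent) pairs,
    keep optimizing locally, grow new environments, and seed each new one
    with weights transferred from whichever existing agent does best in it."""
    pairs = [(seed_env, seed_agent)]
    for _ in range(n_iters):
        env, agent = pairs[-1]
        pairs[-1] = (env, optimize(agent, env))            # local optimization
        new_env = mutate_env(env)                          # add stumps, gaps, ...
        donor = max((a for _, a in pairs), key=lambda a: score(a, new_env))
        pairs.append((new_env, optimize(donor, new_env)))  # transfer, then adapt
    return pairs
```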
I did not understand this next talk. I tried — I don't understand these drawings that I made; I was just trying to follow along. I did not understand what this was all about, so: fail, big fail. This next one was interesting just because it's StarCraft — you know I like StarCraft. From DeepMind: AlphaStar, which is mastering the real-time strategy game StarCraft II.
Some of the challenges they talked about: StarCraft is a complicated game, so, I mean, this is impressive — this is really impressive. There's hidden information in StarCraft, because you only get to see what's around your troops; the rest of the map is clouded, and you only see the map as you move things through it and they explore. And there's this huge action space — they've defined something like a 10^8 action space, since you can do so many things.
So here's sort of the architecture, if you're interested in the AlphaStar architecture. The core of it is a deep LSTM system, but they've got all these other deep networks — maybe not deep, but at least neural networks: there's a ResNet here, there's a feed-forward net, and then transformers. And this is highly tuned to StarCraft, by the way.
This would not be easy to transfer to any other game — even something like Warcraft, or maybe even StarCraft I. I'm sure you wouldn't be able to transfer it. You have these ideas of spatial observations and economy observations, because you're always building things, and you've got materials that you're trying to optimize and units that you're building. But at the core of it is a deep LSTM system, and what comes out of that is: move, or attack, or mine.
The action space — I don't know how they define it, but it's totally hard-coded for StarCraft II. I don't know exactly, because you can select any unit and move them to any place, or give them an action in any spatial location. I have no idea, but I guarantee you it's highly tuned to StarCraft.
So, the way they started training these — and this is a massive amount of training. Obviously, if you don't know this already: AlphaStar beat the best StarCraft II gamers in the world, over and over — ten times out of ten. So it was a big win for AI. But they started by getting human replays from Blizzard.
Blizzard had information about humans playing the game, so they initially trained on humans playing the game, and that's how they got their seed for AlphaStar to play on. And once they had some agents that were trained on human players, then they would create new agents and train them to beat those agents, right? Then they'd create diverse representations of those agents, and for every agent they make, the goal would be to beat all the previous agents.
So it was a ton of agents, and they would encourage diversity, which they said was crucial — they had to do this; if they hadn't encouraged diversity, I don't think it would have worked. The way they did this was to give the different agents slightly different goals. So for one agent, its goal — what it would be rewarded for — would be beating all of the other agents in the league.
For some of the agents, they would reward it just for beating one particular agent, because then it would develop specific strategies just to beat that one agent — it wouldn't attempt to generalize its strategy across the whole space of agents — and that would inject some diversity into the training environment. They would also reward some of their agents for building different types of units: they would hard-code some reward, and make some agents that would get more reward throughout the game for building particular types of units, or mining particular types of resources, and stuff like that.
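A cartoon of that league setup — my own sketch, the real system is far more involved — where each new generation trains against the whole league but gets a randomly assigned, slightly different objective, which is what injects the diversity:

```python
import random

def reward_beat_everyone(wins, stats):
    return sum(wins.values())               # rewarded for beating the whole league

def reward_beat_rival(rival):
    return lambda wins, stats: wins[rival]  # rewarded for exploiting one agent

def reward_unit_style(unit):
    # rewarded for winning *and* for building a particular unit type
    return lambda wins, stats: sum(wins.values()) + 0.1 * stats[unit]

def grow_league(league, train):
    """Add one generation: train a new agent against every existing agent,
    with a randomly assigned objective to keep the league diverse."""
    objective = random.choice([
        reward_beat_everyone,
        reward_beat_rival(random.choice(league)),
        reward_unit_style("zealots"),       # "zealots" is just an example unit
    ])
    league.append(train(opponents=list(league), reward=objective))
    return league
```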
So again: hard-coding things, for sure. I do not know what the Nash strategy is — I tried to follow that, but I didn't — but it's some type of probability distribution over all of the agents that's optimal. The interesting thing is, for AlphaStar to beat these human grandmasters, they trained over 600 agents — right, in that scheme where every one has to beat all the other agents, and then they create a new version of it.
It has to beat the other agents. And each one of these agents went through more than a thousand years of in-game training. That's six hundred thousand years of training, which blows my mind — you know how much compute power that is? That's crazy. But that's what it took to beat these grandmasters.
Okay, so each one of these agents iteratively learned from all the previous versions, and this was sort of interesting: as they watched the evolution of these agents, they saw that initially the agents would expand their bases, and that would win for a little bit. But then, in the next generation of agents, some would be more aggressive, because they'd learned that by being aggressive you could take over all those bases. So being aggressive was then rewarded, right?
So then the next generation of agents was rewarded for being defensive, because they were being attacked all the time. So you go through these evolutions of strategies, up to the point where, after you've gotten defensive, you realize: "Well, now I need to go scout and see where I'm going to be attacked from, and by what." So there's this evolution of strategy over time, as these agents are constantly fighting each other and trying to come to the best solution.
So the big takeaway from this was: you should encourage diversity in your reinforcement learning agents by allowing them to have different goals. Okay — it was a long day — okay, into the afternoon. "Injecting Structure for Generalization in Robot Manipulation" — this was an NVIDIA talk; again, generalization. So for his examples this guy gave, like, a video. You remember Rosie the robot from The Jetsons? If you're my age, you probably do. It was the house-cleaning robot, basically — with attitude; she had a lot of attitude. But, sans the attitude:
he wants to build robots that are able to do lots of different things in unstructured environments. So the big question is: how do we generalize in all these unstructured environments? I'm not sure I took a lot from this — I took some weird notes, because I was planning on talking about control, because he had this library for control and planning and perception, but then that didn't go anywhere. So I just made notes about visuomotor skills, diversity of skills.
Can we build representations that can transfer to similar tasks? Those are the questions he's asking. He's talking about sensor fusion, and the need for it to create a general representation. Sensor fusion meaning: you might have torque information from a robot arm, and you might have camera information. One of the nice things is that this adds some robustness, because then you can interfere with the camera and it can still do some things, because it has other information coming in.
Representation transfer between tasks — that's a challenge. I mean, nobody's really solved this yet, but what you want to be able to do is learn from when you're vacuuming, and how those actions can be applied to other tasks, other goals that you have. But currently, all these policies need to be relearned when you're jumping between tasks. "Model-based task-space control" — I didn't know what that meant.
So I wrote it down. The main takeaway from this, for me, was: action representations and self-supervision provide structure. He had more on that slide, but he switched so fast I couldn't write it down — this guy went really fast through his slides, so it was hard for me to take in the whole thing. Hello, Mark Brown. I'm more than halfway through the recap of my conference day; I learned a lot about reinforcement learning, and boy, are my arms tired. "Quantifying Generalization in Deep Reinforcement Learning" — again, generalization. This was an OpenAI one.
OpenAI has some platform called CoinRun, which is a game platform, and what it does is generate an infinite number of levels for training. That's beneficial because it gives you an environment where you're forced to generalize — unlike Sonic: when they did the Sonic the Hedgehog thing at OpenAI, they only had like 50 levels, and you can only do so much with 50 levels.
Red Fox says: "Relearning for different tasks — a common theme. Deep learning can do incredible things, but we're not really any closer to AGI." I agree, I agree. Okay: large training sets are better — obviously; for deep learning, large training sets are better. Deep architectures generalize better — yeah, file that under "duh." Agents can overfit to a large number of specific environments. So: nothing mind-blowing out of this talk. Okay, okay, here we go, into off-policy versus on-policy. Here's where I wrote down: "I don't quite understand on-policy versus off-policy."
Yet. I'm going to need a night to sleep on it, I think. So, this talk was called "Off-Policy Reinforcement Learning for Real-World Robots," from Google Brain. On-policy means you can only train on data from your one agent — from the current agent — and that data is not reusable for new environments. So I think that means the policy is tied to an environment and an agent. When they say off-policy, I think they're talking about learning transfer, something like that.
Basically, they're all training, and they're all collecting this data so that they can create reinforcement learning policies off of it. And if you do this off-policy thing, it lets you train these reinforcement learning models without having robots in the training loop — which is great, because robots in the training loop are expensive.
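The crux of off-policy learning, as I understand it, is that the update only needs logged (state, action, reward, next state) transitions — they can come from old policies, old robots, whatever. A minimal sketch, reusing the kind of Q-learning update from earlier:

```python
import random

def train_off_policy(q_update, logged_transitions, batch_size=32, n_updates=10_000):
    """Learn entirely from stored (s, a, r, s') tuples -- no robot in the loop.

    logged_transitions could have been collected by any policy, at any time,
    on any robot; that's exactly what makes this *off*-policy."""
    for _ in range(n_updates):
        for s, a, r, s_next in random.sample(logged_transitions, batch_size):
            q_update(s, a, r, s_next)
```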
So on-policy is good for specific environments — like an Amazon warehouse, I guess — as long as the environment doesn't change and the agent doesn't change. There was a lot of talk about Q-learning. This guy talked about a specific type of Q-learning called QT-Opt, which they were using on robots.
They would train an off-policy system, but they would use on-policy to fine-tune things, and that would get their accuracy from like 85% to like 95% or something. So they would only use a few robots — they'd need much less robot time, essentially — to train a reinforcement learning system to do something. So this seems like optimizations that allow on-policy fine-tuning on top of the off-policy thing: improving QT-Opt to use less real robot data, using a simulation and trying to transfer that learning to the real world.
So, the difficulty in off-policy evaluation is that old agent behavior does not equal current agent behavior. That was my takeaway. Does anybody remember Zork? Bitking, I'm looking at you. So this next one was about reinforcement learning in interactive fiction games, and I thought it was interesting because I used to play the Hitchhiker's Guide to the Galaxy text-based game — that was my first experience with text-based games, and I thought it was really fun. So this talk was about Zork, which is an entirely text-based game — text for the controls and text for the state. Yeah, but it's like...
Current voice assistants are not reinforcement learning — that was one thing he said. So when you're talking to Alexa or Siri or whatever, that's not reinforcement learning; that's just deep neural networks, because apparently they're too costly to train, and they still need to study how they work.
Hello, hello, Maverick — watching Spider-Man, okay. Sorry. And "on subscribe, and you won't have to chop down the door" — that never worked, yeah. So you know what I'm talking about: you played Hitchhiker's Guide. That was great. That was a really intriguing game for me, because I played that game before I read The Hitchhiker's Guide to the Galaxy — I was really young — so it was even more intriguing, because I did not know the storyline or anything. And no, the mic's not turned off, you guys.
So, when you're dealing with text, one of the things is that it's a huge action space. When you're dealing with a game like Pong or whatever — Atari games, or even PlayStation — you have a few buttons, and there's only certain combinations; there's a very finite combination of buttons and actions that you can take with that button pad. But when you're dealing with text, it's a huge action space, because you're entering phrases, not buttons.
Let's see: "must retain a history of states" — that's one of the restrictions. So he had this general game-playing agent that he constructed at Microsoft, called NAIL, if you want to look it up. It doesn't perform very well on any one game, but it performs decently — like novice level — across 20 different text-based games. So that was interesting, right?
Okay. He also described a couple of other text-based algorithms aside from that. An A* search, which is the most handicapped — handicapped being, you're giving it a lot of a priori information: all of its actions were predefined specifically for Zork. And this did really well, because it was finely tuned to Zork — it's not generalizable at all. It also had this ability to travel through time, which is like a replay.
You know, if you get to a point where you fail at the game, or you die, you can just step back, step back, step back, and then retry, right? So it had the ability to do something like that baked into the system. The next thing — and I'd heard this idea of the actor-critic model in reinforcement learning before, but I don't know what it is — they call this A2C, or advantage actor-critic.
A single policy, but multiple parallel environments, right? So you're running a bunch at the same time, using the same policy. Again, it has a fixed action set.
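For the record, my understanding of the actor-critic idea — this is me filling in what the talk didn't: the critic learns a value estimate V(s), and the actor (the policy) gets pushed toward actions that turned out better than the critic expected; that difference is the "advantage" in A2C. A one-transition sketch:

```python
def a2c_losses(log_prob_action, reward, value_s, value_next_s, gamma=0.99):
    """Advantage actor-critic losses for one transition.

    log_prob_action: log pi(a|s) from the actor for the action actually taken.
    value_s, value_next_s: the critic's estimates V(s) and V(s').
    (In a real autograd framework you'd stop gradients through `advantage`
    for the actor term.)"""
    advantage = reward + gamma * value_next_s - value_s  # better than expected?
    actor_loss = -log_prob_action * advantage            # reinforce good surprises
    critic_loss = advantage ** 2                         # fit V(s) to the target
    return actor_loss, critic_loss
```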
So this was, again, specifically tuned toward each game — but there's no time travel. Now, NAIL was his thing; I think it's open source, and you can look it up.
Nothing too interesting there. I talked to him about what we do at Numenta, because he talked about things about the brain, and he was super interested in what we do — but I didn't get much from his talk. Okay: "What are checkpoints?" Yeah — checkpoints, meaning time travel; meaning any action you take is a checkpoint, and you can go back to that state and try again. You know — think choose-your-own-adventure.
Every time you die, you're just like, "Well, I'll go back to the page I was on before and try something new." Yeah — what Mark said. Okay, so: from word embeddings to pre-trained models. This was a talk from Amazon — just sort of a recap of the recent history of NLP developments. Word embeddings — she talked about word2vec a lot, and fastText.
I mean, just look up word embeddings and these technologies. One thing I noted was she talked about this thing called ELMo, and she referenced a blog post by Mihail Eric, who used to be an intern at Numenta but currently works for Amazon, for Alexa. And there was this other thing she called BERT, which I guess Google has created — Bidirectional Encoder Representations from Transformers. It's another type of language-processing, NLP type of thing. I don't know much about it.
You could go pretty far into it, but I saw someone else talk about it too — one of the vendors was talking about how they could use BERT. It's based on word vectors — basically word vectors... word embeddings, excuse me. The thing is, plain word embeddings don't give you a good context for words. So when you say "I eat an apple" or "I use an Apple computer," it encodes those word embeddings the same way — but BERT apparently has different embeddings for different contexts.
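A toy illustration of that difference — hypothetical names, not a real API: a static embedding table gives a token one fixed vector no matter the sentence, while a contextual model like BERT gives each token a vector that depends on its neighbors:

```python
# Static embeddings: one vector per word, identical in every sentence.
static = {"apple": [0.12, -0.40, 0.88]}    # same vector for fruit and company
v_fruit = static["apple"]                  # "I eat an apple"
v_company = static["apple"]                # "I use an Apple computer"
assert v_fruit == v_company                # the context is lost

# Contextual embeddings (BERT-style, shown here as a hypothetical black box):
# the vector for "apple" would differ because the surrounding words differ.
#   v_fruit   = contextual_model("I eat an apple")["apple"]
#   v_company = contextual_model("I use an Apple computer")["apple"]
#   v_fruit != v_company
```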
The conversations I had were interesting. I probably talked to — let's see: one, two, three... maybe ten, twelve different people, in depth, about what we do at Numenta, and how the brain is the way toward AGI, and how important it is to understand the brain. None of the stuff that I saw today was truly brain-inspired in the way that we talk about being brain-inspired. And when I did talk to people about what we do here,
people were excited about it. They wanted to know more. I probably gave out 20 different business cards, so that's good — that's good. I mean, I think people realize that perhaps this is something we should be paying attention to. Am I still streaming? Because it looks like... it looks like—
They had these little things stuck around, and here's an example of one — I don't know if you can see that: "Will AGI ever be reached?" And you've got these stickers, and you could put your sticker wherever you wanted it — and everybody said we're a long way off.
"Yes, we'll reach AGI, but we're a long way off." So that was encouraging to me, because it means people realize that what we have today is not it — AGI is not around the corner — and I think the hype cycle promotes it like it's around the corner, and it's not. All right, that's it. Does anybody have any questions before I sign off? "We need to figure out what general intelligence even means." I have a good idea what it means; my definition of intelligence is the—
Capsules — were they in the air? Were capsules around? One person asked me about capsules. Nobody talked about capsules, and the only reason he talked about capsules is because I was talking to him about locations and sensorimotor stuff and object modeling, and he said, "Is that anything like Hinton's capsules?" And I'm like, yeah. And then, so—
Yeah — but Hinton is a genius, man; it's hard to put it any other way. He's certainly really ahead of the curve. Oh, so Intrepid Fox saw the Monday meeting about capsules — cool, that was pretty interesting. I'm really happy to see us sort of reviving those things, and I heard Jeff say several times, "Man, I wish we would have cited him" — because he hadn't read those papers, and after Marcus gave a recap of all these papers, Jeff was really impressed.
He was happy, because it gives us an indication that we're on the right track. If this is what Hinton was saying such a long time ago, and we're coming to the same conclusions independently, that's a good thing, right? That's a really good thing — that's an indicator that we're doing the right thing.
All right, you guys, I'm gonna head out. Thanks — thanks for watching, dubdubdub dude, I appreciate the time. No problem; I'm happy to do it as long as I've got viewers that are interested. Yeah, go ask questions on the forum — I'll check them out later. Take care; I'm stopping the stream. Have a wonderful weekend. Oh, by the way: Monday we're gonna do active duty cycles — spatial pooling active duty cycles. It's going to be cool; I have some good visualizations in store for Monday. All right, all right — take care.