From YouTube: Steve Omohundro on GPT-3
Description
In this research meeting, guest Stephen Omohundro gave a fascinating talk on GPT-3, the new massive OpenAI Natural Language Processing model. He reviewed the network architecture, training process, and results in the context of past work. There was extensive discussion on the implications for NLP and for Machine Intelligence / AGI.
Link to GPT-3 paper: https://arxiv.org/abs/2005.14165
Link to slides from this presentation: https://www.slideshare.net/numenta/openais-gpt-3-language-model-guest-steve-omohundro
A
Okay, so I think we're recording now. Yeah, okay, great. So I'm thrilled today to have one of my favorite scientists on the call, Steve Omohundro. Steve has actually influenced Numenta in a couple of different ways. As most of you at Numenta already know, Steve was my PhD advisor while he was a professor at the University of Illinois Urbana-Champaign and then at UC Berkeley. So that's a Numenta connection.

He was also the thesis advisor for Bartlett Mel, and Bartlett was the first to propose that active dendritic properties in pyramidal neurons could exist and have a computational impact, so that work on active dendrites has certainly impacted us quite a bit in our models. It's kind of hard to characterize Steve and the wide variety of work he's done over the years. He has a PhD in theoretical physics from UC Berkeley; he's been a professor of computer science; he's been an entrepreneur and has founded several companies.

He has created two programming languages that were widely used: StarLisp, the parallel language used on the Connection Machine at Thinking Machines Corp., and Sather, an object-oriented language he developed at Berkeley that I used for many years. He's done quite a bit of work in cryptography, both in research and in industry. Maybe more relevant to today's discussion, Steve has worked in machine learning, computer vision, natural language processing and AI since the late 80s. He's conversant in just about every machine learning algorithm you can imagine, and more recently he's been spending a lot of time thinking and talking about the implications of AI for society and the ethical implications of all that. He's currently the chief scientist and a board member at AIBrain, a company creating new AI technologies for learning, conversation, robotics, simulation and music.

So today's talk actually came out of a discussion I had with Steve a couple of weeks ago. We were talking about GPT-3, and it became clear that the advances going on in natural language processing today are quite startling. There are many interesting questions being raised, and I thought Steve would be the perfect person to lead a discussion on this at Numenta. So thank you so much, Steve, for coming and talking to us today.
B
Oh, thank you — so happy to be here. Yeah, Subutai and I were talking about the implications of GPT-3, and it's hard to wrap your head around what's happening. Just to highlight that: this morning it was announced that Google released GShard, a 600-billion-weight model, and there are rumors that people are working on trillion-weight models. We're in this period where these things are just growing dramatically, and so understanding the implications of that — scientifically for intelligence, and technologically and socially — I think is super, super important. Subutai and I had a great conversation; he suggested, hey, let's broaden it, and so I'm really happy to be here. I'd be happy to just do it interactively: I put together some slides, but we can delve into any part that's especially interesting, so the best way is to, you know, just cut in and make a comment or ask a question.
So that's what Subutai and I were talking about. What I'd like to do today is give you a rough outline of what the model is and what it looks like; what its scaling behavior is and where it seems to be going as these models get bigger and bigger; how it relates to all these word embedding, sentence embedding and document embedding ideas, and to something called distributional semantics — so, how does meaning come into this class of model — and then a characteristic of these models which is very different.
Some people are calling it software 3.0: instead of, you know, writing a program, or building a neural net model and training it, you instead give a prompt to one of these big models, and it figures out from your prompt in English what it is you want it to do. It's kind of a remarkable thing — it's almost like magic. Does this really work? I'll show you some examples where it does. So we'll go through that sequence, and at any point, if you've got questions or comments, jump in.
Remarkably — and in fact it can do translation right out of the box, though it's primarily English — any probability model like this can always be factored as a product of the probability of each word conditioned on the previous words. The simplest form of language model is an n-gram model; in, say, a trigram model you just take the two previous words.
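That factorization, written out from his description (the second line is the n-gram approximation, which keeps only the previous k words as context; k = 2 gives the trigram model):

```latex
P(w_1,\dots,w_n) = \prod_{i=1}^{n} P(w_i \mid w_1,\dots,w_{i-1})
\;\approx\; \prod_{i=1}^{n} P(w_i \mid w_{i-k},\dots,w_{i-1})
```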
You predict the next word based on those, and you can build such models just by recording, from a big data set, how often trigrams occur; then you use that to predict future words. People have been doing that for years, and starting around 2000 these n-gram models started beating all of the very complex linguist-designed models. That was maybe one of the early wake-up calls that these big compute-heavy, data-heavy models might supplant the clever, human-created, detailed models. I visited the Cyc project back in the 90s, where they had teams of linguists putting all the detail about the world into these systems; well, all of that has been supplanted by simple statistics.
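A minimal sketch of the counting approach he describes — tally how often each trigram occurs in a corpus, then read off next-word probabilities. The tiny corpus and names here are illustrative, not from the talk:

```python
from collections import Counter, defaultdict

def train_trigram(tokens):
    """Count how often each word follows each two-word context."""
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts

def next_word_probs(counts, a, b):
    """P(next word | previous two words), from the recorded counts."""
    following = counts[(a, b)]
    total = sum(following.values())
    return {w: n / total for w, n in following.items()} if total else {}

corpus = "the man took the dog for a walk and the man took a nap".split()
model = train_trigram(corpus)
print(next_word_probs(model, "the", "man"))   # {'took': 1.0}
```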
From that perspective, you can think of GPT-3 as a 2048-gram model, since it predicts the probabilities of the next word based on a context window of 2048 tokens.
Here's what the architecture of the model looks like. You take the context window of 2048 tokens; each token is mapped into a vector in a real vector space through an embedding matrix. So the first set of weights is in how you map tokens into this vector space, and then everything else is done in vector spaces. The whole pipeline it goes through is that these 2048 vectors get transformed into another layer of 2048 vectors, and another layer of 2048, and another layer of 2048.
This block here is the transformer block, and in GPT-3 there are 96 of these — ninety-six layers doing this transformer thing. Inside the transformer, the key part, the part that makes it magic, is this thing called self-attention, and I'll say what that is in case you haven't seen it; there are also some more traditional feed-forward networks in there. So you get 96 layers of that; then finally, at the end, you have one more feed-forward neural net, and then a softmax.
The softmax converts values to probabilities, and those probabilities are the probabilities of the next token. So, given an input of 2048 tokens, it gives you a probability distribution over what the next token is, and from that basic operation you can do all kinds of stuff; that seems to be the style of the modern model. So what is a transformer? What's this self-attention thing? A convolutional network takes a sequence of input vectors, and the output vectors are weighted combinations of those vectors.
B
Usually
it's
the
nearby
ones
get
combined
in
some
way
to
produce
something.
Those
have
been
very
influential
in
image.
Processing
a
little
bit
in
language
transformers
are
similar,
except
that
each
of
the
input
vectors
goes
through
three
matrices
to
produce
of
you
vector
a
key
vector
and
a
query
vector
and
the
output
vector
is
going
to
be
a
weighted,
linear,
linear
combination
of
the
value
vectors
of
the
input
and
the
way
that
you
decide
on
the
weights.
This
is
the
attention
piece
is
the
vector
you're
interested
in
its
query.
B
Vector
is
dot
product
with
the
key
vector
at
each
of
the
other
vectors
in
the
input
and
those
dot
products
are
then
normalized
and
serve
as
the
weights
to
combine
the
value
vectors.
So
it's
basically
just
a
linear
combination
of
the
value
vectors
but
where
the
weights
are
determined
by
this
self
attention
mechanism,
and
then
these
the
matrices
which
produce
this
value
key
inquiry
vector
those
are
all
learned
in
the
system.
The
whole
thing
is
trained
through
back
propagation
and
to
end
to
achieve
high
probability
of
predicting
the
next
word
on
the
training
set.
So it's sort of the most vanilla autoregressive language model you can imagine, with the one extra twist that it uses this slightly odd attention mechanism in the middle. That's all it is, just scaled up to be really big, and yet it does amazing things. Oh, one other comment: because it's an autoregressive model — it's trying to predict the next word from the previous words — the self-attention is masked, so it only looks at previous words. There are other variants of transformers.
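A minimal single-head NumPy sketch of the masked self-attention just described — queries dotted with keys, normalized into weights, used to mix the value vectors, with future positions masked out. The toy dimensions and the usual scaled softmax are standard transformer conventions, not numbers quoted from the talk:

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head masked self-attention, per the description above.
    X holds one row per token (embedding plus, in the real model,
    position information). Each row is projected to a query, key and
    value vector; output t is a weighted sum of value vectors at
    positions <= t, with weights from normalized query-key dot products."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # (T, d_head) each
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # scaled dot products
    T = scores.shape[0]
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf  # mask future
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # softmax -> attention weights
    return w @ V

rng = np.random.default_rng(0)
T, d, dh = 5, 16, 8                             # toy sizes, not GPT-3's
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, dh)) for _ in range(3))
print(causal_self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```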
C
I have a question before you go on. As you talk about this, it's all in the context of language, but this idea of the self-attention mechanism — is that something that you, or the field, view as language-specific, or is it a general capability that would apply to many different domains? Is it a language solution or a general-purpose solution?
B
Well, I think it's especially important for language, because what it gives you is long-distance connections, which language uses a lot. But people are taking this self-attention thing and dropping it, unchanged, into, say, convolutional networks for image recognition, and it improves performance. And then OpenAI recently did almost exactly the same model for predicting images: you give it the top half of an image and it predicts the bottom half, going pixel by pixel, just filling in the bottom half pixel by pixel — and it's remarkable the kinds of things it has learned, very high-level, semantic-seeming things. So it seems to be a kind of very general intelligence element that can be applied in all sorts of domains. There are several different alternatives for self-attention that people have proposed; this one is very efficient to run on GPUs, so I'm not at all convinced it's the optimal way to do it, but it seems to work, and there are starting to be papers now exploring the space of possible self-attention modules. All right — so these models seem to do better as they get bigger, and since they were introduced in 2017 or so, they've been growing exponentially, and we see that trend continuing.
As of this morning's announcement from Google, basically every big AI lab has been building variants of this kind of model and applying them to a number of different tasks. For the GPT-3 architecture, they actually built 8 different versions with different sizes so they could see how the scaling was going; the biggest version is 175 billion weights. The pipeline starts with a reversible encoding of the text — not quite what you would ultimately like to do, but at the moment they do it this way. With a context window of 2048 tokens, they have 96 transformer layers.
Each layer has that attention module like I mentioned, but they actually run 96 attention heads in parallel, so that different attention heads can discover different phenomena. For example, one attention head might look for, you know, the direct object of a verb; another one might look for the indirect object. Somehow the system, through backprop, is going to figure out what to do with all of these heads.
There's a lot of work these days on questions like: do you really need 96? Does it help to have more? Do different layers need different numbers of heads? All those questions are kind of unknown at the moment, I would say. Each of these heads produces a 128-dimensional vector, and they're all concatenated to form the vector for the next layer.
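The arithmetic of that concatenation, as a sketch (96 and 128 are the GPT-3 figures he quotes; the per-head outputs here are placeholders):

```python
import numpy as np

n_heads, d_head = 96, 128                 # the GPT-3 figures quoted above
head_outputs = [np.zeros(d_head) for _ in range(n_heads)]  # placeholders

# Concatenating the 96 per-head vectors yields one 96 * 128 = 12288-
# dimensional vector per token, which feeds the next layer.
print(np.concatenate(head_outputs).shape)  # (12288,)
```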
So those are the choices there are in making this thing; it's quite generic, really. The training data is 499 billion tokens. Some people estimated the cost of training it at 12 million dollars, though somebody figured out that on the cheapest GPU cloud you could do it for four million six hundred thousand dollars — but it would take 355 years.
B
Yeah — well, in fact, if you read the paper, one of the worries is that some of the tests they evaluated on appear on the web and so might have ended up in their training data inadvertently. They accidentally did have some leakage like that, but it was too expensive to go back and retrain, so they tried to account for it in their reporting.
You know, OpenAI said: we're not a nonprofit anymore, we're going to be a for-profit company — and Microsoft invested a billion dollars in them. It turns out Microsoft built the supercomputer they used to train this thing, and I'm guessing there's some sort of internal accounting where Microsoft, you know, donated a billion dollars, but then some of that comes back to them for running this computer. I'm also guessing that the Microsoft data centers will be among the first to use this type of model, so there are agreements of some kind going on back and forth. They also announced recently that they have an API to this model, which they're going to start licensing out for end users. So, some kind of complex business-and-research model... okay.
B
Yeah — and it's an interesting thing: academic research labs probably don't have access to this kind of compute power, so if the future of AI models requires this kind of compute, that could be an issue. You know, Subutai, earlier we were talking about the trade-offs between China and the US in AI development; it may be that just raw compute power is a really critical component. So here's their graph from a paper about a year or so ago.
They analyzed all of the transformer models and found very clean scaling relationships, the most important of which is the loss of a model versus the compute power, when you optimize everything else — the amount of training data, the size of the model and all that. They found this really simple, nice relationship, and these curves show different sized GPT models: this is their smallest one, and this is the new one, the really big one.
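Assuming the paper he's pointing at is the OpenAI scaling-laws result (Kaplan et al., 2020), the relationship is a power law in compute C; the exponent below is approximate:

```latex
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}, \qquad \alpha_C \approx 0.05
```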
B
Yeah — it's the predictive loss on the training data. In fact, a better, more intuitive quantity is perplexity, and I'll clarify: perplexity is, you know, your uncertainty about the next word. You've just seen a bunch of words in a piece of text and you're trying to predict the next word; if your uncertainty about the next word is like throwing a k-sided die, then k is the perplexity. It's 2 to the entropy of the distribution over the next word. If you just use unigram statistics, perplexity is about 962 — so if you just guess what a word is going to be, it's about 1 in 962. Bigrams give you 170; trigrams, 109. GPT-2 broke records in perplexity about a year or so ago at 35.8, and GPT-3 now sets a new record: twenty point five. It's not exactly clear what human perplexity is.
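Perplexity as he defines it — 2 to the entropy of the next-word distribution, so a model exactly as uncertain as a fair k-sided die has perplexity k:

```latex
\mathrm{PP} = 2^{H}, \qquad
H = -\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w_i \mid w_1,\dots,w_{i-1})
```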
People have written, you know, elegant essays on that topic. With GPT-3 they generate some essays, take human-written essays, and then ask a bunch of humans to try to figure out which ones were computer-generated and which were human-written. So it's kind of a variant of the Turing test where there's no interaction — the system just generated some text. They've been finding that the ability of humans to tell what was fake and what was real has been steadily decreasing, and with GPT-3 it's essentially at 50/50.
Now it's like fifty-two percent that they can detect the machine, so we're essentially at the point where these machines can generate text that you can't distinguish from human-generated text.
Rich Sutton wrote an essay he called The Bitter Lesson, and his bitter lesson was that simple AI that leverages compute power will eventually beat out clever AI built using human knowledge. He argued that that was true with chess: people had all these clever, complicated chess-playing programs, and then Deep Blue basically just did a fairly brute-force search and beat them. More recently we've had AlphaGo beat Go using, you know, self-play with some simple learning and some search. And these language models are really showing the same thing — first with n-grams and now especially with these transformer models, which seem to just do better and better and better as they get bigger. And there are a bunch of other recent experiments on different kinds of AI tasks where it looks like scaling with compute power is just, you know, beating everything. So that's sort of a tragic conclusion for those of us who really love to deeply understand things, but...
...maybe it's the reality, and I think OpenAI has taken that to heart. Their chief scientist was recently quoted saying: give GPT the compute, give it the data, and it will do amazing things. So I'm guessing their business model involves just scaling things up, and it sounds like Google is doing that too.

There's another way to look at these models to try to understand what they're doing, which is word embeddings. Take the word "bank": it could be a bank where you put your money, or it could be a river bank, the side of a river — so knowing the context you're in makes a big difference. After the initial things like word2vec, GloVe and fastText, which were static word embeddings, a whole bunch of models came out that used the context of a word: they still encoded a word as a vector, but within a given context. And more recently they've been encoding sentences as vectors, documents as vectors, books as vectors...
F
I have a question for you. When you talk about these as vectors, encoding the context and all that, is there any sense in these models of temporal structure — any difference between the last word and the word before that and the word before that — or is it all kind of collapsed into a spatial representation?
B
Yeah, that's a really critical question, and they've gone all over the place. A lot of the early models used what they call bag of words: they treat a sentence just as whatever words are in it and completely ignore the order. The self-attention operation by itself is permutation invariant — it doesn't know anything about the ordering of the words — but what they've done is feed position information in along with each token, because, you know, "the man hit the dog" is very different from "the dog hit the man". So these more modern models really are using word order, but it comes in in a sort of implicit way that's maybe a little different than you might have thought. Here's the early shock of the static word models like word2vec and GloVe: not only did they map similar-meaning words to similar locations, they had these equations, and the most famous one was king minus man plus woman equals queen.
So if you took the vector for the word king, subtracted from it the vector for the word man, and added the vector for the word woman, you would get the vector for the word queen. And in fact there are a whole bunch of pairs with a masculine/feminine relationship — oh, I don't know, brother/sister, grandfather/grandmother — that are all related by vectors which are quite similar. And a whole bunch of relationships, like between a country and its capital — France and Paris, Italy and Rome — give vectors that are all about the same; likewise an element and its symbol, or a company and its product. So a whole lot of fairly coarse relational information also gets encoded in these models, which was sort of the first hint that maybe some semantics really is getting mapped into these things.
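A toy sketch of that vector arithmetic. The 3-dimensional embeddings below are made up for illustration (real word2vec/GloVe vectors have hundreds of dimensions), but the nearest-by-cosine lookup is the standard way these analogies are evaluated:

```python
import numpy as np

# Made-up 3-d vectors standing in for real word2vec/GloVe embeddings.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "paris": np.array([0.1, 0.9, 0.5]),
}

def nearest(v, exclude=()):
    """Vocabulary word whose embedding is closest to v by cosine."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], v))

v = emb["king"] - emb["man"] + emb["woman"]
print(nearest(v, exclude={"king", "man", "woman"}))   # queen
```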
B
Well, this chart is not from GPT-3 — I think it is from word2vec, which is that coarse, static thing. But even GPT-3 is trained on web data, and that's one of the challenges in terms of social implications, if these models start getting into very important, central things.
So, getting to your point about temporal information — and language is all about much more complex structures at the next level — it turns out that for these BERT-like models, if you look at the embeddings they create, they create trees in the vector space that correspond to the linguistic parse tree. So even though nothing from classical linguistics is built into these models, it turns out they actually rediscover a number of things. There's a whole field now...
...that people are calling BERTology. BERT was sort of the first and most prominent of these models, and people have been analyzing what these networks are learning — looking in particular, as you go down through the layers, at what kind of information the vectors in the deeper and deeper layers are encoding.
What this paper discovered is that the stages of a classical natural language processing pipeline, which used to be human-built, are sort of showing up in the successive, deeper layers of BERT. Early layers discover the parts of speech, what constituents there are, what the dependencies between things are, what the entities in a sentence are; then the semantic role labeling of entities, coreference between pronouns and nouns, semantic proto-roles, and then relation classification. It's kind of remarkable that an end-to-end, backprop-trained model appears to be rediscovering some of what classical linguistics has tried to figure out. And I think this speaks to something really fundamental in linguistics, their notion of semantics: I would say the dominant notion is Montague semantics, which tries to map structures in language into some form of logical formalism — they used a typed lambda calculus, which is what Montague was promoting.
What's going on now is sometimes called distributional semantics, and one of the early advocates was this gentleman John Firth, whose most famous phrase was: "You shall know a word by the company it keeps." His idea is that the semantics of a word is the probability distribution over the words of the contexts in which it can appear. So you can figure out that dog and cat are similar because dog and cat appear in similar sentences — you know, "the man took the dog for a walk."
I'll talk a little more about this later, but I'll say it now: I used to have a great thought experiment, which was, let's say you're a really good pattern recognizer, a statistical learner, and you just watch TV all day. Can you learn about the world from that? It seemed to me that you could: basically, you'd very quickly discover the notion of frames — you know, TV frames — and what pixels are near one another in the picture; then pretty soon you'd realize that, oh, there are blobs of color that tend to move together, and you'd discover objects; then you'd probably build 3D models of those too, as the best explanations for objects rotating; and then you'd maybe discover the laws of physics.
You'd hear these sounds and connect them to what you see. Some people argued that with text alone there'd be nothing you could do; this, I think, is arguing the opposite. Maybe I'll skip ahead to a slide showing the kinds of real-world semantics that are implicit in GPT — what it can discover, what it knows. After being trained just on web text, you can probe it by asking questions. Let's say: does it know who the president of the United States is?
You could say "The President of the United States is ___" and it's supposed to fill that in, and it would say the highest-probability word is Trump, or so on. You can use that kind of statement to extract the knowledge that's sort of in these systems, and using that, people have found that it knows all the US presidents and Russian leaders in temporal order, and it knows the latitude and longitude of cities in the United States and Europe and their relative distances.
It knows the relative sizes of many objects, like cars, elephants, humans and houses — you can test that with phrases like "a human is bigger than an elephant" versus "an elephant is bigger than a human" and ask which has higher probability. It knows what animals are dangerous, what objects are dangerous, how smart different animals are, what clothing is appropriate for different age groups, emotional states, costs and weather conditions, the qualities of mythological creatures, physical properties of objects like rigidity, strength and transparency, and whole-part relations.
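A hedged sketch of that probing technique. GPT-3 itself is only reachable through OpenAI's API, so this uses the publicly available GPT-2 through Hugging Face transformers to compare which of two phrasings the model finds more probable:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_likelihood(text):
    """Approximate total log-probability the model assigns to a sentence
    (mean token loss times length; fine for comparing two phrasings)."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return -loss.item() * ids.shape[1]

a = "An elephant is bigger than a human."
b = "A human is bigger than an elephant."
print(a if log_likelihood(a) > log_likelihood(b) else b)
```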
It knows that the hand is part of an arm and an arm is part of the body; countries and cities, their capitals, their gross domestic products, their internet usage — all that kind of stuff. Some of this information is explicitly in the training set, but some of it sort of emerges from it; somehow these models seem to be capturing this kind of semantic information. And once you start seeing that, you can imagine that a smart learner just listening to the radio could figure out what the objects are, what the categories of those objects are, which objects are most similar to one another, what the actions tend to be, what simplified intuitive laws of physics might be, what an intuitive sort of psychology might be. You could probably build up a pretty good model of the world just from sufficient amounts of text. So that's my current stance, and I'm happy to hear arguments against it if anybody disagrees.
B
I think certainly interacting is very helpful — you can learn things much, much more quickly. So here's my little chart on exactly that. Biological organisms interact with the world, and that lets them probe aspects they don't know well: they try something, and I think they're very driven by when mistakes are made. You push on something, you predict something's going to happen, and something else happens — then you're really interested.
You start, you know, playing around with it, and so I think it's very helpful for building models of the world. One step removed from that is a simulator, like a video game of the world, and that's probably as good. I now think you don't really need that interaction — it speeds things up, but if you have a sufficient amount of raw video and you just build a good statistical model of it, you can build up those kinds of things.
So related to that is this other piece, which some people are calling the GPT-3 type of interaction. The old-style neural nets were: you design a neural net for a specific task. Say you want to do sentiment analysis — you want to look at movie reviews on Netflix and decide, is this a positive review or a negative review? The way you used to do that is you take a bunch of reviews and have humans label them: yeah, this one's positive, this one's negative.
You build a special-purpose neural net and train it on that task. Then they got the idea — ULMFiT, I think, was the thing that shifted people — of a transfer learning view, where you train a big, big model on, say, an unsupervised learning task, and then, once you've got a good model of language, you put a little teeny layer on top that's specific to the particular task...
...you care about, say sentiment, and then you train just that extra little bit. That has been the paradigm over the last few years. So the first phase, maybe, would be the old days, software 1.0, where you design an algorithm to do something; software 2.0 was you design a neural net and train it to do something; and the new paradigm — software 3.0 — is you give a prompt to one of these big pre-trained models.
The weights of the model are fixed once it's trained. On the other hand, as you run it, it's doing this attention thing, so the attention weights are changing all the time, and some people think of those attention weights as a kind of fast weight — a weight that's set dynamically depending on what the input is. In that way you might think of it as doing a kind of neural training, but during inference. In their tests they did three kinds of evaluation, which they called zero-shot, one-shot and few-shot.
Let's say they want to translate from English to French. Notice it was never taught translation; it's just that there's some webpage out there that happens to say, oh, the translation of this word in French is this in English, and that's all it's using to figure out what translation means. The zero-shot version is: you just say "Translate English to French:", then "cheese" and this little arrow symbol, and you hope it figures out to say "fromage". And for many of these tasks, remarkably often it does.
You can give it a little more context for what you mean by giving it one example: you say "Translate English to French: sea otter => loutre de mer; cheese => ?" That helps it a little bit. Or you can give it what they call few-shot examples: they put in enough examples to fill up the context window, which is 2048 tokens, and they say that's typically somewhere between 10 and 100 examples.
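A sketch of the prompt formats he's describing; the arrow separator and the sea otter pair mirror the translation figures in the GPT-3 paper, while the helper function itself is just illustrative:

```python
def few_shot_prompt(task, pairs, query):
    """Build the zero/one/few-shot prompt format from the paper:
    a task description, zero or more solved examples, then the query."""
    lines = [task]
    lines += [f"{src} => {tgt}" for src, tgt in pairs]
    lines.append(f"{query} =>")   # the model continues from here
    return "\n".join(lines)

# One-shot version of the talk's translation example:
print(few_shot_prompt("Translate English to French:",
                      [("sea otter", "loutre de mer")], "cheese"))
# Translate English to French:
# sea otter => loutre de mer
# cheese =>
```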
It can do quite a credible job given just one example, and as you give it more examples it does better and better. So it's a very weird way of programming — and yet that's the framing in which they do everything here. They evaluate on a bunch of different tasks, something like 25 or so, and this is roughly how it does overall on zero-shot, one-shot and few-shot.
So having more examples in the input definitely helps, and having bigger models helps a lot. There are 42 different benchmarks that they test on, and many of them are standard, really hard benchmarks. SuperGLUE is a standard natural language processing benchmark which requires all kinds of, you know, figuring out which sentences follow from other sentences and which things entail other things. These are not trivial tasks.
B
That's a good question — I'm not really sure. For something like this one: once a human gets that, oh, they're putting asterisks in the middle of the word and you're supposed to remove the asterisks, the human would very quickly go to 100%. So that shows these things are not operating the same way humans are; they're not really figuring out...
B
You
know
the
the
abstract
notion
of
what
the
intention
is
here,
they're
doing
something
in
between,
and
in
fact
my
final
slide
here
is
talking
about
Kennan's
Thinking,
Fast
and
Slow
type,
1
versus
type
2
thinking.
So
so
in
human
thinking.
There
seems
to
be
at
least
two
forms
of
thinking,
one
which
is
unconscious
and
rapid,
which
they
call
type
1
and
1,
which
is
deliberative,
conscious
and
involves
reasoning
which
they
call
type
2
seems
to
me
and
I.
...think other people are starting to think this too — that deep learning in general, and these kinds of models specifically, do a really good job of type 1: very rapid, but not multi-step. For many of these tasks you would do a lot better if you had real multi-step reasoning, and so I think where AI is going is to take this kind of immediate model and combine it with planning-and-reasoning types of systems.

I don't think it's really planning those essays, in the sense of sequentially considering different things it might end up on. I think it has built in some kind of high-level semantic knowledge; it figures out what semantics it wants the essay to have and then sort of lets it play out. It's like when people speak: most speech is also not planned — I'm not planning out what I'm going to say in the next sentence; I'm letting it emerge from a structure that's there. Whereas a really good essayist will figure out: oh, I want to have this emotional impact, and to do that I need to go here. So yeah, I think that's a good point. You'll also see that some of the training examples have a sequential character to them, so it's an interesting question. Oh — in particular, this one.
So one of the controversial things — they were arguing about this with GPT-2 — is that it can do certain forms of arithmetic. If you ask GPT what 22 plus 33 is, it'll give you an answer; sometimes it gets the answer right and sometimes not. People were hypothesizing that basically, if it saw a particular problem somewhere on the internet, it would remember it and give you the answer for that.
If it hadn't seen that problem, it would generalize: oh, they're showing me some numbers, and I know what a number is, so I want to generate a number of the right form — but it wouldn't really know what addition was. What's remarkable is that they tested this on two-digit addition and subtraction, three-digit, four-digit and five-digit. Even the big models aren't quite doing it completely right, so they're not really learning the full addition algorithm.
F
On this one — I mean, it's striking how you go from thirteen billion parameters to 175 billion in order to get reasonable results. What is the thinking in this community? The scale of resources required for it to actually solve any real-world problem seems, you know, way out there to me. Is the view that Moore's law will bail us out?
B
I think that's really a central question. We barely have any understanding of what the representation of operations inside these models is, and I don't think anybody really knows what their computational capacity is either. I started looking at the self-attention operation, and I believe that if you handcrafted it, it would be sufficient — I think it's computationally universal if you use it in the right way. But whether backprop learning through self-attention can learn, say, a real addition algorithm...
...it's not even clear. Somebody took GPT-2 and trained it on chess games, and they had it playing chess — not very good chess, but it could play something of a game of chess. So it's weird: these strange models are sort of halfway between general-purpose neural nets and things with something of a computational element in them. I'm guessing these are very early days, and there are going to be new variants of these that will be much more applicable, especially to this kind of task.
B
Oh yeah, I definitely think it's generalizing; it's not just memorization. Especially when you get up to five-digit addition: very few of the possible five-digit addition problems could actually be out there on the internet. But it may be doing it in a fairly simple way — like doing the two-digit addition on the first two digits — combining knowledge it has about the pieces in some way. In some sense, somebody should really nail this down.
H
Imagine we do some grounding — map language representations to features in the visual domain. If we can do that, we could maybe even reason about the visual domain in the space of words, and figure out things in the visual domain just by reasoning about words, which should be similar to what we humans do, right? Like, we know a horse has four legs and a tail, so if I can identify four legs and a tail, and I know how to reason...
B
We have a little reading group, and we just read papers on combining language with reinforcement learning — often there's language involved in tasks where you're trying to plan certain activities. But yeah, a lot of people are looking at this. You know, in some sense image processing and video processing moved ahead of natural language for a number of years there, and so...
...you may be able to actually get an alignment between language and vision — ground the language in physical reality — without ever having trained them together. I don't know of anybody having done that, and clearly I think it would be better to train them together, so I expect that a version or two down the line — and probably Google is doing this — they'll just run every YouTube video through it, where you have both language and video, and build models that have both vision and language in them.
At
the
same
time,
clearly,
I
think
that's
that's
the
next
step
and
whether
the
transformer
thing
and
self
attention
is
sufficient.
You
know
the
the
the
image
transformer
that
openly
I,
just
just
released
like
two
weeks
ago,
is
really
interesting
that
regard
because
they
they
use
exactly
the
same
model
for
the
in
the
image
domain
and
it
seems
to
be
capturing
visual
data
pretty
well.
So
maybe
that
is
a
sort
of
Universal
learning.
B
Where I think it's really going to help is this: the vision models are doing pretty well when you have known classes of objects, but there's a lot more, deeper semantics in natural language right now. If you could tie in all of that knowledge from natural language — like, it's not just "a person is smiling" but "a person is joyous because they were just given a gift" — that kind of thing is hard to get at in a pure visual way.
I think we're going to get real richness of semantics on both ends. So, word scrambles — another simple little thing: they take words and scramble the letters, and it's supposed to unscramble them; a similar kind of phenomenon. They have a whole bunch of examples like that. So, because this thing seems to be showing characteristics that were unexpected, there's a lot of controversy, online and in discussions, about how powerful this thing really is. Is it really just learning things?
Is it just, you know, scrambling up the internet, remembering it and spewing it back out, or is it doing something more? Opposing factions are forming. One blogger, who goes by the name Gwern, did a lot of experiments with GPT-2 and has recently been doing a bunch with GPT-3.
He's been arguing that this kind of scaling could get somewhere dramatic in a matter of years rather than a long time. On the opposite side, here's a well-known linguist, Emily Bender, who does a lot of work on linguistic semantics; she just came out with a paper arguing that this kind of model is, in principle, not able to actually capture real meaning. A typical sentence from the paper: "We argue that the language modeling task, because it only uses form as training data, cannot in principle lead to learning of meaning."
So far nobody's really been able to do that with text, that I'm aware of. And yet with GPT-3... So he did a bunch of experiments. He said: summarize the Harry Potter story — which I guess it has read, or knows about, from online — in the style of different authors. Here's Harry Potter in the style of Ernest Hemingway. He started it off — the bolded text is what he gave it, and the rest is what it generated: "It was a cold day on Privet Drive."
B
If
a
child
cried
Harry
felt
nothing.
He
was
drier
than
dust.
He
had
been
silent
too
long.
He
had
not
felt
loved.
He
had
scarcely
felt
hate,
yet
the
Dementors
kiss
killed
nothing
death
didn't
leave
him
less
dead
than
he
had
been
a
second
before
and
because
I
was
pretty
good
and
then
he
asked
him
to
do
the
same
thing
summarize
Harry
Potter
in
the
style
of
Jane,
Austen
and
it
generated,
is
the
truth,
universally
acknowledged
that
a
broken
Harry
is
in
want
of
a
book
this.
A
In this case — I mean, mechanistically, it's given the bolded text as a sequence of these tokens, and then it's just asked to predict the next token; then you include that next token and predict the next one, and so on.
B
You know, when you generate text from these models, there are various ways to do it. The model gives you a probability distribution over the next word. If you always take the highest-probability word, that can sometimes generate aberrant things — it can generate sentences that cycle. So they often do what they call a beam search, where maybe you take the ten highest-probability words and track them along a little, and you sort of find the highest-probability sequence. I'm not exactly sure whether they're doing anything like that here.
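A sketch contrasting the two decoding strategies he mentions — greedy argmax, which can loop, versus a small beam search that tracks the k most probable continuations. The `step` function standing in for the language model's forward pass is hypothetical:

```python
import numpy as np

def greedy(step, prompt, n):
    """Always take the argmax token; prone to repetitive loops."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(int(np.argmax(step(seq))))
    return seq

def beam_search(step, prompt, n, k=10):
    """Track the k highest log-probability continuations at each step."""
    beams = [(0.0, list(prompt))]
    for _ in range(n):
        candidates = []
        for logp_so_far, seq in beams:
            logp = np.log(step(seq))
            for tok in np.argsort(logp)[-k:]:
                candidates.append((logp_so_far + logp[tok], seq + [int(tok)]))
        beams = sorted(candidates, key=lambda c: c[0])[-k:]
    return max(beams, key=lambda c: c[0])[1]

# `step(seq)` stands in for one forward pass of the language model,
# returning P(next token | seq); uniform placeholder here (hypothetical).
step = lambda seq: np.ones(50) / 50
print(greedy(step, [1, 2], n=3), beam_search(step, [1, 2], n=3))
```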
G
Well, a variation on that is to look at, say, the two highest probabilities, and if they're nearly co-equal in magnitude, note that there's an ambiguity — and maybe that would be enough to say: here are two possibilities; which one do you mean? You can think of it as incorporating an exploration aspect, going out to ask a human and basically saying: okay, this is ambiguous to me.
B
So I think there are a gazillion ways of using this kind of tool while including human input, you know, in actually doing something useful. Maybe you don't want it generating news stories autonomously; you could have it generate the story, but it's probably better, like you say, to have a human in there curating and guiding it in some way. And the best way to interact with these things in terms of products...
...I think is totally unclear at this point, so I think there's a whole new kind of technology that could emerge from this. I'll give you just one more example, because it's super amusing: there are a bunch of critics of this. Doug Hofstadter — actually an old friend of mine, who wrote Gödel, Escher, Bach — is a real champion of understanding deeply how language and the mind work, and I think he rails against these kinds of statistical models; he really hates them. And Gary...
...Marcus is somebody else who's written a whole bunch of things criticizing these deep learning models as basically missing some of the essence. I don't know if they wrote this together or whether it was just something the person testing it did, but someone started off an essay about how AIs could never be creative and never have good language, and then asked GPT-3 to continue it. The continuation was: "They will never be able to truly understand the underlying meaning of things."
"Additionally, we still don't have programs that can truly imagine or create new things all by themselves in a natural way, but rather just brute-force through random strings of garbage until they happen to hit upon the correct solution to the current problem. We still lack a Rosetta Stone for the universe and don't really know what things like light, sound, matter, gravity, life, free will and consciousness are." That was generated.
F
A question: we have a natural language partner who has done a lot of work on this using sparse representations, and that's a lot of what we're doing in our own work now, with sparsity. Have these guys looked at all at sparse representations as part of what they're doing — to get better context or better performance or anything that you know of?
B
Well — no, I don't know of much work on sparsity in GPT-3. Inference on GPT is very rapid — a few milliseconds kind of thing — so I'm not quite sure. Also, the GPUs keep getting better; Nvidia keeps cranking out more and more powerful GPUs. And one of the reasons people are so excited about the transformer architecture is that it maps pretty well onto a GPU — part of the reason they chose this attention mechanism is that you can run it efficiently there.
It seems to be much more flexible than that. If you go to this website here — this is Gwern — he did a bunch of weird experiments; in particular, he asked it to generate puns, and it creates its own words. He does a lot of that. Who knows exactly how it's doing it; what it's doing is, you know, funny — it's sort of like interacting with something different that we haven't seen before.
A
There are a few examples, even in their paper, where they define a nonsense word right in the sequence and then ask it to generate a sentence using it. They have an example like: "A Gigamuru is a type of Japanese musical instrument. An example of a sentence that uses the word Gigamuru is:" — and then it has to fill in, and it fills in something like: "I have a Gigamuru that my uncle gave me as a gift. I love to play it at home."
H
You write a few sentences with the word and you ask for the dictionary definition, and it will give a dictionary definition that makes sense; or you can go the other way, like you're saying — you give the definition and then ask it to make up a sentence, and it does. It works both ways.
B
Nobody would have guessed that this kind of system could do that, and so there's still a big gap in my own understanding — how on earth is it doing that? I'm hoping people will start probing this: take some tasks like that and figure out exactly what kind of knowledge it's getting at each layer, and tease apart how it works.
H
One thing I can see in the near future: if you ground these language models in images, you can probably create movies or games with these generative models. You just write the text, and you generate the imagery from the text, right? It seems like a very good application — the game industry is huge, the movie industry is huge — and it could save a lot of cost. Yeah.
B
I mean, NVIDIA has been doing some remarkable stuff where you draw a little sketch of something and it makes a photorealistic image of that thing. And then, like you're saying, I've seen a little of people taking text descriptions and generating images and videos from them. What does that mean — that future movies will be created on the fly? Maybe you give it a topic and it generates a movie for you. Remarkable.
B
So BERT is a different architecture. GPT-3 is an autoregressive language model: its basic operation is to give probabilities for the next word, and it's trained to predict the next word. BERT is better thought of as a denoising autoencoder: the way they train it is they give it sentences and block out 15% of the words; it has to compress the input into a representation and then generate the original sentence. That's its training model.
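A minimal sketch of that masking objective. Real BERT also sometimes substitutes random tokens or leaves the picked tokens unchanged, and it works on subword pieces, so this is only the idea in outline:

```python
import random

def mask_for_bert(tokens, rate=0.15, mask="[MASK]"):
    """Hide roughly 15% of the tokens; the model must reconstruct them
    (the denoising objective described above). Simplified relative to
    the real BERT recipe."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < rate:
            targets[i] = tok          # what the model should predict
            masked.append(mask)
        else:
            masked.append(tok)
    return masked, targets

print(mask_for_bert("the man took the dog for a walk".split()))
```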
It's a somewhat different framing, and one advantage of BERT is that it can use the context both before and after a word, which I think gives it more power in terms of learning. But it's harder to train, and most uses of BERT that I've seen have all been: you train the language model, and then you have to train it again on the specific task. I haven't seen people doing the kind of thing they're doing with GPT-3.
Maybe they are and I'm just not aware of it. My sense is that the BERT architecture in general should be more efficient in its use of data, but it's more complicated — the BERT training is pretty tricky — whereas GPT-3 is sort of the simplest thing you can imagine, in a way, and it seems to be working well. So I'm not sure you get a lot of benefit from the BERT approach, but I'm sure both lines are going to keep moving forward.
B
Interesting — because Chinese, I think, is different; I've heard there are things you can say in Chinese that don't translate well into English. I used to like to look at the Dao De Jing and its, like, 80 translations — you take multiple translations and try to see the corresponding text...
B
One direction is using reinforcement learning to improve language generation and tie it more closely to human needs. One area is chatbots: chatbots are still pretty bad. Google came out with something called Meena, which tried to use some of these transformer ideas to make a better chatbot, and I've heard it's pretty good — but I think it's still not really something you'd want to talk to for very long. They tend to lose context; I think after about 15 interactions they kind of forget what you talked about before.
They don't have a very good model of the user; they're not very empathetic; they don't really know what you know and don't know. All of that, I think, reinforcement learning could help a lot with, so I would love to see a much better integration of reinforcement learning — really planning how to interact with a person or with another system — together with the rich kind of semantic knowledge these models seem to have.
A
So I keep trying to think about what the limits of this kind of paradigm are. You could argue that maybe GPT-3 has been trained on pretty much all of the text that's on the web, or close to it. Is that sufficient to, you know, pass the Turing test? Is it sufficient to create something that can generalize and really seem intelligent? Not that the Turing test is a good test of intelligence, but, you know, to the extent that you can interact with something in text.
F
Where my head was going with that was novelty and inference: can it really infer things it's not been trained on? You know, one of the things Jeff talks about a lot is staplers. Okay, you learn how a stapler works — maybe it knows how a stapler works. Does it know that you shouldn't staple your hand, put a staple into your hand? How is it going to know that? You, as a human, can infer that; I would expect it wouldn't be able to make that inference.
B
That's an interesting point. My own sense is that multistage reasoning is its weakness: it does the kinds of reasoning it finds directly, and then slight generalizations of those. Like the staple in your hand — I could imagine it would get the idea that sticking a pointy thing through your body part is painful and that's not good, and that a stapler might be like that. So maybe it could do that; but certainly there are other kinds of problems that require two or three steps of inference...
...you know, take something like: it's bad to stab yourself; that causes damage to your tissue; staples are pointy; this is the way a stapler works. You might be able to find a chain of statements that leads to the conclusion, but I'm guessing it probably wouldn't do it on its own in its current state. Combining it with something that has a bit more reasoning and planning — a multi-step thing — might get you to that stage.
B
There's an OpenAI website where you can apply for access to their API, and I think they're eventually intending to sell or rent pay-per-use access. I've heard a little that Microsoft and OpenAI are working together, so I'm guessing Azure will get a version of this, and it sounds like Google's going to do it too. So I think in a year or two there are going to be a bunch of models like this out there. You know, the code for this is pretty generic.
H
To go back to the other question: it seems to me that if you combine this with a reward-based model, you can have this generative model generate an action for the next time step, and your reward-based model learns a policy, which is just the sequence of actions; then you can do planning over longer sequences of time based on this reward. So you could actually use it as a component — as the model of the world in the reasoning process, I think.
B
That's very interesting. In some sense, I think where reinforcement learning is going is that it needs to deal with hierarchy — hierarchical plans, dealing with sub-plans — and language sort of has all of that in it, so it really seems like they could benefit one another. I would love to see that. One way of thinking of GPT-3 is as a black box: you can give it phrases and it'll tell you the probabilities of extensions, and you can compare phrases to see which is more likely.
C
It gets back to the whole sensorimotor question from before. General intelligence in humans is very much a sensorimotor problem. You know, just set the table for dinner, right? It's a pretty simple task, and it's a sensorimotor problem: where do you put the plate? You do this; where do you do that? What's on the table? Do you have to clean it off? There are a billion things...
...involved in doing something very simple like that. So much of what we do in the world is essentially a novel problem, and much of it is dealing with things that are statistically novel and not represented anywhere. The current arrangement of plates on my dinner table — where the potatoes are versus the green beans versus the pizza, whatever — changes tonight, and I have to build a model of it very rapidly when I walk into the room; I have to update my model when I move the plates around on the table.
So these are not things I will find statistically described anywhere, and they're very interactive. There's a very, very fast temporal modification of the models we have of the world that is not described anywhere; we have to experience things and learn them, and we have to learn them through sensorimotor interaction. I can't learn them through language — I have to learn them by walking into the room, seeing things, picking things up, and so on. And the tasks that we need for general intelligence...
...are also very sensorimotor related. So when we look at these statistical models of the world — and that's what the models you're talking about really are, statistical models of the world; we can argue how good they are, but they're statistical models of the world, and I don't think there's any debate about that.
They have very real shortcomings when it comes to the real-world behavior of humans or intelligent agents. The tasks we've seen them applied to — which are really impressive; I agree, they're very impressive results — are only the tasks that can be done with statistical models of the world. In some sense, they don't apply to things that require tremendous flexibility and very rapid learning about the world.
B
Yeah — great, great points. Though, to push back a little: imagine, from a natural language description of tables and plates and how you can push forks around, that you built a little physics simulator, with a simulated dining table and simulated forks. You'd have to be motivated to do things in that world, but you could then operate in that simulated world and learn the kinds of actions you might take.
C
That's the question. I've always believed that we don't learn about the world through language: we learn certain things through language, but most of what we know about the world we learn through observation of other sorts — auditory, tactile, visual. I have an example. I use my bicycle every day, and it makes sounds, right? It makes various sounds, and I know...
...these sounds. I know the sound it makes when I click down the kickstand; I know the sound it makes when I'm turning here; I know the sound the chain makes. All these little sounds — I know about them, and they're in my model of the bicycle. I don't have words to describe those sounds; I don't even have words for the things making the sounds. I kind of know that, yeah, this thing does this — but I may never know what the word for it is.
C
Maybe someone else does, I don't know. But I don't learn about the world through language; I learn some things that way, and obviously we communicate a lot of knowledge that way. But the point is, and I think this could be critical for these models, that you can only capture some parts of the world through language. There is knowledge that we individually have that is not expressed in language. So that's a sort of limit to the modality.
C
It's not necessarily an inherent limit of how these systems work in general; that's why I asked you earlier how these models apply to other things. But from my perspective, if we really want to create artificial general intelligence, we have to have these systems applied to sensorimotor systems: systems with actual input senses, with arrangements of sensors, you know, visual inputs. These systems have to be active agents in the world, and they also have to learn very, very rapidly.
C
It cannot be based on statistics over lots of data. You can only learn so much about the world that way; much of what we learn about the world is not like that. It's just like: hey, this is something new, I never saw this before. This is a new arrangement of things, or this is a new behavior that somebody exhibited today.
A
Yes, but you wouldn't find every possible combination of three dots, or let's say four dots, against some background, or a pentagon. You have to kind of understand the relations in order to know it's an equilateral triangle. You have to understand the relations between those three dots, not so much the dots themselves explicitly. So it's, it's...
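As a worked example of relations mattering more than the dots themselves, here is a small sketch (my own illustration, not from the discussion) that recognizes an equilateral triangle purely from pairwise distances, so any translated or rotated copy of the same three dots still passes.

```python
# Sketch: recognize an equilateral triangle from the *relations* between
# three dots (their pairwise distances), not their absolute positions.
import math

def is_equilateral(p1, p2, p3, tol=1e-6):
    """True if the three points form an (approximately) equilateral triangle."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    d12, d23, d31 = dist(p1, p2), dist(p2, p3), dist(p3, p1)
    avg = (d12 + d23 + d31) / 3
    return avg > 0 and all(abs(d - avg) <= tol * avg for d in (d12, d23, d31))

# Same relations, different absolute positions: both are equilateral.
print(is_equilateral((0, 0), (1, 0), (0.5, math.sqrt(3) / 2)))      # True
print(is_equilateral((5, 5), (6, 5), (5.5, 5 + math.sqrt(3) / 2)))  # True
print(is_equilateral((0, 0), (2, 0), (1, 0.2)))                     # False
```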
C
You know, I don't know; I was trying to pick examples where the statistics of the models are not generally reliable, where things out there are changing all the time. There are different arrangements of things that people have never described in language, places people have never been, that kind of thing.
C
There's so much stuff we know that really doesn't exist in language at all. Again, that argument is not a fundamental argument; it's more of a practical argument about whether language is sufficient. But there is a huge number of things we know that we do not have words for, or maybe the words aren't commonly shared. Maybe someone knows what the little doohickey on that little bike part is called, but I don't.
C
You could, but that's my point. My point is that my knowledge of the world is not stored in language. My model of the world is stored in a model that I've created from my sensorimotor interactions, and I can apply language to that model. I can try to explain how I know what a stapler does, but I didn't learn about staplers through language. I understand a stapler by picking it up, opening it, flipping it around, and saying: hey, look, this little plate on the bottom can reverse and make the staples go this way versus that way.
C
Does everyone know that? I don't know. And I don't have a word for that thing; I don't even have a word for making the staples bend outward versus inward. There probably is a word, but I don't know it. The point is, there's so much I know about the world, learned through experience, that may never be reflected in language, and that's still not the way my brain works. My brain doesn't work on a list of language; I apply language to the models.
C
The models exist as some recreation, or a storage, of what I experience, and in our work here at Numenta we understand how that happens. Then I can say: okay, given that my model of a stapler exists, I can try to apply words to it. I can ask, well, how would I describe that part, and how would I describe this action? But the knowledge about the action and the knowledge of the parts doesn't exist in words. It's something I apply language to later, yeah.
A
I want to distinguish between what I think are two different things. One is the modality of the sense: we have, say, language versus vision versus audition and so on. But regardless of the modality, there's an independent question, which is whether learning is purely passive, through statistics, or an active sensorimotor loop. I think those are two orthogonal things.
C
We know, by the way, for certain that the brain learns through sensorimotor interactions; I mean, that's not debatable. And we also know how it learns. We know that it uses reference frames for storing knowledge: as you interact with the world, the brain keeps track of where everything is in a reference frame, very much like an engineering CAD program or something like that. That's what's going on in the brain. So knowledge is stored in reference frames, and our models are stored in reference frames.
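Here is a very rough sketch of storing knowledge in an object-centric reference frame. This is purely my own illustration of the idea, not Numenta's actual model; the ReferenceFrame class and its locations and features are invented for the example.

```python
# Rough sketch: knowledge stored in an object-centric reference frame,
# i.e., a map from locations *on the object* to features, independent of
# where the object sits in the room. Illustrative only.

class ReferenceFrame:
    def __init__(self):
        self.features = {}  # location (x, y, z) -> feature name

    def store(self, location, feature):
        self.features[location] = feature

    def predict(self, location):
        """What feature do we expect if we move our sensor to `location`?"""
        return self.features.get(location, "unknown")

# Build a model of a stapler by sensing features at object-relative locations.
stapler = ReferenceFrame()
stapler.store((0, 0, 0), "hinge")
stapler.store((5, 0, 0), "staple outlet")
stapler.store((5, 0, -1), "reversible plate")

# The same predictions hold wherever the stapler is on the table,
# because locations are relative to the object, not to the room.
print(stapler.predict((5, 0, -1)))  # -> "reversible plate"
```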
C
Now the question is: can you get to AGI with a system that doesn't have that? You know, I don't think so. I really don't; I just don't think you can. You can have a lot of impressive stuff, but, as you point out, there are going to be limits to these things. You know, we finished a paragraph from Hofstadter; it's pretty cool. But what's the next paragraph? What's the next book that Hofstadter is going to write? So, you know, your first and second...
B
The whole language system was built on top of that. Language comes from the social side, from trying to expand from individual experience to a social experience, and it doesn't necessarily represent everything. Because it's our shared social mechanism, we can sometimes think it represents more than is actually going on. So it's really interesting to see the evolution of language and how that fits in.
C
It's also interesting because with all these AI systems today, almost every one of them, we interact through language. I mean, if you say, okay, label this picture, well, it gives a word for it, but that word is impoverished. If it says this picture is a cat, does it know that there are cat people and dog people? Does it know that the cat has a heart? Does it know the cat's claws need to be clipped or it'll ruin your furniture?
H
Yeah, but then the second question is: can you generalize to out-of-domain distributions, things you've never seen before? And maybe the answer is that they can't, but the solution they're proposing is: okay, if I can't generalize out of domain, maybe I just put everything in domain. I'm just going to train on everything that's out there, and then nothing's going to be out of domain, right? So...
C
I think the focus on generalization is a little bit of a red herring. I think it's more important to focus on the dynamic learning aspect: learning new things rapidly that didn't exist before, and building models of things that haven't been seen before. Because if you have enough data, you might be able to generalize from it; you might be able to say, here are six trillion things, which one of these applies?
C
There are so many dimensions we can go down here. Let's take the whole dimension of embodiment, right? We physically manipulate the world: we pick things up, we do things, we move around, and so on. How does this apply to that? How do you make a robotic system that can go around and figure out how to put the chain back on the bicycle when it's never been shown that before? One that physically can do that, that can ride along and go, oh, the chain fell off.
C
Even robots today can't do this stuff at all. Now, I'm not saying you can't simulate what humans do; again, I think we can build intelligent robotic systems. What I guess I'm saying is, they can't deal with... you know, you're asking: could it learn to do things just by being shown every possible thing that ever happened?
C
Could it learn to do stuff like that? That's where I gave the example of the plates on the dinner table earlier, because there are no statistics in the world that tell me how the plates are arranged on my dinner table right now. Right now, not in general, not typically, but where the potatoes are on my plate, on my table, right now. So much of what we do in the world is about right now. It's like, I'm sitting here, my coffee is over here,
C
my mouse is over here, and maybe they'll be different in a minute. That kind of interaction with the world is not... I have a model right now in my head of where all these things are, and I can't get that from statistics. You know, I had to learn it really quickly, rapidly, just a second ago. Does that make sense? Yeah.