From YouTube: On Attention (NRM Mar 2, 2020)
Description
Marcus Lewis on attention. He reviews current papers on Transformers and relates them to HTM with Jeff.
Recurrent Models of Visual Attention
https://arxiv.org/abs/1406.6247
Attention Is All You Need
https://arxiv.org/abs/1706.03762
Okay, well, to some extent I'm winging this, because I haven't been living in the world of attention. When I read about this and saw the connection, that was months ago. I thought about how I would present this, how I would sort of free it from the language of mathematics and neural networks that it's written in, and try to discuss it in terms of what it actually says.
Like auto-completing entire stories and writing essays that are almost coherent; it produces a bunch of brilliant sentences. So I guess something that's cool about all this is that the models are clever and they work, which is nice.
To make it clear what's going on here: they have an input image, and they're choosing which part of the image to pass into the model, which is essentially saccading. One thing they do a little bit differently from saccading: if it were truly a copy of that, these boxes would all be the same size, but they're kind of zooming in and zooming out. They're saying, first you might get a broad view, and the second might zoom in on a certain part.
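To make the zooming concrete, here is a minimal NumPy sketch of that multi-resolution "glimpse": crop progressively larger patches around a fixation point, then downsample each to a common size. The function name, the padding, and the nearest-neighbor downsampling are illustrative choices, not the paper's exact implementation; running several of these with independent centers gives the multiple-sensors picture discussed next.

import numpy as np

def glimpse(image, center, size=8, scales=3):
    """Extract `scales` patches centered at `center`, each twice as wide
    as the last, all resized to (size, size). Illustrative sketch."""
    patches = []
    for s in range(scales):
        w = size * (2 ** s)                  # patch width at this scale
        half = w // 2
        y, x = center
        # Pad so crops near the border stay in bounds.
        padded = np.pad(image, half, mode="constant")
        patch = padded[y:y + w, x:x + w]     # centered on (y, x)
        stride = 2 ** s                      # nearest-neighbor downsample
        patches.append(patch[::stride, ::stride])
    return np.stack(patches)                 # (scales, size, size)

img = np.arange(64 * 64, dtype=float).reshape(64, 64)
g = glimpse(img, center=(32, 32))
print(g.shape)  # (3, 8, 8): a zoomed-in view plus broader context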
It evolves toward this thing that works. It starts seeming a little strange over here, but you can see it's a smooth transition. So now imagine just taking this and having multiple of them in parallel, multiple moving sensors. Let's even call these cortical columns if you want, but let me say this.
And repeat: now I'm just going to take that exact thing and put copies next to each other. If you have multiple independent moving sensors, at this point it's a little bit analogous to what we could call a cortical column; this could be a model of a cortical column with independent moving sensors. You can imagine them processing and working like this, with multiple of these in parallel. I don't know that I have much to say about this, because you probably already understand it: it's just multiple of these in parallel.
Well, as you seemed to point out there, if you talk about multiple visual columns, they kind of move together, but here you're showing them moving independently, which is fine; it would be maybe like one column for one finger and one column for another, something like that. I mean, you have them moving independently. So if...
Now, rather than thinking of this as multiple moving sensors, think of it as an array of sensors that is sort of covering the whole image. No movement is occurring anymore, at least for this conversation. Instead, when this column wants to get information about other parts of the image, rather than moving over to it, it uses horizontal connectivity. So, magic.
Yeah, and by the way, this is what our language is always getting at; this is so important in all of these papers: the cortical column is processing input over time. It receives an input, and then from that it sort of decides what to attend to next, or where to get information from next, and it can do that using the large number of horizontal connections going in both directions. It can retrieve information from the other parts of the sensory array, or at least the nearby ones.
Let me say the same thing a different way; let's see if I understand it. Let's say there are three cortical columns, and each one is processing one subset of the image, so they get all of that information, and each of these cortical columns is connected to the others laterally. And if a cortical column now says, oh, I want to get information from, you know, image section number three, it can get inputs from the other cortical columns, but I'm only getting it from the third one, and now I'm going to do another cycle.
That's what they introduced, and they incorporated it into a language model. Wow. So everything I just said before, you can imagine it as a recurrent neural network: all of these, you know, the little G's, the little arrows inside, and these sideways arrows. You can take this and you can convert it into a feed-forward network for a specific number of time steps.
Just to be clear: I said we've just unrolled this, but I am using different learned weights at each of these stages. It's not like I'm sharing the weights from here to here. It is truly converted into a feed-forward network, where all of these connections have different weights. If I were literally unrolling this, you'd use the same weights at every stage.
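Here is a toy NumPy contrast of those two cases: literal unrolling reuses one weight matrix at every step, while the feed-forward conversion gives each stage its own learned weights. The shapes and the tanh nonlinearity are arbitrary illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)

# (1) True unrolling: the SAME weight matrix W is applied at every step.
W = rng.standard_normal((16, 16))
h = x
for _ in range(3):
    h = np.tanh(W @ h)

# (2) Feed-forward conversion: a DIFFERENT learned W_t per stage.
Ws = [rng.standard_normal((16, 16)) for _ in range(3)]
h2 = x
for W_t in Ws:
    h2 = np.tanh(W_t @ h2)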
But otherwise, that is what this is: it's this network where every level of it is receiving input but deciding where to attend to next. Essentially, this one decides, okay, I need to know what's here; this one decides, I need to know what's here. I think it's useful to give a concrete example, and I thought I wasn't going to mention language at all, but it actually provides some really nice toy examples that are useful for thinking about this. So here...
...the point of it, which is that along many of these optional paths, it's routed successfully: once the model has been trained successfully, the word "it" will refer to the correct value; it'll refer to "animal". So the attention has been routed cleverly in some way, so that after five stages of processing you've got a network where the representation for the word "it" is some combination of a couple of things that mainly includes "the animal".
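As a toy illustration of that claim, with invented numbers just to state it concretely: if the trained attention weights leaving "it" concentrate on "animal", then the new representation of "it" is mostly a copy of animal's value vector. The sentence, the one-hot value vectors, and the 0.80 weight are all hypothetical.

import numpy as np

words = ["the", "animal", "was", "tired", "it"]
values = {w: np.eye(5)[i] for i, w in enumerate(words)}  # stand-in value vectors
attn_from_it = np.array([0.05, 0.80, 0.05, 0.05, 0.05])  # hypothetical learned weights

new_it = sum(p * values[w] for p, w in zip(attn_from_it, words))
print(np.round(new_it, 2))  # dominated by the "animal" component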
The network here, yeah, I'm really cutting corners. If I were telling the full story, what they're doing here is translation. So what they're going to do is take this English sentence and move it into some intermediate thing that's not really in any language; that's called the encoder. Then they have a decoder that puts it back into another language, like French or something like that.
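A minimal sketch of that encoder-decoder shape, using PyTorch's built-in nn.Transformer. A real translation model adds token embeddings, positional encodings, and a projection onto the output vocabulary; all the sizes here are arbitrary illustrative choices.

import torch
import torch.nn as nn

d_model = 64
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)

src = torch.rand(10, 1, d_model)  # "English" sentence: 10 token vectors, batch of 1
tgt = torch.rand(7, 1, d_model)   # "French" prefix generated so far: 7 tokens
out = model(src, tgt)             # (7, 1, d_model): one vector per target token
print(out.shape)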
Because basically what this network seems to do is blend, in a uniform series of steps, both syntax and semantics, and I had this question of whether it got to the level of semantic understanding, whether it understood what a noun was as a class, for instance, like an animal. It's kind of like, if you think of how children maybe learn language, by whatever pattern recognition process they're doing, they don't, you know, explicitly study syntax or explicitly study semantics, but something like this could be going on, and there are some useful results that follow from there. You just basically said that you hadn't explored that, but I'm just kind of curious whether that was among the goals of what these Transformer papers are trying to do.
I agree with you, but let me play devil's advocate here. This is extremely brute force: huge amounts of compute power and data. These are, you know, billion-parameter models, and they train them on billions of documents from the web. So what will happen is that many of the questions you're talking about it could actually answer, because someone on the web has already answered them.
...an image, to interpret what you're seeing. So I'll bring up something we're familiar with, because we have our own explanation for it: border ownership. That is a very similar problem, where you see part of an image, and with border ownership, half the room will know what it is. There's a neuron that will only respond, like, say this is the input to a neuron in V1, this is the receptive field of the neuron in V1. It will respond to this, but it will not respond to the exact same thing if the actual figure here is, like, if...
So I'm surprised, because this was a key point of those papers. It wasn't in the first paper, but it was in the second paper, and that's what I walked away with, and it looks like ours, because there's a very important difference between just which side you're on and where you are on the object. So I'd be surprised if I got that wrong, but either way I want to know, because our model says it should be more specific than what you're suggesting.
...selecting from multiple inputs coming in there, right? So I'm trying to figure out: is that something that's just pushed forward, what you asked for, to the next network? Or is it deciding, in that attention step, which of these things to pay attention to, which would be, in my mind, a little more powerful, rather than just pushing the question up. So I'm trying to figure out what exactly that attention mechanism is doing.
On the one hand, this is really just unfolding in time, weighted model number two, and therefore this could be a dynamic process occurring in a single column that is attending to different things, and that fits exactly what we think, right? On the other hand, you could say, yes, maybe that exists, but here you're showing an unfolding; maybe it's both, but I'm not sure these diagrams consider both.
...this guy is somehow, say, taking an input and attending to some section here, but it's also attending to something down here at the same time, and we don't understand that. But what we've described is to use this unfolding in time, so that one won't be one of those. So it seems to be a combination; that would be consistent with what you're saying, yeah, and one...
And adding it in, essentially; like, if you think of this as a layer of cells with a bunch of output axons, it's almost like these are adding onto those axons. It's bizarre; it doesn't work with biology. Another take, though, is this: what is usually the case with these is that all of these have the exact same number of cells. This might be 1,000 cells, this is also a thousand cells, this is also 1,000. It's much more sane if you say, oh, every time I see one of those, it's actually a recurrent network being unrolled; every time I see one of these, if I interpret it as one of those being unrolled, then it's just that the difference is these...
Cell A might connect to cell B in multiple ways: on this dendrite in one way, and on this dendrite in another way. So this might actually be an artifact of the simple neurons we're using, and the networks we're training, to the extent that they're using principles of biology, might actually be mimicking this.
I think the big key difference here is what's part of the complexity in that box, the circuit. A few thoughts: there's, you know, this sort of reference-frame thing also going on, and so you are not just getting this input, you're getting this input in the context of some framework, some reference frame, and therefore it adds power to that thing over time. And so to me, if we were to do the biological model of attention we've talked about here, you've talked about it, Marcus, we've all talked about this: when we're attending to different things in the world, like you're attending to a little bit of the visual scene or a different pressure with your fingers, you are building up this structured environment of reference frames of reference frames. And so it's not just a 2D representation of the world, or a representation over time; it's really, we're...
...about something that you said. When you talked about, in neurobiology, neurophysiology, you can get an activation through multiple paths. So I'm just trying to think whether the analogy of the recursion here is to get different weights on there; I mean, the synapse is not going to change dynamically that fast. But if you're saying that we can get the recirculation by the activity of, you know, a different set of cell bodies that then send out a different set of axons, picked off of a different dendritic tree, to feed in the same thing...
...is it that we think synapses just can't change very rapidly? That's not really true; that's the classic view, but we know that the brain learns very rapidly. You could look at something on this board right now, you could have looked at this board for only a minute or two, and you would walk away and you'd remember things about this board right away, and those memories are not recurring activity in your brain.
You know, synaptic depression and the like, those are the minor changes, right? And what we require, our whole thing, all of our models, require that you're able to learn these major new associations rapidly, and it's hard to do that by tweaking individual synapses that already had some weight. It's much more like: if I'm going to learn, here's a population code, and over here's a population code, two separate sets of neurons, though they could be the same ones over time, and I'm going to say this pattern here and this pattern there are associated, rapidly.
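A toy sketch of that point, not Numenta's actual model: associate two sparse population codes in one shot by growing connections from the active cells of one pattern straight to the active cells of the other, rather than nudging pre-existing weights.

import numpy as np

n = 100
a = np.zeros(n); a[np.random.choice(n, 5, replace=False)] = 1  # population code A
b = np.zeros(n); b[np.random.choice(n, 5, replace=False)] = 1  # population code B

W = np.outer(b, a)          # one-shot Hebbian binding: new synapses A -> B
recalled = (W @ a) > 0      # presenting A now reactivates exactly B's cells
assert np.array_equal(recalled.astype(float), b)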
We talked about the thalamus routing version. It's got to be sort of a centralized thing, because you have this large cortical area, and if you rely on the thalamic relay cells to do this, then you've got this very small, centralized thing, and that's the whole point of bringing them all down to one point. We've talked about that for scaling, and because it becomes one point, it enables central attention. So I think your point is, you could have it all happen in a column.
Can I try to explain? I think it might be hard on the whiteboard, but alright. So each one of those boxes, like in the language example, they have a key and a query, independently. So what you're doing is, for one of those boxes, like box one, you take its query and then you multiply it by the keys of all the other ones, yeah.
Then you take a softmax over that, and then you have like a probability distribution that tells you how much you're going to pay attention to each of them. And then you multiply that probability by the value, and then you sum over all of them, so the representation for word five takes into account a piece of the value of all the other words, weighted by that probability. I don't know, it's hard to explain while I'm pointing at the board, but you can see.
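That computation, written out in NumPy: each token's query is dotted with every token's key, a softmax turns the scores into the probability distribution he mentions, and the output is the probability-weighted sum of the values. The 1/sqrt(d) scaling is from "Attention Is All You Need", though it isn't mentioned above; the shapes are arbitrary.

import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (tokens, tokens) match scores
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # probability-weighted sum of values

rng = np.random.default_rng(0)
tokens, d = 6, 16
Q, K, V = (rng.standard_normal((tokens, d)) for _ in range(3))
out = attention(Q, K, V)
# Row i of `weights` is the "how much to pay attention to each one"
# distribution; row i of `out` mixes a piece of every token's value.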
...by the previous conversations we had about attention in our own work, and what's required to learn a temporal model of the world by attending to different components, and what different reference frames are required. I think that would be a really nice supplement to this. So, you know, attention with reference frames.