Description
Lucas and Marcus continue discussions on attention and transformers. Jeff adds the idea of myelin to the conversation.
A
Correct. So you calculate the query, the key and the value for each one of the inputs. For example, you calculate Q1, which is the query weight matrix times X1, which is the input, and then similarly you calculate the key for each one of them. So K1 is the weight matrix W_K times the input X1. So they share the weight matrix here.
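A minimal sketch of the projection step just described, assuming toy 2-dimensional inputs. The names `W_Q`, `W_K`, `W_V` and every number below are made up for illustration and do not come from the discussion. The point is only that each input is multiplied by the same shared weight matrices.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Hypothetical weight matrices, shared across all inputs.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]
W_V = [[2.0, 0.0], [0.0, 2.0]]

# Hypothetical inputs x1 and x2.
inputs = [[1.0, 2.0], [3.0, 4.0]]

# The same three matrices are applied to every input.
queries = [matvec(W_Q, x) for x in inputs]
keys = [matvec(W_K, x) for x in inputs]
values = [matvec(W_V, x) for x in inputs]
```

Here `queries[0]` plays the role of Q1 and `keys[0]` the role of K1 for the first input.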
B
C
Is now or.
A
E
A
D
A
D
C
A
And you do that for all of them, and then you take a dot product between the query and the key, and you run a softmax over all of this. So what you get in the end is a probability distribution that's going to tell you how much attention you should pay to each one of the inputs, all right? So after you run the softmax, the effect is you can have a high value for the current input, which is more relevant, and then you're going to have a lower value for all the rest.
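The scoring step can be sketched as follows, under stated assumptions: the query and key vectors are invented numbers, and the helper names `dot`, `softmax` and `attention_weights` are hypothetical. The Transformer paper also divides the scores by the square root of the key dimension before the softmax; that scaling is left out here because the discussion does not mention it.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """One weight per input: softmax over query-key dot products."""
    return softmax([dot(query, k) for k in keys])

# Hypothetical query and keys for three inputs.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = attention_weights(query, keys)
```

The weights sum to one, and the input whose key lines up best with the query (the first one here) gets the largest share.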
D
E
D
D
D
A
A
A
A
D
A
C
A
E
E
E
E
And in addition, now there's this other thing here which says: oh, not only am I going to do this, but I'm going to look at all of the context here and here, and then for every context point here I'm going to give a weight value as to how important that context is, and then this is going to be multiplied into here. Yeah, all right, so I'm going to modify the normal input processing I'm going to do by how important the context is. Is that if we.
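One way to read "multiplied into here" is a weighted sum: each context point's value vector is scaled by its importance weight and the results are added up. The sketch below assumes that reading; the weights, the value vectors and the function name `weighted_sum` are all illustrative.

```python
def weighted_sum(weights, values):
    """Combine value vectors, scaling each by its attention weight."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Hypothetical importance weights for three context points (they sum to 1).
weights = [0.7, 0.2, 0.1]
# Hypothetical value vectors, one per context point.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

output = weighted_sum(weights, values)
```

The output is dominated by the first value vector because that context point carries the largest weight.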
C
A
A
A
And at first they introduced attention in recurrent neural networks, not just solve it until an issue, but so together, dirty diapers, and then we had the recurrent and then we have in. So we have this type in a comment to this guy, but also as months, and then this paper, the Transformers paper, suggested that you don't need the recurrence.
D
D
A
E
D
D
Just shifting the image around doesn't only make a difference; it really is an active processing element, so this doesn't account for that. You could argue that during a fixation maybe this kind of thing is occurring, especially over some part of the image perhaps, but what most neuroscientists would call attention, this doesn't capture that.
G
One of the things you can look at is the fact that at an early age we're taught to process the sentence sequentially, but there are other ways of processing it. That's what speed reading is about, where you basically jump over and look at a couple of segments of the sentence and pull the meaning out of it like that. So the fact that most people don't do it that way, because of the way that they were taught, doesn't mean that there isn't some capability for actually looking at these things in larger groupings, which you probably capture with this.
D
D
G
D
G
Well, the example you just gave, that's a common thing, so you can recognize it. You know it in one shot, that's fair. But if you're given novel content, in some ways with speed reading you're able to find the structure and get the meaning of it without, you know, voicing it internally, let's say. Yeah.
D
G
D
E
A
D
C
B
A
B
A
But the thing they do is they bury he and I spawn. So how do you represent it? You kind of represent it using embeddings, like GloVe or word2vec, that we have, which are distributed representations. And then you add, so you have like a distributed representation here, but then you add something that tells its position: this position, this position, this position. So.
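A minimal sketch of adding position information to a word embedding. It assumes the sinusoidal scheme from the Transformer paper as the "something that tells its position"; the discussion itself does not say which scheme is meant. The embedding values and the function names are hypothetical.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def add_position(embedding, pos):
    """Add the position signal element-wise to a word embedding."""
    pe = positional_encoding(pos, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

# Hypothetical 4-dimensional embedding for one word.
embedding = [0.1, 0.2, 0.3, 0.4]

# The same word at two different positions gets two different vectors.
word_at_0 = add_position(embedding, 0)
word_at_5 = add_position(embedding, 5)
```

Because the position signal is added rather than concatenated, the vector keeps its size while still distinguishing where in the sequence the word occurred.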
A
D
B
B
A
D
D
D
E
B
D
D
E
E
D
D
D
D
B
B
A
A
A
A
E
B
D
B
C
C
B
G
G
Lock on to that, and without words for that. So I think, you know, when I'm scanning text I'm looking for certain salient words or something else, something I don't even recognize or something that looks key to me, and that would focus my attention around that. So I agree, it's a fair analogy. That's.
D
Just, you know, it's just part of it, but what I thought was appealing here was, you know, this issue of, like, the word "thought", for example, doesn't contain a lot of information, and, if I now, this is whom you talk about that they, yeah. Well, the thing where, when I see that word, I would most likely attend to the next thing. You know, like, really this.
B
C
D
G
D
C
D
So it might go, say this is the cortex, and these are your six layers, and it might go down here for ten centimeters and go up someplace else, or it might go down into the spinal cord, something like that, or to another neuron, or it actually might just go locally and not go far. So spikes travel along the axon, and they always go all the way to the end. They never knock it all again, but the speed at which they travel is pretty slow, and so the trick the brain uses is to speed.
D
This can be really problematic, so the trick that nature uses is to wrap the axon in fat, and that fat is the myelin sheath. It is an insulator, and I'll go through the details a little bit in case you might be curious. Essentially, normally when a spike travels down an axon, there is always an opening and closing of ion gates. That's how it travels, right? There's a local voltage change. This opens a gate which lets in ions, which then moves that to the next gate.
D
D
When you put a fatty insulator around it, that doesn't happen between the ends of the insulator, and so what happens is you open and close gates here, it changes the voltage, and the voltage change moves all the way down to this next node here, where it opens and closes the gates again, which goes down to the next node. So this is a this.
D
F
D
D
D
D
G
D
C
E
D
D
It won't work. I don't know enough about that disease. I don't know what happens, but I also believe that in some of those neurodegenerative diseases it's not just the myelin sheath; I think the axon itself decays. And anyway, so this is what's known in neuroscience: you have these insulators along here and it speeds these things up, and it turns out that these insulators are.
D
D
Character, you can see how, these things are called the nodes of Ranvier. That's what they're called. So this has been known for a long time, and it's also been known for a long time that the more insulation you have, the faster it actually propagates. Part of that is because there's less leakage, but also because the more insulation you have, it squeezes the axon smaller, and then the thing is, the jumps are bigger.
D
So you can speed up or slow down: by increasing the amount of fat or reducing the amount of fat, you can change the speed of propagation along here. It's also been known for a long time that with training, if you really practice something a lot, you can end up with more myelin, you know, more insulator. I guess, you know, there's someone that practices something a lot, a musician or something.
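To make the timing point concrete, here is a toy calculation of how propagation speed changes arrival time over a fixed distance. The ten-centimeter length echoes the figure mentioned earlier in the conversation; the two velocities are hypothetical round numbers chosen only for illustration, and `arrival_time_ms` is a made-up helper name.

```python
def arrival_time_ms(distance_cm, velocity_m_per_s):
    """Time for a spike to cover a distance at a given speed, in milliseconds."""
    distance_m = distance_cm / 100.0
    return distance_m / velocity_m_per_s * 1000.0

AXON_LENGTH_CM = 10.0  # the ten-centimeter run mentioned in the conversation

# Hypothetical conduction velocities (assumed, not measured values).
slow_ms = arrival_time_ms(AXON_LENGTH_CM, 1.0)   # thin insulation
fast_ms = arrival_time_ms(AXON_LENGTH_CM, 50.0)  # thick insulation

speedup = slow_ms / fast_ms
```

Under these assumed numbers the same spike arrives in about 100 ms or about 2 ms, so changing the insulation changes when a signal shows up at its target, not just whether it does.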
D
F
D
E
D
We've always just waved our hands at that and said, well, somehow it works, right? It always seems like plumbing, as opposed to a new representation scheme. It's not introducing, and it's hard to tell, it's not, I mean, using a new way of representing data in the brain. It's just a way of making the brain work correctly, but, but, but.
D
That you have two sets of signals converging at some point, and they arrive at different times, say on path A and path B, and are then processed separately, something like that. They also didn't explain how this could occur, because if all you're doing is detecting the number of spikes going by here per second, that's of turmeric root system, it doesn't really tell you if this thing got to the end at the right time or not at all.
D
I see that's Victor I make this and fear, so I can see the thinner I make this, like, it's more difficult faster, so that would be like, but they suggested that it's an active process both ways, but they didn't explain how that might work. They didn't explain how this guy would, how these exercise, you know, know whether the signal arrived at the right time. And I think the idea is that you could speed it up or slow it down based on the end result. So there's some potentiation.
D
D
G
The morphology of an astrocyte actually has these multiple, you know, projections off of it, so you can see this coordination thing. And I remember an experiment where they took human astrocytes, which are much more, they had like a hundred offshoots, injected them into mice, and they actually took, and they were showing how it seemed to improve their character.
D
D
The question is: are we doing just something like, hey, maybe these, because they're being used a lot, or are you really trying to tune this to match the big need here? And I don't see how astrocytes on their own can do that, because the distance is far too long for a network of astrocytes to do that.
D
E
F
D
C
D
F
D
C
D
D
D
F
D
A
D
G
F
D
E
H
C
D
D
Back to here, it has to get repeated somehow, other than a very slow chemical gradient to go back, but it would be slow, very slow, botany in sight. If you put a voltage spike right here, a square wave in voltage, for example, you know, it'll dissipate very quickly unless you're actively reinstating it.
D
So it's the right idea, and I just don't, but these are really leaky pipes, and really they don't have the repeaters to get anything back in the other direction, unless it's a small chemical gradient, which would be very slow. It could be very slow. It could be the kind of learning that takes, you know, weeks to get good, right?