From YouTube: On Attention (NRM Mar 2, 2020)
Description
Marcus Lewis on attention. He reviews current papers on Transformers and relates them to HTM with Jeff.
Recurrent Models of Visual Attention
https://arxiv.org/abs/1406.6247
Attention Is All You Need
https://arxiv.org/abs/1706.03762
Okay, well, to some extent I'm winging this, because I haven't been living in the world of attention. When I read about this and saw the connection, that was months ago. I thought about how I would present this, how I would sort of free it from the language of mathematics and neural networks that it's written in, and try to discuss it in terms of what it actually says.
Like auto-completing entire stories and writing essays that are almost coherent; it produces a bunch of brilliant sentences. So I guess something that's cool about all this is that the models are clever and they work, which is nice.
To make it clear what's going on here: they have an input image, and they're choosing which part of the image to pass into the model, which is essentially saccading. One thing they do a little bit differently from saccading: if it were truly a copy of that, these boxes would all be the same size, but they're kind of zooming in and zooming out. They're saying, first you might get a broad view, and the second might zoom in on a certain part.
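To make the zooming concrete, here is a minimal NumPy sketch of that multi-resolution "glimpse": crop progressively larger patches around a fixation point, then downsample each to a common size. The function name, the padding, and the nearest-neighbor downsampling are illustrative choices, not the paper's exact implementation; running several of these with independent centers gives the multiple-sensors picture discussed next.

import numpy as np

def glimpse(image, center, size=8, scales=3):
    """Extract `scales` patches centered at `center`, each twice as wide
    as the last, all resized to (size, size). Illustrative sketch."""
    patches = []
    for s in range(scales):
        w = size * (2 ** s)                  # patch width at this scale
        half = w // 2
        y, x = center
        # Pad so crops near the border stay in bounds.
        padded = np.pad(image, half, mode="constant")
        patch = padded[y:y + w, x:x + w]     # centered on (y, x)
        stride = 2 ** s                      # nearest-neighbor downsample
        patches.append(patch[::stride, ::stride])
    return np.stack(patches)                 # (scales, size, size)

img = np.arange(64 * 64, dtype=float).reshape(64, 64)
g = glimpse(img, center=(32, 32))
print(g.shape)  # (3, 8, 8): a zoomed-in view plus broader context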
It evolves toward this thing that works. It starts seeming a little strange over here, but you can see it's a smooth transition. So now imagine just taking this and having multiple of them in parallel, multiple moving sensors. Let's even call these cortical columns if you want, but let me say this.
And repeat: now I'm just going to take that exact thing and put copies next to each other. If you have multiple independent moving sensors, at this point it's a little bit analogous to what we could call a cortical column; this could be a model of a cortical column with independent moving sensors. You can imagine them processing and working like this, with multiple of these in parallel. I don't know that I have much to say about this, because you probably already understand it: it's just multiple of these in parallel.
Well, as you seemed to point out there, if you talk about multiple visual columns, they kind of move together, but here you're showing them moving independently, which is fine; it would be maybe like one column for one finger and one column for another, something like that. I mean, you have them moving independently. So if...
Now, rather than thinking of this as multiple moving sensors, think of it as an array of sensors that is sort of covering the whole image. No movement is occurring anymore, at least for this conversation. Instead, when this column wants to get information about other parts of the image, rather than moving over to it, it uses horizontal connectivity. So, magic.
Yeah, and by the way, this is what our language is always getting at; this is so important in all of these papers: the cortical column is processing input over time. It receives an input, and then from that it sort of decides what to attend to next, or where to get information from next, and it can do that using the large number of horizontal connections going in both directions. It can retrieve information from the other parts of the sensory array, or at least the nearby ones.
Let me say the same thing a different way; let's see if I understand it. Let's say there are three cortical columns, and each one is processing one subset of the image, so they get all of that information, and each of these cortical columns is connected to the others laterally. And if a cortical column now says, oh, I want to get information from, you know, image section number three, it can get inputs from the other cortical columns, but I'm only getting it from the third one, and now I'm going to do another cycle.
That's what they introduced, and they incorporated it into a language model. Wow. So everything I just said before, you can imagine it as a recurrent neural network: all of these, you know, the little G's, the little arrows inside, and these sideways arrows. You can take this and you can convert it into a feed-forward network for a specific number of time steps.
Just to be clear: I said we've just unrolled this, but I am using different learned weights at each of these stages. It's not like I'm sharing the weights from here to here. It is truly converted into a feed-forward network, where all of these connections have different weights. If I were literally unrolling this, you'd use the same weights at every stage.
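Here is a toy NumPy contrast of those two cases: literal unrolling reuses one weight matrix at every step, while the feed-forward conversion gives each stage its own learned weights. The shapes and the tanh nonlinearity are arbitrary illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)

# (1) True unrolling: the SAME weight matrix W is applied at every step.
W = rng.standard_normal((16, 16))
h = x
for _ in range(3):
    h = np.tanh(W @ h)

# (2) Feed-forward conversion: a DIFFERENT learned W_t per stage.
Ws = [rng.standard_normal((16, 16)) for _ in range(3)]
h2 = x
for W_t in Ws:
    h2 = np.tanh(W_t @ h2)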
But otherwise, that is what this is: it's this network where every level of it is receiving input but deciding where to attend to next. Essentially, this one decides, okay, I need to know what's here; this one decides, I need to know what's here. I think it's useful to give a concrete example, and I thought I wasn't going to mention language at all, but it actually provides some really nice toy examples that are useful for thinking about this. So here...
...the point of it, which is that along many of these optional paths, it's routed successfully: once the model has been trained successfully, the word "it" will refer to the correct value; it'll refer to "animal". So the attention has been routed cleverly in some way, so that after five stages of processing you've got a network where the representation for the word "it" is some combination of a couple of things that mainly includes "the animal".
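As a toy illustration of that claim, with invented numbers just to state it concretely: if the trained attention weights leaving "it" concentrate on "animal", then the new representation of "it" is mostly a copy of animal's value vector. The sentence, the one-hot value vectors, and the 0.80 weight are all hypothetical.

import numpy as np

words = ["the", "animal", "was", "tired", "it"]
values = {w: np.eye(5)[i] for i, w in enumerate(words)}  # stand-in value vectors
attn_from_it = np.array([0.05, 0.80, 0.05, 0.05, 0.05])  # hypothetical learned weights

new_it = sum(p * values[w] for p, w in zip(attn_from_it, words))
print(np.round(new_it, 2))  # dominated by the "animal" component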
The network here, yeah, I'm really cutting corners. If I were telling the full story, what they're doing here is translation. So what they're going to do is take this English sentence and move it into some intermediate thing that's not really in any language; that's called the encoder. Then they have a decoder that puts it back into another language, like French or something like that.
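A minimal sketch of that encoder-decoder shape, using PyTorch's built-in nn.Transformer. A real translation model adds token embeddings, positional encodings, and a projection onto the output vocabulary; all the sizes here are arbitrary illustrative choices.

import torch
import torch.nn as nn

d_model = 64
model = nn.Transformer(d_model=d_model, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)

src = torch.rand(10, 1, d_model)  # "English" sentence: 10 token vectors, batch of 1
tgt = torch.rand(7, 1, d_model)   # "French" prefix generated so far: 7 tokens
out = model(src, tgt)             # (7, 1, d_model): one vector per target token
print(out.shape)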
Because basically what this network seems to do is blend, in a uniform series of steps, both syntax and semantics, and I had this question of whether it got to the level of semantic understanding, whether it understood what a noun was as a class, for instance, like an animal. It's kind of like, if you think of how children maybe learn language, by whatever pattern recognition process they're doing, they don't, you know, explicitly study syntax or explicitly study semantics, but something like this could be going on, and there are some useful results that follow from there. You just basically said that you hadn't explored that, but I'm just kind of curious whether that was among the goals of what these Transformer papers are trying to do.
I agree with you, but let me play devil's advocate here. This is extremely brute force: huge amounts of compute power and data. These are, you know, billion-parameter models, and they train them on billions of documents from the web. So what will happen is that many of the questions you're talking about it could actually answer, because someone on the web has already answered them.
...an image, to interpret what you're seeing. So I'll bring up something we're familiar with, because we have our own explanation for it: border ownership. That is a very similar problem, where you see part of an image, and with border ownership, half the room will know what it is. There's a neuron that will only respond, like, say this is the input to a neuron in V1, this is the receptive field of the neuron in V1. It will respond to this, but it will not respond to the exact same thing if the actual figure here is, like, if...
So I'm surprised, because this was a key point of those papers. It wasn't in the first paper, but it was in the second paper, and that's what I walked away with, and it looks like ours, because there's a very important difference between just which side you're on and where you are on the object. So I'd be surprised if I got that wrong, but either way I want to know, because our model says it should be more specific than what you're suggesting.
...selecting from multiple inputs coming in there, right? So I'm trying to figure out: is that something that's just pushed forward, what you asked for, to the next network? Or is it deciding, in that attention step, which of these things to pay attention to, which would be, in my mind, a little more powerful, rather than just pushing the question up. So I'm trying to figure out what exactly that attention mechanism is doing.
On the one hand, this is really just unfolding in time, weighted model number two, and therefore this could be a dynamic process occurring in a single column that is attending to different things, and that fits exactly what we think, right? On the other hand, you could say, yes, maybe that exists, but here you're showing an unfolding; maybe it's both, but I'm not sure these diagrams consider both.
...this guy is somehow, say, taking an input and attending to some section here, but it's also attending to something down here at the same time, and we don't understand that. But what we've described is to use this unfolding in time, so that one won't be one of those. So it seems to be a combination; that would be consistent with what you're saying, yeah, and one...
And adding it in, essentially; like, if you think of this as a layer of cells with a bunch of output axons, it's almost like these are adding onto those axons. It's bizarre; it doesn't work with biology. Another take, though, is this: what is usually the case with these is that all of these have the exact same number of cells. This might be 1,000 cells, this is also a thousand cells, this is also 1,000. It's much more sane if you say, oh, every time I see one of those, it's actually a recurrent network being unrolled; every time I see one of these, if I interpret it as one of those being unrolled, then it's just that the difference is these...
Cell A might connect to cell B in multiple ways: on this dendrite in one way, and on this dendrite in another way. So this might actually be an artifact of the simple neurons we're using, and the networks we're training, to the extent that they're using principles of biology, might actually be mimicking this.
I think the big key difference here is what's part of the complexity in that box, the circuit. A few thoughts: there's, you know, this sort of reference-frame thing also going on, and so you are not just getting this input, you're getting this input in the context of some framework, some reference frame, and therefore it adds power to that thing over time. And so to me, if we were to do the biological model of attention we've talked about here, you've talked about it, Marcus, we've all talked about this: when we're attending to different things in the world, like you're attending to a little bit of the visual scene or a different pressure with your fingers, you are building up this structured environment of reference frames of reference frames. And so it's not just a 2D representation of the world, or a representation over time; it's really, we're...
...about something that you said. When you talked about, in neurobiology, neurophysiology, you can get an activation through multiple paths. So I'm just trying to think whether the analogy of the recursion here is to get different weights on there; I mean, the synapse is not going to change dynamically that fast. But if you're saying that we can get the recirculation by the activity of, you know, a different set of cell bodies that then send out a different set of axons, picked off of a different dendritic tree, to feed in the same thing...
...is it that we think synapses just can't change very rapidly? That's not really true; that's the classic view, but we know that the brain learns very rapidly. You could look at something on this board right now, you could have looked at this board for only a minute or two, and you would walk away and you'd remember things about this board right away, and those memories are not recurring activity in your brain.
You know, synaptic depression and the like, those are the minor changes, right? And what we require, our whole thing, all of our models, require that you're able to learn these major new associations rapidly, and it's hard to do that by tweaking individual synapses that already had some weight. It's much more like: if I'm going to learn, here's a population code, and over here's a population code, two separate sets of neurons, though they could be the same ones over time, and I'm going to say this pattern here and this pattern there are associated, rapidly.
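A toy sketch of that point, not Numenta's actual model: associate two sparse population codes in one shot by growing connections from the active cells of one pattern straight to the active cells of the other, rather than nudging pre-existing weights.

import numpy as np

n = 100
a = np.zeros(n); a[np.random.choice(n, 5, replace=False)] = 1  # population code A
b = np.zeros(n); b[np.random.choice(n, 5, replace=False)] = 1  # population code B

W = np.outer(b, a)          # one-shot Hebbian binding: new synapses A -> B
recalled = (W @ a) > 0      # presenting A now reactivates exactly B's cells
assert np.array_equal(recalled.astype(float), b)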
We talked about the thalamus routing version. It's got to be sort of a centralized thing, because you have this large cortical area, and if you rely on the thalamic relay cells to do this, then you've got this very small, centralized thing, and that's the whole point of bringing them all down to one point. We've talked about that for scaling, and because it becomes one point, it enables central attention. So I think your point is, you could have it all happen in a column.
Can I try to explain? I think it might be hard on the whiteboard, but alright. So each one of those boxes, like in the language example, they have a key and a query, independently. So what you're doing is, for one of those boxes, like box one, you take its query and then you multiply it by the keys of all the other ones, yeah.
Then you take a softmax over that, and then you have like a probability distribution that tells you how much you're going to pay attention to each of them. And then you multiply that probability by the value, and then you sum over all of them, so the representation for word five takes into account a piece of the value of all the other words, weighted by that probability. I don't know, it's hard to explain while I'm pointing at the board, but you can see.
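That computation, written out in NumPy: each token's query is dotted with every token's key, a softmax turns the scores into the probability distribution he mentions, and the output is the probability-weighted sum of the values. The 1/sqrt(d) scaling is from "Attention Is All You Need", though it isn't mentioned above; the shapes are arbitrary.

import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (tokens, tokens) match scores
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # probability-weighted sum of values

rng = np.random.default_rng(0)
tokens, d = 6, 16
Q, K, V = (rng.standard_normal((tokens, d)) for _ in range(3))
out = attention(Q, K, V)
# Row i of `weights` is the "how much to pay attention to each one"
# distribution; row i of `out` mixes a piece of every token's value.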
...by the previous conversations we had about attention in our own work, and what's required to learn a temporal model of the world by attending to different components, and what different reference frames are required. I think that would be a really nice supplement to this. So, you know, attention with reference frames.