From YouTube: Investigation of Recurrent Sequence Memory
Description
Paper review of "Learning distant cause and effect using only local and immediate credit assignment" (https://arxiv.org/abs/1905.11589)
Discuss at https://discourse.numenta.org/t/paper-review-of-recurrent-sequence-memory/6357
They're probably referring to one specific blog post — the one about how to build an artificial general intelligence. That might sound a little crazy to you, but in it they really do try to lay out what the problems are that the company wants to see solved. They really laid that out, and that's what they refer to.
Yeah, they were in touch with us to get feedback and to explore potential collaborations as well. So that's something we decided: we'd take a look at the paper, see how we might interact before doing this here, and think about that for the future.
So what they were trying to do with recurrent sequence memory was design a biologically feasible learning algorithm that allows every stage to do its own learning, to overcome some of the limitations of existing sequence learning, and to do so at much larger scale.
Remember the requirements — the constraints, the rules that they set out for themselves — I put them here. So, local credit assignment: they define that by limiting backpropagation to at most two layers. I think they justified this by talking about the potential of active dendrites, which are sort of like networks in neurons, in and of themselves.
An autoencoder is one that takes an input — like an MNIST digit — and forms a summary representation that goes through a bottleneck, then produces an output that's a decoded prediction of the same input. It tries to predict itself. We have a loss as we compare the difference between the two, so we optimize the model to reduce that loss, so that we're generating high-fidelity versions of the input through a smaller latent space.
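The autoencoder loop described above can be sketched like this (a minimal toy version — the shapes, names, and random initialization are illustrative assumptions, not the paper's code): encode through a bottleneck, decode, and score reconstruction loss.

```python
import random

random.seed(0)

def matvec(W, x):
    # Multiply weight matrix W (rows = output units) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(a, b):
    # Mean squared error between prediction and target.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

input_dim, latent_dim = 8, 2  # bottleneck: 8 -> 2 -> 8
W_enc = [[random.uniform(-0.1, 0.1) for _ in range(input_dim)] for _ in range(latent_dim)]
W_dec = [[random.uniform(-0.1, 0.1) for _ in range(latent_dim)] for _ in range((input_dim))]

x = [random.random() for _ in range(input_dim)]  # stand-in for an input digit
z = matvec(W_enc, x)       # summary representation (latent code)
x_hat = matvec(W_dec, z)   # decoded prediction of the same input
loss = mse(x_hat, x)       # the quantity gradient descent would reduce
print(len(z), len(x_hat))  # 2 8
```

In a real run you would update `W_enc` and `W_dec` by gradient descent on `loss`; the point here is just the encode-bottleneck-decode-compare shape.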
They were actually trying to take the work from HTM, bring it into the deep learning world, and demonstrate it on new problem types. So next they reference the work on Reber grammars, maybe 2016. They don't actually reproduce that themselves, but they have a couple of other demonstration tasks they want to show this working on. Okay.
So they illuminate a couple of issues that they see with this kind of architecture, which is inspired by HTM models. Yeah, so it's generally set up as a predictive autoencoder: we're trying to do next-item prediction. And then, in terms of the architecture, what they have is columns and cells.
That's going to be very familiar. They call these groups, but these are just mini-columns. So we have a proximal weight matrix — the proximal dendrites coming down, taking the input — and they're shared for all cells in the column, just like in HTM. And then we also have, in this case, a single active dendrite on each cell that takes input from the previous memory state. So this is the memory: the kind of activity in the network at t minus one.
For the current memory state, they want to spread this inhibition matrix — and that's everything. So unlike HTM, where each cell has a separate kind of predictive state, they do everything in a single matrix of activity. It's a little bit different in that respect, and that has some implications that I think are important to discuss.
Exactly, yeah — everything it's looking at here they do specify; they give a specification of, I'm guessing, most of the parameters, so we assume some of these too. And then we have this inhibition matrix. You can think of it this way: for every cell we have an inhibition state. This is inspired by the refractory period of a cell — cells don't fire for a certain period of time after firing. So we keep track of when each cell fired most recently, and we decay that matrix over time.
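That refractory bookkeeping can be sketched like this (a toy version; the decay constant and reset-to-one rule are my assumptions, not the paper's exact equations): cells that just fired get full inhibition, and everyone's inhibition decays toward zero each step.

```python
DECAY = 0.5  # assumed per-step decay; the paper's constant may differ

def step_inhibition(inhibition, fired):
    # Decay all inhibition values, then reset cells that just fired to full.
    return [1.0 if f else DECAY * h for h, f in zip(inhibition, fired)]

inhibition = [0.0, 0.0, 0.0]
inhibition = step_inhibition(inhibition, [True, False, False])   # cell 0 fires
inhibition = step_inhibition(inhibition, [False, False, False])  # nobody fires
print(inhibition)  # [0.5, 0.0, 0.0] -> cell 0 is still partially inhibited
```

Subtracting (or gating by) this inhibition before selecting winners is what keeps a recently fired cell from dominating every step.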
Okay, so yes, we inhibit the result of this sum, and then we do k-winners. This produces masks — one that's column-wise, one that's cell-wise. It's going to pick a single cell for each of the winning columns, based on the activity of the cells, sort of like max pooling. So let's say k equals two: we pick two columns, and then one cell per column, so we have this cell and this cell come out.
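The two-stage selection described above can be sketched as follows (hypothetical shapes; the exact k, scoring, and tie-breaking are assumptions): first rank columns by their summed, inhibited activity and keep the top k, then pick the single most active cell within each winning column.

```python
def k_winner_columns(activity, k):
    # activity: list of columns, each a list of per-cell activations.
    # Stage 1 (column-wise mask): rank columns by total activity, keep top k.
    col_scores = [sum(cells) for cells in activity]
    winners = sorted(range(len(activity)),
                     key=lambda c: col_scores[c], reverse=True)[:k]
    # Stage 2 (cell-wise mask): best single cell per winning column (max pooling).
    return {c: max(range(len(activity[c])), key=lambda i: activity[c][i])
            for c in winners}

activity = [
    [0.1, 0.9],  # column 0
    [0.2, 0.1],  # column 1
    [0.8, 0.7],  # column 2
]
result = k_winner_columns(activity, k=2)
print(result)  # columns 2 and 0 win; their best cells are 0 and 1
```

With k=2, columns 2 and 0 have the highest totals, and within them cell 0 (0.8) and cell 1 (0.9) are the single winners.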
It's a very similar representation scheme. I think there's one major difference, which maybe I'll get to in a second, because I think that's the thing to talk about — the weights. But does everything make sense up to here, at a high level, about what this is doing? And then we generate outputs. The primary output is going to be the memory state: this kind of activity is going to become the new memory for the next time step. It's going to be decayed as well — there's an optional decay term.
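The memory update could look something like the following (the exact decay form is my assumption; as noted above, the decay term is described as optional): the winning activity becomes the new memory, with the old memory fading out.

```python
MEMORY_DECAY = 0.9  # assumed constant; the decay term is optional

def update_memory(memory, activity, decay=MEMORY_DECAY):
    # New memory = decayed old memory plus the current winning activity.
    return [decay * m + a for m, a in zip(memory, activity)]

memory = [0.0, 1.0, 0.0]
memory = update_memory(memory, [1.0, 0.0, 0.0])
print(memory)  # [1.0, 0.9, 0.0] -> new activity is fresh, old activity fades
```

This decayed memory is what the active dendrites read at the next time step, so older activity contributes less and less.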
It's not — it's not just this one sequence. Because if I, let's say, trained it on 20 different 12-digit sequences, with multiple sequences that begin with zero, then my prediction for the first digit would be ambiguous. Whereas if I only train on this one sequence — that's it — then after the first zero I can predict the one very accurately.
So that's why I asked where the 97 percent comes from. Because if you were training on multiple sequences, which I assumed was the case, then you would have to either make that measurement at the end, or somewhere far into the sequence — or, like you're saying, they're just training on this one 12-digit sequence, that's it. Yes — and then, therefore, you could measure the prediction.
Because this classifier is trained on the hidden memory state, they can't just pass in a full image. Okay, well, what they do do is take this memory state and train two classifiers on it: once for the current image and once for the next image. And they show that the same exact memory state is able to capture both — the representation of the memory state encodes information for both the current item and the next item. Yes.
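That probing idea can be sketched with a toy stand-in (a nearest-centroid classifier here; the paper's actual read-out model and these memory vectors are hypothetical): fit two read-outs on the same frozen memory vectors, one labeled with the current item and one with the next item.

```python
def centroids(vectors, labels):
    # Average the memory vectors that share a label.
    sums, counts = {}, {}
    for v, y in zip(vectors, labels):
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + vi for s, vi in zip(sums.get(y, [0.0] * len(v)), v)]
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def classify(cents, v):
    # Predict the label of the nearest centroid (squared distance).
    return min(cents, key=lambda y: sum((c - vi) ** 2
                                        for c, vi in zip(cents[y], v)))

# Hypothetical memory states recorded while the network watched "0, 1, 0, 1".
memories = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
current = [0, 1, 0, 1]  # item shown at each step
nxt = [1, 0, 1, 0]      # item about to be shown

probe_now = centroids(memories, current)
probe_next = centroids(memories, nxt)
# The SAME memory vector supports both read-outs:
print(classify(probe_now, [0.95, 0.05]), classify(probe_next, [0.95, 0.05]))
```

One vector, two correct answers: the "current item" probe says 0 and the "next item" probe says 1, which is the sense in which the memory state encodes both.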
They attach the memory to a DQN, and it allows it to work on the problem. But the more interesting and, I think, more useful benchmark they're working on is this Penn Treebank dataset, in which you have a million words and on the order of a 10,000-word vocabulary. You have to do next-word prediction: you get a sentence, and for every word you're trying to predict the next word that's going to come in.
They're training up an embedding from scratch — they're building their own language model, or embedding model, based on the corpus. So this is the embedding matrix, and it's trained based on, I guess, the traditional predict-the-next-word style of training from context. Okay, so you're going to get these distributed vectors for each word, and you're going to pass that into the network, and they produce predictions from that. But that's a question that I had for them that I want to follow up on.
It could be that they're using some pre-trained embedding — that's something to follow up on; they don't mention it. Because I have not perfectly replicated the results from the language modeling yet: the MNIST looks really good, the language modeling doesn't. Okay, so just quickly — we already discussed most of this in terms of comparisons with HTM, or the things that are going to be different. So, continuous.
They have this refractory inhibition, which is parallel to what we do with boosting. Their architecture is explicitly generative — I think we've done some classification, and some generative work as well, I believe, but that's not explicitly built into the network. Whereas they are constantly producing a predicted image for the next time step: even though that's not the only internal representation there, the network is trained on this loss function that is trying to produce images.
We're talking about a different thing now, I think. So I believe that this architecture allows which columns get activated to be influenced by the recurrent input, in a way that HTM doesn't — correct me if I'm wrong. We are summing the two activations and then picking winners from that. So with a really strong recurrent input, we can actually activate columns that have no connections to the input at time zero.
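That point can be made concrete with a toy sum-then-select (the numbers are hypothetical): a column receiving zero feedforward input can still win if its recurrent drive is strong enough.

```python
def winning_columns(feedforward, recurrent, k):
    # Sum the two drives per column, then take the top-k columns.
    total = [f + r for f, r in zip(feedforward, recurrent)]
    top = sorted(range(len(total)), key=lambda c: total[c], reverse=True)[:k]
    return sorted(top)

feedforward = [0.9, 0.8, 0.0]  # column 2 gets NO feedforward input this step
recurrent = [0.0, 0.0, 1.5]    # ...but strong recurrent drive
print(winning_columns(feedforward, recurrent, k=2))  # [0, 2]: column 2 wins anyway
```

In HTM's temporal memory, by contrast, the spatial pooler picks columns from feedforward input alone and the recurrent (distal) input only biases which cells within those columns fire.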
I won't argue — there are some advantages. But I also think you lose a lot for that too. I mean, I think there's a real advantage in separating out these two problems, because sometimes they're just different. That's something we never really — we didn't address it in the temporal memory algorithm, but we think about it a lot in terms of the columns architecture, where those active predictions come in. And here they're saying, hey, we're doing it in one. No, it wasn't that.
They're tied together — so that's why I was saying that the predictive influence can actually produce column-level activity. Yes, it can. And what they're doing is actually decoding the recurrent, column-level activations to produce the prediction again. So how many columns do they have in this? Usually 200 to 600: 200 for MNIST and 600 for language.
D
Using
training,
they're
betting,
but
it's
initially
it's
just
a
vector
of
ID's,
so
Michael
we're
going
high.
So
I'm,
not
my
you
can't
pre-trained
your
met
him
we're
just
training
it
back,
it's
fresh!
So
it
comes
in
yes
morning,
planner,
so
I'm
using
PI
torches
embedding
module,
which
I
think
is
the
same
yeah
I.
Don't
they
don't
specify
with
their
universally
no
need
I?
Suppose,
there's
a
possibility
there
not
doing
any.
Inventing
at
all,
it
seems
unlikely
right
if
they
were
just
using
one.
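The embedding stage being described can be sketched like this (a pure-Python stand-in for a PyTorch `nn.Embedding`-style trainable table; the sizes and names are assumptions): word IDs index rows of a freshly initialized matrix that would be trained along with the rest of the network, with no pre-training.

```python
import random

random.seed(0)

VOCAB_SIZE, EMBED_DIM = 10_000, 100  # assumed sizes, in the ballpark discussed

# Freshly initialized table: one trainable row per word ID (no pre-training).
table = [[random.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
         for _ in range(VOCAB_SIZE)]

def embed(word_ids):
    # Lookup: each ID selects its row; in training, gradients
    # would flow back into exactly those rows.
    return [table[i] for i in word_ids]

sentence = [42, 7, 999]              # token IDs from the corpus
vectors = embed(sentence)
print(len(vectors), len(vectors[0]))  # 3 100
```

In PyTorch this whole block collapses to `torch.nn.Embedding(10_000, 100)` applied to a tensor of IDs; the point is that the distributed vectors start random and are learned from context, not loaded from a pre-trained model.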
The major problem that they reference in the paper is that it's very hard for this kind of architecture to generalize. So they do really well memorizing Penn Treebank — this huge dataset — and they can remember these huge sequences: I think for Penn Treebank their next-word prediction accuracy gets up to 50% on the training set. That demonstrates the immense capacity of the memory, but it's overfitting like crazy, and so once you give it a test set, it doesn't have any idea what to do with it.
I think you really need to have multiple sequences with, you know, high-order confusion in between, to be able to do things like start in the middle — which is what we were trying to get at. And I think Mark's point here is great: with this one, just training on one sequence, you could feed it almost anything and it could just replay the one, the single sequence it knows, and completely ignore the input. That sort of ties to your previous point — the recurrent input can drive the columnar activity.
It's true, yeah — it's not a very naturalistic training set, is it? But maybe that's just for debugging.
Yeah, so the thing they're trying to solve now — or that they mentioned they want to figure out how to deal with — is exposure to novel words and novel sequences, which this does very poorly on because of the overfitting problem. And so they were talking to us on this call about attentional structures and things like that, which might allow looking farther back in time.
Obviously, these are patterns from a pre-sorted corpus, order-dependent, so you're predicting another one or another two. But still, the idea that what's coming in is always somewhat novel — we always had some issues in that regard, which is, you know, noise. Noise would trip up the temporal memory, so we had to rely on some pooling layer or some other artifice to bridge across noise, right.
I like how this uses learned sequences to then essentially put labels on what's being sensed. They can learn the MNIST digits by first learning a sequence and then seeing various versions of that sequence; now that it has learned the sequence, it can label these zeros and ones and such. For us, that's almost like running our temporal memory before training our spatial pooler.
If that's a good example — like I just said, I might write a seven in two different ways, right? And our spatial pooler in those situations might classify them differently, so it would learn those as two separate sequences. Depending which way you go, it might force those two categories together.
Maybe an easy one would be — I don't know — different coffee cups that are slightly different from each other, or the same coffee cup under different lighting. You're getting these different sensory inputs for the same object from the same location, the same viewing location, and you're learning.
We wanted to jump to a totally different one — I always thought about sequences of sequences, which is a formalism we don't have. Yeah, and so we never had that; it always bothered me. I put that in as a requirement early on, but I felt like the solution became clear through displacements: because displacements, which we're learning as a sequence of displacements on the object, are the answer to that problem. So we're not only learning sequences of sensory input — now the elements are probably learned as sequences of displacements — but we haven't really developed that.
Like, what if I write a seven in two ways, or a five, and so on? You can literally just tell it: you can say sixes and nines are the same and train the system on one or two digits, right? Now, the question that came up — it was curious to me — okay, so should the system figure out that sixes and nines represent the same thing?
It's like a rivalry in the predictions — they cross-inhibit each other, because we do have sub-representations of a six: there are two kinds of sixes, the nine kind of six and the six kind of six, and then we can make predictions for both. But maybe something stochastic in the network produces a very nice six or a very nice nine each time, because we've never seen the eight before.
They mentioned this on the call — yeah, and stacking: it doesn't necessarily work. It was designed to be stacked, where you take the memory outputs and feed them into the next layer, but what they saw in experiments was that the higher layers just learned the same sequences all over again. Yeah, that's something I think we'd eventually play with as well. I think there are better ways to tune parameters for the higher networks — to maybe encourage invariant structures, maybe with less inhibition at higher levels.
Say there's a single transition from a zero to a one at this level; then there's another level on top — maybe it's got the zero-to-one, and maybe it's predicting that because it goes to a two, then to a three. But still, with just two layers of hierarchy you only get, like, three transitions. You can't get a really long sequence where a higher level stays active for a while and then transitions, representing the really entire sequence. It's simply a hard problem.