From YouTube: Temporal Memory via RSM-like models - Numenta Research
Description
"Temporal Memory via Recurrent Sparse Memory-like models" - topic from Jeremy Gordon https://twitter.com/onejgordon
Discuss at https://discourse.numenta.org/t/temporal-memory-via-rsm-like-models/6345
Okay, so I'm just going to go through an update on what we've done with RSM, which we covered several weeks ago. Just as a reminder of what the architecture looks like: this is an HTM-inspired recurrent neural network designed for sequence learning tasks. The way it works is we have n groups and n cells per group, just like HTM. Each column, sorry, the authors call them groups, but the groups follow suit with HTM columns: you can think of each column of cells as having shared weights from the proximal input, and this figure shows a digit coming in as that input. Each cell also has recurrent inputs from the entire previous memory state, so the red connections there are from the previous memory state and the blue connections are the feed-forward input, and, in the usual deep-learning way, these are continuous weights.
The real advantage the authors call out is its local credit assignment: there's no backpropagation through time needed to learn connections to previous time steps, because each cell gets direct connections to the previous memory state. That means all of the hidden state of the network has to be encoded in that memory. There's an inhibition, or k-winners, step: they take the top k columns, allow only those columns to activate, and those activations then go through a final layer that produces the image prediction.
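To illustrate the mechanism just described, here is a minimal sketch of one forward step of such a layer. The shapes, names, and the particular way the proximal and recurrent drives are combined are my assumptions, not the authors' code:

```python
# Minimal sketch (not the authors' code): one step of an RSM-like layer with
# m columns x n cells, feed-forward weights shared within a column, per-cell
# recurrent weights from the full previous memory, top-k column inhibition,
# and a max over cells per column feeding the image decoder.
import numpy as np

m, n, d_in, k = 100, 4, 784, 10              # columns, cells per column, input dim, winners
rng = np.random.default_rng(0)
W_ff = rng.normal(0, 0.01, (m, d_in))        # proximal weights, one row per column (shared by its cells)
W_rec = rng.normal(0, 0.01, (m * n, m * n))  # recurrent weights from the full previous memory

def rsm_step(x, memory_prev):
    """x: (d_in,) input image; memory_prev: (m*n,) previous memory state."""
    ff = W_ff @ x                                    # (m,) column-level feed-forward drive
    rec = (W_rec @ memory_prev).reshape(m, n)        # (m, n) per-cell recurrent drive
    act = ff[:, None] * (1.0 + np.maximum(rec, 0))   # one plausible way to combine the two

    col_strength = act.max(axis=1)                   # strongest cell in each column
    winners = np.argsort(col_strength)[-k:]          # top-k column inhibition
    mask = np.zeros(m)
    mask[winners] = 1.0
    act = act * mask[:, None]

    memory = act.reshape(-1)                         # new memory state, (m*n,)
    column_code = act.max(axis=1)                    # max over cells -> column-level code
    return memory, column_code                       # column_code goes to the image decoder
```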
A reminder of what the benchmark looks like: one of the tasks we're testing with is what we're calling stochastic sequence MNIST. The way this works is we define a grammar of subsequences; this is the benchmark grammar we've been using, which the original authors also used. Each row is a subsequence, and you move through a subsequence deterministically, while the next subsequence is chosen at random, and each digit label is then presented as an MNIST image.
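A hypothetical sketch of such a generator might look like the following; the grammar shown here is illustrative only, not the actual benchmark grammar:

```python
# Sketch of a "stochastic sequence MNIST" generator: pick a subsequence from
# the grammar uniformly at random, emit its digits deterministically, and
# render each digit as a randomly chosen MNIST image of that digit.
import random

GRAMMAR = [                       # example grammar; the real benchmark grammar differs
    [0, 1, 2, 3],
    [0, 3, 2, 1],
    [4, 5, 6, 7],
]

def stochastic_digit_stream(mnist_by_digit, rng=random.Random(0)):
    """mnist_by_digit: dict mapping digit -> list of MNIST images of that digit."""
    while True:
        subseq = rng.choice(GRAMMAR)              # uniform random subsequence choice
        for digit in subseq:                      # deterministic within the subsequence
            image = rng.choice(mnist_by_digit[digit])
            yield digit, image
```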
The second task we looked at was the language modeling task on the Penn Treebank dataset, which is about a million words from The Wall Street Journal reduced to a vocabulary of 10,000 words. The goal is to predict the next word based on the entire sequence up to the current word. They're not using any embeddings, because they're interested in seeing whether an RSM-like architecture can develop meaningful representations itself. This is an example of the output, rollouts of predictions from the model, and you can see that it's doing an okay job.
It's something that you might get out of a bigram model as well. So it's able, for example, to predict the words that follow New York Stock Exchange composite. I started with the RSM model, re-implementing the results from the paper. I haven't fully replicated their results yet, and I'm still at an early stage of looking into the details of exactly what the differences are, but I did start to see some results that were interesting enough to adjust the model a little bit and try a few tweaks.
The first thing I was looking at was using two winners per column instead of one, and using k-winners and column boosting instead of what the original RSM model uses. One of the things that was interesting was the column-level representation: the image predictions are decoded from the max-pooled column states, that is, by taking the max over the cells in each column, and that column-level representation is what actually produces the decoded next-image prediction. So what I was hoping to see was differentiation between the different representations of the same input.
So what it's showing right here, 0 to 1 and 0 to 2, is: we've just seen a 0 and we're about to see a 1, or we've just seen a 0 and we're about to see a 2. We need these representations to be differentiated; we need to be able to discriminate between these cases so we can make a correct prediction for the next time step. What these blue boxes on the diagonal are showing is that there is no column-level differentiation: the column representation of each of these, 0 to 1 and 0 to 2, is essentially the same despite the different predictive state.
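A toy numeric illustration of that point, with made-up numbers: two cell-level states that use different cells within the same columns collapse to the same column code after the max.

```python
# Two different cell-level states (e.g. the state after "0 -> 1" vs "0 -> 2")
# that activate different cells in the SAME columns max-pool to identical
# column-level codes, so a decoder reading the column code cannot tell them apart.
import numpy as np

state_0_then_1 = np.array([[1.0, 0.0],    # column 0: cell 0 active
                           [0.0, 0.7]])   # column 1: cell 1 active
state_0_then_2 = np.array([[0.0, 1.0],    # column 0: cell 1 active instead
                           [0.7, 0.0]])   # column 1: cell 0 active instead

print(state_0_then_1.max(axis=1))  # [1.  0.7]
print(state_0_then_2.max(axis=1))  # [1.  0.7]  same column code, different cells
```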
Yeah, we wanted everything nicely separated for sure. Well, the way you separate it, at least the way we do it in temporal memory, is that essentially, instead of single neurons, you have unions of cells that are decoded simultaneously. So you really do end up with multiple predictions based on multiple previous states, and you can just put a classifier, or something of that sort, on the particular subset of cells which represents each case.
Exactly why they did it this way, decoding through the max-pooled column versions, and this was experimental on their part, is still not totally clear to me. I think there was some argument for keeping the parameter count low, but you could decode from the full memory instead, which I think makes more sense. And when they're doing the recurrent iterations they are using the full memory; it's just for the image prediction, which is where the loss metric comes from, that they collapse down the activations. So it's true that one solution to this problem is to do something closer to what HTM does.
If you think about melodies, you have to be able to do this, right? In any melody there are, in some sense, only a few notes or intervals, and every melody is composed of some sequence of these, so you have to get these representations right for all of those transitions. I think the trick is that you don't want to remember every subsequence of all time; it's more like the things that repeat a lot.
I mean, the RSM style is this max pooling over the columns. So I was interested in trying to encourage representations that were more differentiated by the predictive state, that is, by what they're predicting next. So I was looking into a flat version, as well as a flat partitioned version, which is what this picture shows: there's no weight sharing, so every cell has a unique weight matrix from the input and a unique weight matrix from the previous recurrent state. They're all taking the feed-forward input, and potentially this seems like it could be a more flexible memory, allowing more differentiation. So we can have some cells that are sharing the predictive state; it's a bit hard to see there, but, for example, the blue cells indicate that the current digit is an eight, and this might allow us to have the same cell representations in the next predictive state and therefore share between sequence items where that's useful.
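A rough sketch of the flat variant as I understand it, with the shapes, names, and k value assumed for illustration:

```python
# Sketch of the "flat" variant described above: no weight sharing within a
# column; every cell gets its own feed-forward weight vector and its own
# recurrent weight vector from the previous memory state, with a flat k-winners.
import numpy as np

n_cells, d_in = 400, 784
rng = np.random.default_rng(0)
W_ff = rng.normal(0, 0.01, (n_cells, d_in))      # unique per-cell input weights
W_rec = rng.normal(0, 0.01, (n_cells, n_cells))  # unique per-cell recurrent weights

def flat_step(x, memory_prev, k=20):
    drive = W_ff @ x + W_rec @ memory_prev       # every cell sees input plus prior state
    winners = np.argsort(drive)[-k:]             # flat k-winners, no column grouping
    memory = np.zeros(n_cells)
    memory[winners] = np.maximum(drive[winners], 0.0)
    return memory
```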
This performs the best on the stochastic MNIST benchmark. On the dataset I was showing earlier, with eight subsequences of nine digits each, the theoretical accuracy limit is 87.26 percent; that's the best you can do if you're making exactly the right prediction at every time step. The flat model got to 74 percent after about twenty thousand mini-batches, and the flat partitioned model got to about 86 percent after just two thousand, so it learns extremely quickly and it gets much closer to the theoretical limit.
Actually, the sequences used in the paper were fixed, whereas these are stochastically generated sequences. They were just putting in 0-1-2-3 and 0-3-2-1 over and over again, without any uniform random selection of the next subsequence, and that's a much, much easier task. Once you add this stochasticity of uniformly selecting the next subsequence, the original model performed much worse.
So that's why I was doing this kind of architecture exploration: the goal was to try to improve that performance, and that did reasonably well. I was pretty happy with those results, though there is the limitation that the benchmark we were using is pretty much just second-order sequences. There are only a couple of digits that require looking three time steps back to predict the next digit, so it might not be quite a difficult enough benchmark. That's also why I'd like to experiment more with the language modeling task.
For stochastic MNIST, I think we should point out that we're not actually sending in symbols, we're sending in MNIST images. Yes, so in order to actually do this, it essentially has to solve MNIST as well, and that's something that the traditional temporal memory could not do. Yes, it actually is generalizing from a training set to a test set here, while learning the sequences.
You look at the grammar and you figure out what the best thing you could predict is at every time step. It's always impossible to predict the first item of a subsequence, because you don't know which one has been uniformly chosen, but you can predict, for example, in this particular grammar, you can always predict a two and you'll get it right three out of eight times, or whatever it is, yeah, three out of eight. The next digit is going to be similarly difficult to predict, and then after that it becomes deterministic, so you can always get the remaining items correct.
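To make that reasoning concrete, a hypothetical helper like the one below computes that kind of best-achievable accuracy: at each position the optimal guess is the most frequent next digit among the subsequences consistent with what has been seen since the last boundary. The grammar passed in is illustrative only.

```python
# Best achievable next-step accuracy for a grammar of equal-length subsequences
# where the next subsequence is chosen uniformly at random.
from collections import Counter

def theoretical_limit(grammar):
    L = len(grammar[0])                       # assume equal-length subsequences
    correct = 0.0
    for seq in grammar:                       # each chosen with probability 1/len(grammar)
        for t in range(L):
            prefix = tuple(seq[:t])
            nexts = Counter(s[t] for s in grammar if tuple(s[:t]) == prefix)
            best_digit, _ = nexts.most_common(1)[0]
            correct += (seq[t] == best_digit)
    return correct / (len(grammar) * L)

print(theoretical_limit([[0, 1, 2, 3], [0, 3, 2, 1], [4, 5, 6, 7]]))
```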
So language modeling is much harder, and in the original RSM paper they noted that it doesn't get anywhere close to the state of the art from things like LSTMs. Language modeling results are often reported in perplexity, where lower perplexity is better; the best possible score is when you're predicting the next word in the vocabulary with one hundred percent confidence and not predicting anything else, putting everything else at zero.
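As a small reference sketch of the metric, using its standard definition rather than anything specific to this model: perplexity is the exponentiated average negative log probability the model assigns to each actual next word, so a perfect point prediction at every step gives a perplexity of one.

```python
# Perplexity from the probabilities a model assigned to the true next words.
import math

def perplexity(probs_of_true_words):
    """probs_of_true_words: model probability assigned to each actual next word."""
    nll = [-math.log(p) for p in probs_of_true_words]
    return math.exp(sum(nll) / len(nll))

print(perplexity([1.0, 1.0, 1.0]))    # 1.0, the best possible score
print(perplexity([0.1, 0.01, 0.05]))  # higher (worse) perplexity
```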
So it's like a point distribution. The percentage there is the next-word prediction accuracy: they got 20.6 percent in their original paper, and the flat partitioned model gets to 23 percent, and maybe around 150 perplexity, after far fewer mini-batches. So there's still something interesting and potentially useful happening there, but I have some speculations about what it's actually doing. As I said, these are sort of disappointing results overall for language modeling, so I think either this is a fundamental limitation of this architecture, or there are some things that we're not doing as well as we should be here, and I have some thoughts on that.

In general, what the RSM and related models tend to do on language modeling is overfit to the training set very quickly, so regularizing, and figuring out how to generalize to unseen sequences, is the hard part. There are a few things the RSM authors suggested for doing this, and I think there's probably more regularization we can look at. The concern I have about the partitioned model is that it's possible for the recurrent layer to essentially be a pass-through for the previous feed-forward input, which would let it do classification with just t minus 1 and t equals 0. For the stochastic MNIST task that's probably good enough, because these are just second-order sequences, so you have the last digit and the current digit.
From those you can do a pretty good job of predicting the next digit. If we test with third- and fourth-order grammars, I think we'll be able to push on it and see how well it does. That's my explanation for why it learns so incredibly quickly and gets so close to the theoretical limit: it's essentially getting direct access to the previous input. And that would be a problem, because it wouldn't generalize very well, because it's not really using the recurrent state of the network as a long-term memory.
It's just using it as a pass-through. If that's true, then in language modeling this becomes essentially an n-gram model where we're predicting the next word based only on the previous couple of words, and I haven't tested this explicitly, but I think we would get similar perplexity if we just tested a bigram model. So my guess is that there are some adjustments we need to make to force the recurrent inputs to not just be a pass-through, and there are a couple of ways I was thinking of doing that.
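That bigram comparison has not actually been run; a trivial baseline of the kind being described, sketched here with a hypothetical helper and toy data, would just count word-to-next-word transitions and guess the most frequent successor.

```python
# Minimal bigram baseline: count word -> next-word transitions on the training
# text and predict the most frequent successor of the current word.
from collections import Counter, defaultdict

def train_bigram(tokens):
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

counts = train_bigram("the cat sat on the mat the cat ran".split())
print(predict_next(counts, "the"))   # -> "cat"
```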
So this seems like a fundamental problem that I think active dendrites are supposed to help us solve, because with active dendrites, as soon as a predictive state is confirmed, we can deactivate those cells; that's my understanding of the active dendrite model. So I think there's an opportunity to add active dendrites to this model as well, and I've thought of a couple of ways we could potentially do that, for example by decoding the actual image that comes in and then subtracting that from the memory state, essentially saying these cells were successful at predicting this image.
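A speculative sketch of that idea follows; all names, shapes, and the threshold are assumptions for illustration, not something from the talk.

```python
# Decode the prediction each memory cell contributed, compare with the image
# that actually arrived, and zero out the cells whose prediction was confirmed,
# a rough stand-in for active-dendrite style "predicted, therefore deactivate".
import numpy as np

def confirm_and_deactivate(memory, W_decode, actual_image, threshold=0.1):
    """memory: (n_cells,); W_decode: (d_img, n_cells); actual_image: (d_img,)."""
    per_cell_pred = W_decode * memory            # each column is one cell's contribution
    alignment = per_cell_pred.T @ actual_image   # (n_cells,) agreement with the real image
    confirmed = alignment > threshold            # cells whose prediction was confirmed
    memory = memory.copy()
    memory[confirmed] = 0.0                      # deactivate successfully-predicting cells
    return memory
```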
Yeah, that's the potential of active dendrites. I think stacking into a hierarchy is something that the RSM model is designed for. Essentially, if you have a second layer on top of a previous layer, it's doing next-memory-state prediction of the layer below, whereas the final layer, the layer at the bottom, is doing next-image prediction. So this was designed for that kind of hierarchy, and I think the original authors mentioned that they were finding that the higher layers were just learning the exact same sequences. I think there are ways we can get around that as well, potentially by modifying the hyperparameters for the higher layers, maybe adjusting the decay and inhibition rates to encourage more invariant representations. So I think this is part of the solution to some of the repetition, overfitting, and other issues that we're having. But stacking is not trivial, so there's some work that will need to go into that.
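As a sketch of how such a stack might be wired, building on the single-layer step sketched earlier; this wiring is my assumption about the intent, where the top layer treats the bottom layer's memory as its feed-forward input and is trained against the bottom layer's next memory state, while the bottom layer is trained against the next image.

```python
# Speculative two-layer wiring (assumed names; layer1_step and layer2_step would
# be functions like the rsm_step sketch above, each returning (memory, column_code)).
def two_layer_step(image, mem1_prev, mem2_prev, layer1_step, layer2_step):
    mem1, code1 = layer1_step(image, mem1_prev)   # bottom layer: trained on next-image prediction
    mem2, code2 = layer2_step(mem1, mem2_prev)    # top layer: trained to predict layer 1's next memory
    return mem1, mem2, code1, code2
```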
And then, of course, the encodings that we're using are fully non-semantic. It's a binary encoding, so there's no notion of similarity between similar words, and these models would presumably start to do much better with a decent embedding, so I think that's worth trying. The reason they didn't use embeddings is that they wanted to see whether or not representations could be learned from scratch, but if you want to see how this compares to the state of the art, we have to use embeddings.
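A tiny illustration of the difference, with toy vectors and made-up numbers: one-hot codes carry no similarity between related words, while dense embeddings can.

```python
# Non-semantic one-hot encoding vs a dense embedding lookup.
import numpy as np

vocab = {"stock": 0, "share": 1, "banana": 2}
one_hot = np.eye(len(vocab))                        # current binary encoding
embeddings = np.array([[0.9, 0.1],                  # e.g. pretrained dense vectors
                       [0.85, 0.15],
                       [0.0, 1.0]])

print(one_hot[vocab["stock"]] @ one_hot[vocab["share"]])        # 0.0, no similarity at all
print(embeddings[vocab["stock"]] @ embeddings[vocab["share"]])  # > 0, semantically close
```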
Yeah, so I think we've now got a much better feel for the different aspects of all of these variants, partitioning and non-partitioning. We can also bring it back a little bit closer to the temporal memory model, and active dendrites are a step towards that, particularly in the way that inhibition happens and the way we're carrying over state, which we're not even sure can actually do more than second order, or...
That could just be because of something that we're doing wrong with the way the partitioning or the flat models are working. It's like a local minimum where it's just using the feed-forward state, and I think we can get beyond that local minimum by enforcing more reliance on the recurrent input.