From YouTube: Backprop-Trained Permanences (NRM Feb 10, 2020)
Description
Numenta Research Meeting, with Marcus Lewis presenting a writeup on Backprop-Trained Permanences. See it at https://github.com/mrcslws/nupic.research/blob/backprop-structure/projects/backprop_structure/documents/backprop-permanences/backprop-permanences.pdf
Discussion at https://discourse.numenta.org/t/backprop-trained-permanences-nrm-feb-10-2020/7166.
Yeah, thumbs up, looks like we're going; it's working. This shouldn't be too long, unless it becomes long, because I'm covering something I've talked about before in these recorded meetings, just with a bit more in the way of results and more of a summary.
So I have this writeup here that goes into detail about all of this, which means I don't have to go deep into the details here; I'll just tell you the broad themes. This part is called achieving sparse connectivity.
It computes the effective weights by multiplying the stored weights by this gating variable, and to connect this directly to permanences, the whole thing works if you use the rule where, if the permanence is less than 0.5, the gate zeroes out the weight, and if the permanence is greater than 0.5, it lets the weight through: the connection exists.
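As a rough sketch of that deterministic rule in PyTorch (the naming here is mine, not taken from the writeup):

```python
import torch

def gated_weight(stored_weight: torch.Tensor,
                 permanence: torch.Tensor,
                 threshold: float = 0.5) -> torch.Tensor:
    """Effective weight = stored weight * binary gate.

    The gate is 1 where the permanence is at or above the threshold and 0
    otherwise, so a synapse with a low permanence simply does not exist.
    """
    gate = (permanence >= threshold).to(stored_weight.dtype)
    return stored_weight * gate
```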
For most of the results in this paper, I used an alternate stochastic version of this, where the permanence acts more like a probability: the synapse works with probability equal to theta. But the deterministic version can work too; you just have to train it a little differently, you have to throw in some other kind of dropout, but that also works.
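Roughly, the stochastic variant replaces the hard threshold with a Bernoulli sample. This is only a hedged sketch of the forward pass; the writeup also has to handle the gradient through this sampling, which isn't shown here:

```python
import torch

def stochastic_gated_weight(stored_weight: torch.Tensor,
                            theta: torch.Tensor) -> torch.Tensor:
    """Each synapse is enabled with probability theta (its permanence).

    During training the gate is sampled, so the permanence behaves like a
    per-synapse keep probability rather than a hard threshold.
    """
    gate = torch.bernoulli(theta.clamp(0.0, 1.0))
    return stored_weight * gate
```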
So these are like permanences in the sense that they're used in exactly the same way, during inference for example, but they're trained a little differently. Rather than being based purely on a rule like "this one fired before that one," they're trained on whether the synapse was useful, on whether the synapse helped successful classification. At the core, that's what I'm doing here, and I've written up how to do this, how to train these.
This kind of gating has also been used, for example, for binary weights, where a weight either exists or doesn't exist. I've combined it with weights so that I can use it as a sparse affine tool. Okay, so that's what I've done here. I've also experimented with the binary weights thing and it works, but that just wasn't the task here. So now I'm going to show you 2.1, and then I'll talk about it in terms of the cartoon version. I'm going to scroll down; I talk about the model and everything, and so on.
A
The
left
here,
Google
speech
commands
just
to
talk
about
these
charts.
I
have
accurately
thought
it
on
the
left,
as
you
can
see,
I'm
zoomed
in
way,
unlike
the
90,
some
percent,
accuracies
and
number
of
weights,
is
on
the
bottom
and
that's
on
a
log
scale.
So
moving
to
the
left
is
a
pretty
dramatic,
sparsa
fication,
a
pretty
dramatic
reduction
in
cost
I'm
plotting
here.
The
blue
thing
is
my
me:
are
my
results
basically.
That's still an open question. Okay, so I want to compare this to, for example, something called variational dropout, and also to the older techniques of plain weight pruning: if you just prune out the weights, how sparse can you get, and what have other people gotten? So really, everything I'm going to talk about revolves around these charts, or the cartoon version of them. So here, yeah, I'm just showing the same thing, where the dynamic version can do better than the static one.
Do we especially expect this line to be better? You could imagine that sparse-to-sparse might be somewhere in between these, and we'd be trying to get as far over as we can. That's at least what I would expect. Now, I do want to draw attention to the fact that, and I don't know whether I expected this, dense networks still do get better results, so this does involve giving up some accuracy.
Here's the same chart, just a little more conceptual. Right now, most of us have been working on this first part: reducing cost with a little bit of a reduction in accuracy. At least in my mental model, well, first of all, if this was all we were doing, just this first part, I'd be a little bit uninspired, because in my mental model I'm also looking forward to the next step: we can now essentially run larger networks, larger sparse networks, and they can actually outdo the original one. I'm seeing this as the first stage of then being able to reinvest the savings, yeah, and do better. Another way I could have tried it, you know, we could...
Question: you may have said this, but on the x-axis, is that cost, or number of connections, like log-scale weights or something? Because in your manuscript you make the point that the way this works is that there's a weight decay that pushes against the existence of a weight, and the permanence that pushes for the weight to exist, and then it becomes a balance between that decay and the push to exist; I could be misremembering what you're talking about. I think you make the point that this obviously works to reduce the number of weights that exist, but not necessarily the cost directly, in the sense of every weight that you have. If you wanted to actually optimize the overall compute cost, rather than the number of weights, which is an indirect way of optimizing compute, how would you do that? Well...
There's a table here that goes into detail about which layers achieve which sparsity, and so on, and one of the points I make in the discussion is that if you look at the number of weights, so this top table, the vast majority of the weights are up in the fully connected layers. But if you look at the number of multiplies, the vast majority of them are in the convolutional layers. So, the answer to the question you're asking, of how you optimize for multiplies:
One thing I didn't mention before is that the way I'm sparsifying is that the permanences are always decaying over time. I just didn't mention that explicitly, which is misleading; these permanences are always decaying. All we need to do is make the permanences in different layers decay at different rates. That would prioritize the convolutional layers, and that will really be the way to reduce the number of multiplies, right.
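A hedged sketch of what per-layer decay could look like; the layer names and decay values here are made up for illustration, and in practice the actual rates would come out of the hyperparameter search described next:

```python
import torch

# Hypothetical per-layer decay rates: decay the convolutional layers'
# permanences faster, since that's where most of the multiplies are.
decay_rates = {"conv1": 1e-3, "conv2": 1e-3, "fc1": 1e-4, "fc2": 1e-4}

def apply_permanence_decay(permanences: dict, decay_rates: dict) -> None:
    """Shrink every permanence toward zero at its layer's rate.

    Synapses whose permanences are not pushed back up by the training
    signal eventually fall below threshold and disappear.
    """
    with torch.no_grad():
        for name, perm in permanences.items():
            perm.sub_(decay_rates[name])
            perm.clamp_(0.0, 1.0)
```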
Obviously, that doesn't automatically mean a layer should have, say, ten times the decay, right? No. What it means is that I'm going to have a hyperparameter search, and I'm going to use multiplies as my optimization target, and then I'm just going to let it set whatever permanence decays it chooses, and it's going to optimize that for me. My hyperparameter search is going to do all that thinking for me. Yeah, yeah.
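As a sketch of that idea, with a made-up search loop and a hypothetical train_and_eval function standing in for a full training run, the search might score each trial by accuracy while penalizing multiplies:

```python
import random

def search_decay_rates(train_and_eval, layer_names,
                       num_trials=20, penalty=1e-9):
    """Random search over per-layer permanence decay rates.

    train_and_eval(decays) is assumed to train a network with the given
    decay rates and return (accuracy, number_of_multiplies).
    """
    best_score, best_decays = None, None
    for _ in range(num_trials):
        decays = {name: 10 ** random.uniform(-5, -2) for name in layer_names}
        accuracy, multiplies = train_and_eval(decays)
        score = accuracy - penalty * multiplies
        if best_score is None or score > best_score:
            best_score, best_decays = score, decays
    return best_decays
```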
I guess, but wouldn't it be cool to see, assuming it helps a little bit, to close that loop? The way I would think of that is, instead of optimizing to lower the cost, you immediately reinvest all the savings into additional layers, or wider networks, and then see what you get, you know, and then you can optimize back. You see what I mean?
Ultimately, in the world of machine learning, that's probably not how it goes, but you know, that's the way the machine learning world works today: 0.01% better looks like a win, though whether that's how it really works is another question. We also don't really know what the actual ultimate possible performance is. Maybe these networks are already there; maybe you can't improve them any further.
One thing I'll just point out is that this is a little bit of a different kind of experiment than we often do. Often, when we have an x-axis and a y-axis like this, these dots essentially float around, and I define different sets of parameters to find something that lands around this far along the axis, and then you find other ones around that vicinity. So it was a different kind of difficult hyperparameter search, where I was just choosing all these things to find this cloud of points and show you the best ones, the points that were optimal by some definition of optimal, which may not be typical of things we've done here.
But for inference, for instance, do you still do the sampling? Very important question, and yeah, that's pretty much only answered in the code, so yeah. It's always useful, when you have something like this, for your inference to still be deterministic, so you don't want to sample randomly. What I found worked best, and I found this months ago and I've just been sticking with it, I assume it still works for us:
If you sum all these thetas, you get the expected number of synapses on this unit, so what I do is sum all of these to get that expected number of synapses, and then I select the top k. Okay, so it's like saying k equals that sum; that sum is the expected number of enabled synapses, and then it's choosing the k synapses that are, kind of, the best.
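A minimal sketch of how I read that rule, applied per output unit (the function name is mine, not from the code):

```python
import torch

def deterministic_inference_gate(theta: torch.Tensor) -> torch.Tensor:
    """Pick a deterministic set of synapses for inference.

    k is the expected number of enabled synapses (the sum of this unit's
    thetas, rounded), and the k synapses with the largest thetas are kept.
    """
    k = min(int(theta.sum().round().item()), theta.numel())
    gate = torch.zeros_like(theta)
    if k > 0:
        topk = torch.topk(theta, k)
        gate[topk.indices] = 1.0
    return gate
```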
So hey, thanks for watching. Please hit the like button for me, all you guys, the 14 people watching right now or whatever; it's helpful to get likes on the videos. I just want to let you know that I'm not going to be streaming for a couple of weeks, starting on the 19th, because I've got a medical thing I have to deal with, but I will be back, hopefully soon, and back to streaming as soon as I can. Yes, I will put this on YouTube eventually. Sounds good. Thanks, guys, thanks everybody, thanks for watching. I'm going to cut the stream. Take care, see you on the forums; you should go to HTM Forum, join HTM Forum, if you want to chat about it.