Description
Numenta Journal Club reviews: https://arxiv.org/abs/1712.01312
Discussion at https://discourse.numenta.org/t/learning-sparse-neural-networks-through-l0-regularization/6471/10
So I've been looking at this paper, "Learning Sparse Neural Networks through L0 Regularization" by Louizos et al. A couple of these authors are the ones who created variational autoencoders; that's sort of an aside, but it has some overlap with this. The title says L0 regularization, but I'm not going to talk about it in those mathematical terms.
I'm just going to say: they count how many synapses there are, or how many nonzero weights there are, and the network pays a penalty for having nonzero weights. Their overall goal is neural networks where most of the weights are exactly zero, which is sparsity by definition. The motivation is, I would say, primarily computational efficiency, and also preventing overfitting.
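To make "count how many nonzero weights there are" concrete, here is a minimal sketch of the L0 penalty; the weight values are made up for illustration:

```python
# L0 "norm": the count of nonzero entries, which is what the paper penalizes.
# Unlike L1 or L2, it ignores magnitude: 0.9 and 0.0001 cost the same,
# and only an exact zero is free.
def l0_norm(weights):
    return sum(1 for w in weights if w != 0.0)

weights = [0.0, -1.3, 0.0, 0.02, 0.0, 0.7]
print(l0_norm(weights))  # -> 3
```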
On the efficiency side: with sparse connectivity, when an input comes into the network and feeds forward through it, fewer floating-point operations happen. They're lowering that cost. So that's the general idea, and I'm going to jump around a little bit.
I'm going to talk about what their solution looks like, and then take a step back and explain how they got there. By the way, this presentation is focused on laying out and describing the method, so we can evaluate it on those terms.
C
I'm,
not
gonna,
put
it
into
context
of
other
methods
for
sparsa
fication
I
want
to
study
some
more
before
I
before
I
tell
how
well
this
method
performs
relative
to
others.
I'm
not
ready
to
talk
about
that,
but
but
their
model
itself
is
I,
think
it's
elegant
and
nice
and
we're
looking
at
so
but
I
plan
on
doing
a
little
more
with
this
understanding
how
it
fits
the
broader
context
after
this.
If I were to over-compress and simplify this: what they do is a little bit like having permanences on connections, except the permanences are probabilistic. You can imagine something like our permanence, where we evaluate how connected two cells are, and if it's above a threshold they're connected. This is a little different, though: it's not a hard threshold.
It's like: if the permanence is 0.5, half the time the synapse will be connected and half the time it won't be. And just to be clear, they never use the word permanence; that's just the word I'm overlaying on this. To say it more formally: each of the weights, at each time step, has a stochastic element to it. There is the actual learned weight, which they denote in their own notation.
The paper uses theta because they want to talk about parameters generally, but I've switched to W since I want to talk about weights. The learned weights are there, as they always are in neural networks. Sometimes we talk about getting rid of weights entirely; that's totally orthogonal to this, and it's not the goal here. So they have learned weights like any network, but they also have this extra factor between 0 and 1.
I've drawn this extra factor between 0 and 1, the z's, as sort of a gate, and I've drawn a picture of the type of probability distribution they use. With some probability, z is going to be exactly 0; with some probability it's going to be exactly 1; and there is also some possibility that it's somewhere in between.
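That distribution, with point masses at exactly 0 and exactly 1 plus a continuous part in between, is what the paper calls a hard concrete distribution. Here is a minimal sampling sketch, assuming the stretch-and-clamp construction with the constants I believe the paper suggests (gamma = -0.1, zeta = 1.1, beta = 2/3):

```python
import math
import random

def sample_hard_concrete(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """One draw of a "hard concrete" gate z in [0, 1].

    A smooth sigmoid sample is stretched to (gamma, zeta) and clamped,
    so z lands on exactly 0 or exactly 1 with finite probability."""
    u = random.random()
    s = 1 / (1 + math.exp(-(math.log(u) - math.log(1 - u) + log_alpha) / beta))
    return min(1.0, max(0.0, s * (zeta - gamma) + gamma))

random.seed(0)
draws = [sample_hard_concrete(log_alpha=0.0) for _ in range(10000)]
# With log_alpha = 0 the gate is symmetric: a chunk of mass sits at
# exactly 0, a matching chunk at exactly 1, and the rest in between.
print(sum(d == 0.0 for d in draws), sum(d == 1.0 for d in draws))
```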
If having this connection is not advantageous, the distribution will shift over time toward the zero end, so that it most often comes out 0. If it is advantageous, the distribution will shift the other way. So they're actually modeling the connection itself as a probability distribution.
Interesting, because I think this is closer to the biology than what we did. We said, okay, let's just make it binary with a threshold, but this is probably closer to what's actually happening. It's not clear what the advantage of doing it this way is, though. The advantage seems to be these in-between states, where a synapse is not important yet; those in-between states are, in my mind, part of where you're going to go with this paper.
Now, in principle, for inference they could just sample a random set of z's, sample the mask once, and use that for every inference. There's no inherent advantage to having inference consist of multiple stochastic passes, so they could just sample it once, it seems.
One of the constants they didn't train, though they could have, is the beta; they keep it constant, and it's what causes the curve to have this shape. So now I can move on to what caused them to land on this solution. I'm going to do it in two parts.
First, why probabilistic at all? Why make z stochastic? Here I drew a little vector sign above it; that means it's a mask. Where z before was the gate on an individual weight, the z vector is the whole mask. So why make the connection matrix, the connection vector, a random variable? The reason is this: stepping back to the original purpose, in training these networks they want a cost function that gives a big reward, a big negative cost, to having a weight that is truly zero, a connection that is zero.
Yeah, it's a negative cost, so the point is it wants to reward zeros, or equivalently punish nonzeros. I'll describe it that way: since we've set it up as a cost, we want to punish nonzeros. The key thing here is that they specifically don't want to punish a weight of 1.0 any more than a weight of 0.05. It's a discrete reward: if you're 0, you get the reward; if you're not 0, you don't. That is inherently a very non-smooth cost function.
It's a very discrete cost function, and this whole approach of using infinitesimal gradients to figure out the best network just doesn't work with those kinds of functions. The general way to solve that is to make the mask, the connectivity, a random variable. It would look something like this: take a Bernoulli random variable, which is discrete, either 0 or 1, and with some probability pi it's 1.
Then consider the average cost, the average error you could call it: what is the average error of this network under these probabilities? Now, if you change the probability on one specific synapse just slightly, this cost will vary smoothly, because you're just moving the probabilities a little bit. An infinitesimal change here causes an infinitesimal change there.
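To see why making the mask random smooths things out, here is a toy sketch; the three-connection "network" and its per-mask losses are invented for illustration. Each individual mask has a discrete loss, but the expectation is a polynomial in the pi's, so a tiny nudge to one pi produces a proportionally tiny change in expected loss:

```python
from itertools import product

# Toy "network": 3 gated connections; loss depends only on which gates are on.
# Any per-mask loss table works -- this one is made up.
def loss(mask):
    return {(0, 0, 0): 9.0, (1, 1, 1): 1.0}.get(mask, 4.0)

def expected_loss(pis):
    """Average loss over all 2^n masks, weighted by Bernoulli(pi) probabilities."""
    total = 0.0
    for mask in product([0, 1], repeat=len(pis)):
        p = 1.0
        for bit, pi in zip(mask, pis):
            p *= pi if bit else (1 - pi)
        total += p * loss(mask)
    return total

# The per-mask loss is discrete, but the expectation varies smoothly in pi:
base = expected_loss([0.5, 0.5, 0.5])
nudged = expected_loss([0.501, 0.5, 0.5])
print(round(nudged - base, 6))  # -> -0.002
```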
So by this pair of steps, choosing z randomly and using the expected value, where you weight each mask by its probability, you get a smooth cost function. The problem is that this is not tractable to compute: it's a summation over all possible masks. If you have a thousand synapses, that's a summation over 2^1000 terms.
Well, I'm going to move on. Rather than expanding that sum out, you can approximate it by just sampling a few of these z's: drawing a few likely masks from the distribution and averaging the cost across those. This is the notation that's often used when you're sampling something: take a few samples and average. And they go a step further.
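Sampling instead of summing can be sketched like this, with the same kind of made-up toy loss as before (nothing here is from the paper's code): draw masks from the Bernoulli probabilities and average, which approximates the full 2^n-term expectation with however many samples you can afford:

```python
import random

def loss(mask):
    # Made-up per-mask loss for a 3-connection toy network.
    return {(0, 0, 0): 9.0, (1, 1, 1): 1.0}.get(tuple(mask), 4.0)

def mc_expected_loss(pis, n_samples, rng):
    """Estimate E[loss] by sampling Bernoulli masks instead of summing 2^n terms."""
    total = 0.0
    for _ in range(n_samples):
        mask = [1 if rng.random() < pi else 0 for pi in pis]
        total += loss(mask)
    return total / n_samples

rng = random.Random(0)
est = mc_expected_loss([0.5, 0.5, 0.5], 20000, rng)
print(round(est, 2))  # exact expectation is (9 + 1 + 6*4) / 8 = 4.25
```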
So, okay, that is their approach: they sample a random mask at each time step. The problem now is that if your z's look like this, discrete Bernoulli draws, and you want to train these pi's, to figure out whether each pi should be larger or smaller, you're back to a discrete cost. Say you choose a random mask and then ask: how do I want to change my probabilities?
Yes, and that's where they make it continuous. As you vary pi, the parameter controlling what z is going to pop out, if there is this continuous in-between zone between the two discrete points, then changing pi has a smooth effect rather than no effect.
So the question is: what we really want is for it to gravitate toward exact zeros, disproportionately. That's the optimization technique, and they go through all of this. But at the end of the day, have we lost that property? Is it still really encouraging true sparsity, or is it going to get stuck in a lot of these in-between states?
I can answer that in two ways. One answer is that I should experiment with it directly and see what these z's end up looking like as I train it; that may give answers I don't have. A slightly more satisfying answer comes from looking at this world of probabilities.
What the network is punished for is how much of its probability mass is anywhere over here, anywhere nonzero. Any mass over here, partial or fully on, pays the cost, and mass that sits exactly at zero pays nothing. So there is a strong incentive for this to be exactly zero, because any mass sitting just off zero pays the same cost as mass sitting all the way over at one.
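That incentive can be made quantitative. For a hard concrete gate, the expected-L0 penalty per connection is just the probability that the gate is nonzero, which has a closed form; here is a sketch checking it against sampling, with the stretch constants assumed from the paper as before:

```python
import math
import random

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def gate_open_prob(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    """Closed-form P(z != 0) for a hard-concrete gate: the per-connection
    term in the expected-L0 penalty. Mass exactly at 0 is free; any mass
    above 0, partial or fully open, pays this probability as cost."""
    return sigmoid(log_alpha - beta * math.log(-gamma / zeta))

def sample_gate(log_alpha, rng, beta=2/3, gamma=-0.1, zeta=1.1):
    u = rng.random()
    s = sigmoid((math.log(u) - math.log(1 - u) + log_alpha) / beta)
    return min(1.0, max(0.0, s * (zeta - gamma) + gamma))

rng = random.Random(1)
draws = [sample_gate(-2.0, rng) for _ in range(20000)]
frac_nonzero = sum(d > 0 for d in draws) / len(draws)
# The closed form and the empirical fraction of nonzero draws should agree.
print(round(gate_open_prob(-2.0), 3), round(frac_nonzero, 3))
```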
So I would say the vast majority of the mass in this network ends up on zero. Now, is a lot of it lying in this in-between region, as opposed to on one? I don't know, so that I can't say. Did they give any numbers in the paper on this? Do they talk about how well they actually achieve sparsity?
Yeah, they plot it against other techniques. They show that the number of parameters, the number of effective weights, is shrinking steadily over time, and they compare it to other techniques. But that's the best I can say right now. I can't tell you something like "they achieved 5% sparsity"; I'm just making that number up. So that's more on the results; I'm still getting a good picture in my head.
Yeah, maybe the last thing I was pointing out: when I looked at this, I was a little worried that these in-between values would stick around, that the network would somehow store weights there, like a half-connected version. But I don't think it's going to do that, because an in-between gate is so unreliable; it's always hopping on and off.
So, okay, the final part I'll talk about. I said early on that it seemed kind of magical to me that you could train a probability distribution, and I've been phrasing it in ways that make it appear magical. There is a straightforward way to do this; once you know it, it's obvious. It's elegant, and it's used elsewhere; it's not just from this paper. Other papers call it the reparameterization trick.
This is so widespread that you'll even see it in the titles of papers talking about reparameterization, and it's clever. So here's the question, the way I've drawn this. Sometimes when people depict backpropagation, they draw the gradient passing back through the network you train, and you use it to figure out how to change the weights. It was like black magic to me: how would you take your gradient and update this curve, this distribution?
What does that even mean? The answer is: if you can sample this distribution by first generating a plain random number and then transforming it, you're in business. By the way, it's just a coincidence that I'm using 0 and 1 here and also 0 and 1 there; the noise could have been a random number between 47 and 49.
If you can start with that raw random number and then do a bunch of math to get your final result, it's still the case that backpropagation, when it arrives at the random part, can't do anything with it. But because you've set it up this way, the gradient comes back, reaches the deterministic math, and can update your parameters there.
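A minimal sketch of the trick (my own illustration, not the paper's code): push all the randomness into a plain uniform draw u, make the sample a deterministic function of the parameter given u, and then the derivative of the sample with respect to the parameter is well defined; a finite difference with the same u fixed shows it.

```python
import math
import random

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sample_z(log_alpha, u, beta=2/3):
    """Reparameterized draw: all randomness lives in u ~ Uniform(0,1); the
    sample is then a deterministic, differentiable function of log_alpha."""
    return sigmoid((math.log(u) - math.log(1 - u) + log_alpha) / beta)

rng = random.Random(0)
u = rng.random()  # sample the noise once...
eps = 1e-6        # ...then probe d z / d log_alpha with that noise held fixed
grad = (sample_z(0.0 + eps, u) - sample_z(0.0 - eps, u)) / (2 * eps)
print(grad > 0)  # -> True: nudging the parameter smoothly nudges the sample
```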
So anyway, that's how the model works, and in the end I think it's pretty simple. It's these, what I'm calling, probabilistic or stochastic synapses, where most of the time most of the synapses are going to have weight zero and sometimes they won't, and that's how it explores: should I form a connection here, should I get rid of this one, and so on. This strategy feels like a natural result.
C
If
you
think
about
the
problem
for
long
enough,
how
you're
going
to
train
sparse
networks,
it
feels
like
this
is
where
you're
gonna
arrive.
If
there's
like
one
of
like
a
logical
conclusion
of
almost
like,
they
discovered
this
rather
than
and
then
fit
it,
so
it
just
feels
next
I,
like
it
train
spars,
you.
One thing about this method that makes it especially nice: a lot of sparsification methods involve training a dense network and then pruning it into something else, whereas this one is, during learning, just continuously getting rid of weights, which is actually maybe more efficient. It also starts out in a nice way: kind of probabilistically sparse. A lot of these gates start out kind of off, maybe mostly off, and then the ones that are useful come on.
Sparsity, to the degree it's enforced, gives you a very different network; it's a totally different way of representing information. With all this talk about overlap problems, it just seems like we might end up with a different set of solutions if you start over with that as your assumption.
Yeah, it would really help to know, even in the ones they tried, how sparse they got. Just looking at their charts it's a little hard to tell. They're comparing FLOPs, and they never really give the factor; you can kind of guess, maybe it's a third, or a quarter, or ten to twenty percent.
I feel like they're trying to say: okay, this one beats a fully connected network of all these high-precision weights. But with what we build, I wonder how to break the symmetry of that comparison. If you start with a system where your connectivity matrix is inherently sparse to begin with, and everything is set up around that, then you would have broken the symmetry of the system in a way that's never going to happen here.