Description
In our previous research meeting, Subutai reviewed three different papers on continuous learning models. In today's short research meeting, Karan reviews a paper from 1991 that he points out was referenced by all three. The paper, "Using Semi-Distributed Representations to Overcome Catastrophic Forgetting in Connectionist Networks" (http://axon.cs.byu.edu/~martinez/classes/678/Presentations/Dean.pdf), was one of the first papers to reference sparse representations in continuous learning.
B
All right, so I was reviewing this paper. Actually, let's start with Monday's meeting: Subutai presented a few papers, and they all seemed to reference this paper from back in the '90s, which talked about the use of sparse representations in continuous learning. I think this is probably one of the first papers published on that, since all of the ones he talked about used sparsity in some way. So I thought.
C
I mean, because you think about the Thousand Brains Theory, that is a semi-distributed representation of the world. It's a sparse distributed model, so I don't think they meant that here, but those words are very reminiscent of it. But okay.
D
Yeah, it's like another acronym for SDR. Yeah, distributed.
B
But here in this one, when they say semi-distributed representations, it doesn't necessarily have to be sparse. That's what I'm asking.
B
Yeah, it doesn't have to be sparse, but I think it ends up being very close to sparse, and you'll see what I mean in some of the later slides.
B
Yeah, I found it interesting that a lot of the other papers this paper cited, and some of the ones that cited it that were published around the same time, were all published in Cognitive Science. Even though they're all neural network papers, none of them were actually published at NeurIPS, and I guess that's because NeurIPS had only just started taking off at that point.
B
So
this
was
like
the
13th
cogside
conference,
but
new
york's
is
only
at
like
it's
first
or
second
conference
by
then
anyways.
Let's
start
so
so,
there's
there's
two.
I
guess
main
types
of
representations.
B
One is a distributed representation, which is very common today in neural networks: the hidden layers come up with fully distributed representations where all the nodes are being used in some way. And then you also have local representations, which are basically like one-hot encodings or something of that sort. But there's a bit of a trade-off between these two types of representations. Neural networks, with their distributed representations, get really good generalization; they're able to generalize to new inputs.
B
But on the other hand, when you want to train them to do something new, they're not able to retain the knowledge they learned on some of the earlier things, because you're updating the way the whole representation is built. Representations such as one-hot encodings, on the other hand, which are local and not distributed, are generally able to retain knowledge, but they're not very good at generalizing to new inputs.
B
So the whole goal is to get something that sort of balances that trade-off and can combine the idea of local representations with distributed representations, and that's where semi-distributed representations come in. Okay, so this is one claim that the author, Robert French, made in the paper, and it's a big one that drives the motivation for everything: catastrophic forgetting is a direct consequence of the overlap of distributed representations and can be reduced by reducing this overlap.
B
He actually defines a way we can quantify this. Say you have four features represented in a vector, for example, and one input is represented by the orange vector and another by the green. The overlap of these two representations is just the minimum of the values at each of the features. So here the minimum is just 0.2, here there's no overlap at all, here the overlap would be 0.9, and here 0.1.
B
So the total overlap would just be the average of all those values. This would be a relatively high overlap, whereas here the overlap is a lot lower, and so this one, I guess, is a bit closer to sparsity.
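A minimal sketch of this overlap measure in NumPy (the orange and green vectors are illustrative assumptions, chosen so the per-feature minima match the values read off the slide):

```python
import numpy as np

# Two illustrative four-feature representations; the exact slide values are
# assumptions, picked so the minima come out to 0.2, 0.0, 0.9, and 0.1.
orange = np.array([0.2, 0.0, 0.9, 0.1])
green  = np.array([0.8, 0.6, 0.9, 0.9])

# Per-feature overlap is the minimum of the two activations at each feature;
# the total overlap is the average of those minima.
per_feature = np.minimum(orange, green)
total_overlap = per_feature.mean()
print(per_feature)    # [0.2 0.  0.9 0.1]
print(total_overlap)  # 0.3
```

Note that for binary vectors this reduces to counting the positions where both vectors have a one, which is the SDR-style overlap D points out below.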
B
So basically what he's saying in this quote is that when you have high overlap, you're going to end up having catastrophic forgetting occur, and the reason for this is here on the next slide.
D
On the previous one, you know, if those things were binary, then this would be exactly our definition of overlap too. They would be basically equivalent: you have to have a one in both places for there to be an overlap.
B
Yeah, exactly. Okay, so we ultimately want to minimize this, and the motivation for minimizing this overlap between the representations of two different inputs is as follows. Say you have the standard model we see in all neural networks, where a unit u goes into a unit v, and the connection is given by this weight w. When you want to calculate the activation of unit v, it's just a non-linearity f applied to the total input, and that total input includes the term w multiplied by x_u.
B
To
u
x?
U
is
just
the
output
of
unit?
U
which
this
whole
thing
is
that
v.
So
when
we
want
to
back
propagate
the
errors-
and
you
want
to
compute
the
the
how
how
w
is
going
to
be
updated.
It's
essentially
essentially
comes
down
to
this
term,
where
you
have
this
x,
which
is
the
total
input
here,
which
is
the
activation
of
unit?
U,
and
so,
w
is
changing
proportional
to
what
x
is,
which
is
which
is
the
output
of?
U?
B
So say you have different inputs, or different tasks in this case. If x_u is always active, then this particular weight w is going to be changing all the time. But if you have low overlap, which is what Robert French is advocating for, then this unit is probably only going to be active, that is, x_u is only going to be non-zero or large, very few times, and so w is only going to change once in a while. That's the intuition here: if you have sparser representations, then w is not going to change as much.
B
You
know,
and
the
rule
is
that
for
one
specific
task
or
one
type
of
input
w's
only
going
to
activate
for
for
that
one
and
then,
when
you
have
a
different
different
task,
w
x
is
not
going
to
activate
at
all,
so
w
won't
really
be
changed
so
each
time
you're,
just
basically
learning
a
different
sub
network
on
the
forward.
Pass,
which
means
you're
only
updating
that
network
on
the
backward
pass.
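A minimal sketch of that intuition in PyTorch (illustrative values, not from the talk): the gradient of the loss with respect to a weight carries the presynaptic activation x_u as a factor, so a zero activation means that weight is untouched.

```python
import torch

# One unit v receiving four inputs u through weights w; f is a sigmoid.
x = torch.tensor([0.0, 0.0, 0.7, 0.0])   # sparse presynaptic activations
w = torch.randn(4, requires_grad=True)   # weights from each u into v
v = torch.sigmoid(w @ x)                 # activation of unit v

loss = (v - 1.0) ** 2                    # some scalar error to backpropagate
loss.backward()

# dL/dw_i = dL/dv * f'(w.x) * x_i, so weights whose input was zero get zero
# gradient: only the third entry of w.grad is non-zero here.
print(w.grad)
```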
C
Yeah, can I interrupt for a second here? One thing has always puzzled me. This all makes sense, but if you're changing all the w's, you're only changing them a little bit, right? You're not changing them all radically.
C
It's not clear to me why exactly everything is forgotten, why it's catastrophic. You know what I'm saying? If I change all the w's a little bit, does it still work, or does everything fall apart? Is it super sensitive to every little nuance in w? Why would that be? It's not obvious to me.
D
Yeah, that would be true if you just did one little gradient update. But usually they'll do lots and lots of these tiny changes until it works well on the new thing, and by then you've changed w quite a bit; it could be pretty dramatic. Each gradient update is very small, but often they have to do lots of them, because if you only change it a little bit, it's not going to do well on the new task at all.
B
Yeah, okay, I'm going to just move on. So they give a suggestion for how to reduce this overlap, and it's basically almost the same idea as doing k-winners. If you have this as your original input, then what you're doing is taking the features that are really large and making them even bigger, and the ones that are really small and making them even smaller. That's what this is doing.
B
Not exactly. They didn't explicitly say what alpha is, but I would assume it only makes sense that it's between zero and one. So here, say this value, 0.8, is a very high value, right, so it's closer to one. Then whatever is remaining, which it didn't get, you would sort of add on, but reduced by alpha.
C
The whole system is still a bit surprising to me: how you can actually get all these weights to be correct for the entire training set, yet it's not robust to them changing. I get it; all I'm pointing out is that it's not intuitively obvious why it's like that. But it's okay, yeah.
B
Yeah, so as I was saying, this in a sense is kind of like k-winners, because you can choose the number of features that you want to make bigger and how many you want to make smaller. It's almost a softened version of k-winners, where, I guess, if you set alpha to one, then it would just be equivalent to k-winners.
D
But it's not comparing to the other ones, is it? It's just...
B
So to know which ones are sharpened and which ones are reduced, you have to know that based on the relative rank ordering.
D
Okay, so here he's just looking at activations, he's not looking at weights at all, and he's sharpening the activations to emphasize the ones that are already higher and de-emphasize the ones that are lower. That's it, yeah.
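A minimal sketch of that sharpening rule (my reading of French's node sharpening; the exact functional form and the choice of k are assumptions): the k most active units gain a fraction alpha of whatever activation they are missing, and all the others are shrunk by the same fraction.

```python
import numpy as np

def sharpen(a, k=2, alpha=0.8):
    """Node sharpening (assumed form): boost the k winners, dull the rest.

    a_new = a + alpha * (1 - a)  for the k most active units
    a_new = a - alpha * a        for all the others
    """
    a = np.asarray(a, dtype=float)
    winners = np.argsort(a)[-k:]                          # relative rank ordering
    out = a - alpha * a                                   # de-emphasize everything...
    out[winners] = a[winners] + alpha * (1 - a[winners])  # ...then boost the winners
    return out

print(sharpen([0.8, 0.1, 0.6, 0.3]))  # -> [0.96, 0.02, 0.92, 0.06]
```

With alpha set to one this collapses to a hard k-winners selection (winners go to 1, everything else to 0), matching the comparison made above.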
B
Okay, so after seeing this, I just wanted to make the connection back to what Subutai presented on Monday. We talked about this paper, OML, where they claimed to have very sparse activations in one of their intermediate layers. After they passed everything through the representation learning network, they showed that they were getting roughly 3.8% activation, but on average all the units were activating for different inputs, so it's not like there were dead neurons here. And that made me think that in this case, since they are using ReLU here and a lot of the units have actually gone to zero, when they're doing the backward pass it probably looks something like this, where only very few weights are being updated.
B
I
guess
one
one
advantage
here
that,
as
opposed
to
regular
dense
networks,
you
know
a
lot
of
the
weights
aren't
even
going
to
be
touched
here
in
this
case.
So
so
and
then
that
team
to
do
well
for
them,
so
I
guess
there's
it's
worth
investigating
whether
or
not
this.
This
idea
of
just
updating
a
very
few
number
of
weights
at
a
time
would
be
helpful.
B
Okay
and
finally,
also
a
few
weeks
ago.
I
guess
I've
I've
shown
all
these
graphs
before,
but
a
few
weeks
ago
you
know,
I
tried
out
on
a
simple,
continuous
learning
task
at
the
different
dense
network
versus
a
sparse
network.
So
there's
no
there's
no
weight,
consolidation
penalty
or
anything
here.
This
is
just
this
is
just
a
regular
loss,
and
I
found
that
I
guess
the
the
difference
is
this
difference
here.
B
Wasn't
too
big,
but
the
sparsity
could
explain
possibly
why
I
was
able
to
get
higher
accuracy
with
the
sparse
network
on
split
amnest,
but
then
I
couldn't
get
that
on
on
the
gsc.
So.
D
Yeah, that's good. If you go back to the previous slide, what you're showing in the feedforward network is that because the activations are sparse, and the weight updates are going to be proportional to the activation, only a small number of those weights will actually get updated, right?
B
But, Subutai, going back to what you said, if we look at the update rules here: when the activation is zero, that kills the gradient right there. But the weight being zero, does that also kill it? Because it's not going to make this term zero, is it?
D
Well, in our case it's not that the weight is zero, it's that the connection is missing. We enforce that with a mask. So it's not that the weight is zero and we increment it and it becomes non-zero; the connection is actually missing. That's one thing, and then, as Lucas said, that also impacts gradient flow back to previous parts of the network as well.
E
Could
ask
a
quick
question
of
that?
If,
if
we're
applying
the
mask
we're
applying
the
mask
the
gradient,
so
the
gradients
are
being
computed,
then
being
clamped
to
zero
for
those
connections.
D
Yeah, there are a few different ways to do it. I think our current approach is just to multiply the weights with a binary mask, and that will just kill the gradients.
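A minimal sketch of that masking approach in PyTorch (a generic illustration, not their actual code): multiplying the weights by a fixed binary mask inside the forward pass zeroes both the missing connections and their gradients.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer whose missing connections are enforced by a binary mask."""

    def __init__(self, in_features, out_features, density=0.3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        # Fixed binary mask: 1 = connection exists, 0 = connection missing.
        self.register_buffer(
            "mask", (torch.rand(out_features, in_features) < density).float()
        )

    def forward(self, x):
        # Masking in the forward pass means masked weights receive zero
        # gradient, since dLoss/dWeight carries the mask as a factor.
        return x @ (self.weight * self.mask).t()

layer = MaskedLinear(8, 4)
layer(torch.randn(2, 8)).sum().backward()
# Gradients are zero exactly where the mask is zero.
assert torch.all(layer.weight.grad[layer.mask == 0] == 0)
```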
D
Yeah, that's an implementation, but conceptually the way to think about it is that it's not that the weight is zero, it's that the weight is missing; the connection is just not there. In that case there just wouldn't be any gradients, and so we implement it by suppressing them, just to take advantage of PyTorch's vectorized operations and so on. But that's just an implementation detail; you could imagine implementing it five different ways.
D
The
the
the
important
thing
is:
it's
not
that
it's
a
weight
of
zero
that
could
then
become
positive
or
negative.
Is
that
the
connection
is
just
not
there.
A
There are a few dynamic sparse methods that make use of that wasted energy, Kevin, like you said. You just separately keep track of all the gradients, even if you're not updating, and then, if at some point a connection has a lot of accumulated gradient, you let that connection grow and you put it back in the network.
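A minimal sketch of that accumulate-and-regrow idea (in the spirit of dynamic sparse training methods; the bookkeeping and the number of connections grown per step are illustrative):

```python
import torch

def regrow_step(weight, mask, grad_accum, n_grow=2):
    """Regrow the missing connections with the most accumulated gradient.

    weight:     (out, in) parameter tensor
    mask:       (out, in) binary mask, 0 = connection missing
    grad_accum: (out, in) running sum of |gradient|, tracked even for
                connections that are currently masked out
    """
    candidates = grad_accum * (1 - mask)      # only consider missing connections
    idx = torch.topk(candidates.flatten(), n_grow).indices
    mask.view(-1)[idx] = 1.0                  # put the connection back in the network
    weight.data.view(-1)[idx] = 0.0           # new connections start from zero
    grad_accum.view(-1)[idx] = 0.0            # reset their accumulators
    return mask
```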
D
Yeah, and the intuition there from a neuroscience standpoint is that it's kind of like Hebbian learning: if these two things are kind of working together, but there's no connection, then you would want to grow a connection there. That's the neuroscience intuition for it.
B
Yeah, I think what I like about it is that it connects sparsity with continuous learning, so I think it's a promising direction also.
B
Go ahead. I just wanted to mention, it just came to me, that I think there was another method proposed for obtaining this sort of update, where they compute the gradient as usual, but then on the backward pass they're only taking the top-k partial derivative values and zeroing out everything else. That way you're only penalizing the weights that explain the loss the most.
D
Was that just another method? Who was doing that, was French doing that, or...?
B
No,
that
was
that
was
another
paper
quarter.
I
can.
I
can
link
you
to
that,
okay,
but
but
that
that
was
another
technique.
Someone
had.
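A minimal sketch of that top-k gradient technique as described (a generic illustration, not the specific paper's code): compute gradients as usual, then keep only the k largest-magnitude entries per parameter tensor before the optimizer step.

```python
import torch

def keep_topk_grads(params, k=10):
    """Zero out all but the k largest-magnitude gradient entries per tensor."""
    for p in params:
        if p.grad is None:
            continue
        g = p.grad.flatten()
        keep = torch.topk(g.abs(), min(k, g.numel())).indices
        sparse_g = torch.zeros_like(g)
        sparse_g[keep] = g[keep]
        p.grad = sparse_g.view_as(p.grad)

# Usage after loss.backward():
#   keep_topk_grads(model.parameters(), k=10)
#   optimizer.step()
```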
A
I have a question about that slide; I don't know if you know the answer. In your slide you have three methods there, like pre-training, OML, and then the one in the middle, SR-NN, and that one shows a lot of dead neurons, right?
B
Yeah, here they were. Why?
C
To me it's a little confusing the other way around, because you know we found with the spatial pooler that you ended up with dead neurons until you had a boosting function, and so that seems like a natural thing to do. So I was surprised that all the units were active in the OML and the pre-training cases without any kind of boosting. That, to me, was surprising.
D
Yeah,
that's
the
one
of
the
points
they're
making
in
this
chart
is
that
through
oml
they
were
able
to
get
something
like
boosting,
but
the
way
they
did
it
is
they
used
an
evolutionary
approach
to
train
the
sparse
network
so
that
it
worked
well
on
continuous
learning.
So
it's
kind
of
it
was
this
very
sort
of
computationally
intensive
approach
to
come
up
with
the
sparse
representation
such
that
it
worked
well
for
continuous
learning
and.