Description
In this meeting Subutai discusses three recent papers and models (OML, ANML, and Supermasks) on continual learning. The models exploit sparsity, gating, and sparse sub-networks to achieve impressive results on some standard benchmarks. We discuss some of the relationships to HTM theory and neuroscience.
Papers discussed:
1. Meta-Learning Representations for Continual Learning (http://arxiv.org/abs/1905.12588)
2. Learning to Continually Learn (http://arxiv.org/abs/2002.09571)
3. Supermasks in Superposition (http://arxiv.org/abs/2006.14769)
B: So what I thought I'd talk about today is these three topics. The focus is on continuous learning, and we'll look at several papers. We're not going to go through each one in any detail, but there are some underlying concepts in each one which are quite interesting and relevant. So I'll spend a couple of slides on each one. The papers would be, sort of, MAML, which we already talked about.
B: That's kind of background — MAML is used in the OML paper and the ANML paper, and, if you remember, there's a variant of MAML called Reptile, so there's this whole animal naming theme going on here. And then there's also this new paper on something called Supermasks in Superposition.
B: So we'll talk about a bunch of those. Before we do that, I thought I'd just review a little bit about continuous learning in HTMs and temporal memory. I'm not going to go into this in detail, but the way we often talk about how inference works in HTMs and how continuous learning works uses a particular type of language.
B: These then cause predictions in specific cells in other mini-columns, as shown in the figure on the right here. The way this prediction happens is that you have connections from some of these cells in these active mini-columns onto one dendrite on this red cell, and another dendrite on this red cell, and so on. So when this red cell detects a pattern on one of its dendrites, it becomes depolarized — the predicted state. That's what's shown with these red dots.
B: If these correspond to the B mini-columns, then we say B is predicted, and then, when B actually comes in, this cell — because it was depolarized — will win, and the rest of the cells in this mini-column won't become active. Let me just turn this on here: okay. And if there was a mistake — if there wasn't a correct prediction here, so if these cells were not depolarized — then all of these cells would become active.
B: We'd pick a winner cell and then we'd learn the previous pattern on a specific dendrite on those cells. Okay, so these dendrites bias these cells to win, and if there's no winner, then we pick a random one and learn patterns on that. So this is the language we typically use to describe it. I'm going to show a slightly different way of presenting the exact same thing. One thing to note is that, after learning, in reality this network is very densely, recurrently connected, and we never really show this in our pictures.
B: Once we've learned a whole bunch of sequences, there are tons of connections going back and forth between these cells, so it's a very densely connected recurrent network. What's kind of going on is that when you're showing a particular task or a sequence, you're instantiating a very, very sparse sub-network that's embedded in this densely connected recurrent network. So again, we don't typically talk about it this way.
B: But what's happening is that these dendrites are actually choosing which sub-network to instantiate at any given point. So if you think about this being a full, really densely connected network: for a particular sequence, you instantiate a particular subset — this guy, this guy, this guy, connected to a previous set of cells — and that's for this one transition in this one sequence. The specific dendrites that become active, by biasing the cell and determining who wins, are actually instantiating this particular sub-network.
B: Okay, so that's, in some ways, what a dendrite is doing: it's choosing a sub-network. Of course, we have very sparse distributed representations, so that avoids significant overlap between the tasks. Even if a few of these cells might overlap with other tasks, or a few portions of the sub-network might, the sub-network as a whole, represented by an SDR, because it's super sparse,
B: it avoids significant overlap between tasks, and that's how you can learn new things and instantiate new things without getting confused with other stuff. And then we talk about the weights being binary, and the permanences essentially choose which weights to make active for a particular task or sequence, via learning. Marcus has drawn this connection before, with his variational inference and variational dropout work — so the permanences, during learning, choose which sub-network is going to be part of a particular sequence or task. Okay.
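The claim that super-sparse representations rarely collide is easy to check numerically. Below is a standalone sketch (the sizes are illustrative, not the actual HTM parameters): two independent random SDRs with 40 active bits out of 2048 overlap by well under one bit on average.

```python
import numpy as np

rng = np.random.default_rng(0)
n, w = 2048, 40  # 40 active bits out of 2048 (~2% sparsity)

def random_sdr():
    # A sparse distributed representation: w randomly chosen active bits.
    v = np.zeros(n, dtype=bool)
    v[rng.choice(n, size=w, replace=False)] = True
    return v

# Expected overlap between two independent SDRs is w*w/n, under one bit here.
overlaps = [np.count_nonzero(random_sdr() & random_sdr()) for _ in range(1000)]
mean_overlap = np.mean(overlaps)
```

With these numbers the expected overlap is 40·40/2048 ≈ 0.78 bits, which is why a sub-network chosen by an SDR barely interferes with the sub-networks of other tasks.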
B: So this is kind of background. I'm saying the exact same thing — this is no change in temporal memory, I'm just explaining it in a slightly different language, and you'll see the point of this as I describe some of these papers. Any questions on that? Is this all already obvious to people, or is any of this new?
D: [inaudible]
B: In our algorithm, it's pretty random. In the neocortex, it might not be truly random — each cell is not going to have exactly the same amount of depolarization, so there will be some randomness in there, something that will choose it — but in our algorithm it's completely random. And it's actually good that it's random: that way you can use the full representational space.
B: Okay, so here are the various papers that I'll focus on now. They all came out in rapid succession over the last 12 months or so, and it's sort of a fascinating constellation of papers on continuous learning.
B: What I'll do is go through them in this sequence. I'll first talk about this one, which is the OML paper; then I'll do the ANML paper, which is down here; and then, as a third one, I'll talk about Supermasks in Superposition. For MAML, I'll just briefly show my drawings from when we talked about it a couple of weeks ago — MAML is used in OML and ANML, so it might be useful just to remember what it is — but the core is these three papers.
B: Okay. In the original MAML paper, they described "good" as being how quickly you can learn each task. The idea is that, through this meta-learning process, you're going to find some optimal set of weights — basically an initialization for the network — and from that set of weights it's going to be primed to be really good at this big task that you've defined. The basic terminology and loop was: there's a network with weights theta, and that's the network you're trying to optimize. You initialize theta with some random values, and then you go through this loop where you pick a task out of your collection of tasks, and then for each task you perform the task.
B: In her case, it was training on k samples. Then, for each task, you're going to get a different network — this theta updated by the gradient steps for that task — and then what you do is update the original theta to minimize the loss of each of these individual networks, here using a test set.
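The loop described above can be sketched in a few lines. This is not the paper's code — it's a minimal first-order approximation (closer to FOMAML, ignoring the second-order term), on a toy family of regression tasks with a single scalar weight, just to make the inner/outer structure concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

def task_batch(a, n=10):
    # A toy task: linear regression y = a*x with a task-specific slope a.
    x = rng.uniform(-1, 1, n)
    return x, a * x

def loss_and_grad(w, x, y):
    err = w * x - y
    return np.mean(err**2), np.mean(2 * err * x)

theta = 0.0                  # meta-learned initialization
inner_lr, outer_lr = 0.1, 0.05

for step in range(500):
    meta_grad = 0.0
    for a in rng.uniform(1.0, 3.0, 4):       # sample 4 tasks per meta-step
        x, y = task_batch(a)
        _, g = loss_and_grad(theta, x, y)
        w_task = theta - inner_lr * g        # inner adaptation step
        x2, y2 = task_batch(a)               # held-out data for the meta-loss
        _, g2 = loss_and_grad(w_task, x2, y2)
        meta_grad += g2                      # first-order approximation
    theta -= outer_lr * meta_grad / 4        # update the initialization

# theta should end up near the middle of the task distribution
# (slopes in [1, 3]), so one inner step adapts quickly to any new task.
```

The point of the sketch: the inner loop adapts a copy of theta per task, and the outer loop moves the shared initialization based on post-adaptation loss.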
B: Okay, so now you've changed the initialization of this network based on how well it's doing on all of these different tasks. Then you repeat, and hopefully over time you're going to find a set of weights that does very well on all of these different tasks. And in the meta-learning case, during meta-testing, they test this on a completely new set of tasks that have not been seen during training — but they're all kind of part of the same family. Okay. So what the OML paper does is apply this basic framework to continuous learning: instead of looking at how fast it learns, they're going to look at how well it remembers previous categories.
B: The output of that is then fed to another network called the prediction learning network. What happens is that during the meta-learning phase, the theta you're going to update — the weights you're going to learn — are for the representation learning network, and during the continuous learning portion, you're just going to adapt the weights of the prediction learning network, to see how well it does for each successive task.
B: Okay, so the training loop they have is very similar to the MAML one. Here, instead of a task where you're going to learn things quickly, what you sample is a continuous learning task — I'll explain this in a second. So you sample a particular continuous learning task where you're going to learn multiple categories in a row; then, for each continuous learning task, you go through this sequence of categories, train on each one, and then compute the loss.
B: Here they look mainly at Omniglot — that's their main result — which is this dataset. I forget how many categories; I think it's something like 1600. They first have a subset of all the categories as their training subset, and then there's another subset, the test subset, which is not seen during training at all. What happens during training is that they randomly choose some list of 200 of those training classes. So these are 200 Omniglot classes.
B: Okay, so this inner loop is the standard continuous learning thing, where you have a set of tasks, you learn each task in a row, and then you want to see how well you've learned all of the classes at the end of the day. And then that becomes the inner loop for this whole meta-learning setup. In reality, if they were to actually do this, this would be a thousand steps, and you'd have to accumulate these gradients over a thousand steps.
B: So what they do is only update five of these at a time — I'm sorry, they do five classes at a time. They pick a random point within this list of 200 classes, they train on the next five classes, then they update the weights, and they iterate. So they don't do a full thousand in here.
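The sampling scheme just described — a random starting point, the next five classes as the continual trajectory, plus a handful of random previously seen classes folded into the meta-loss (the "remember set" that comes up later in the discussion) — can be sketched as follows. The names and the remember-set size are hypothetical, not taken from the paper:

```python
import random

random.seed(0)

train_classes = list(range(200))   # the 200 training classes

def sample_meta_task(k=5, remember_size=10):
    # Pick a random starting point and take the next k classes in order,
    # instead of accumulating gradients over the full sequence.
    start = random.randrange(len(train_classes) - k)
    trajectory = train_classes[start:start + k]
    # The meta-loss is computed on the trajectory plus a small random
    # sample of other classes (a hypothetical "remember set").
    remember = random.sample(train_classes, remember_size)
    return trajectory, remember

traj, rem = sample_meta_task()
```

Each meta-step then runs the inner continual-learning loop over `traj` and evaluates the meta-loss on `traj` plus `rem`.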
B: Okay, any questions? So, going back to here: the continuous learning piece is that they freeze these weights, and then this becomes a 200-head continuous learning problem, and they learn 200 categories in sequence.
B: This works really well. They show what it looks like learning 200 categories at a time, and OML does quite well: at the end of the day, it gets about 63% accuracy across all 200 classes, which, in a continuous learning setup, is really impressive.
B: Some of these — I don't remember what all of these are. I think "pre-training" is if you were to train the network completely on the training set of the 200 classes and then try to do continuous learning; it doesn't work very well. But OML performs quite well here.
B: And the one big point they make in this paper is that, by doing this, the representation they learn is super sparse, which is kind of nice. If you look at any particular output, at any particular point, after the whole meta-learning phase is learned — if you look at the output of the representation network — it's extremely sparse. The other thing they mention is that if you look at the average activation across all of the classes, it's fairly high, meaning all of the units are actually used quite well. In our spatial pooler, this is what we try to get to with boosting as well.
E: Do they get this level of sparsity even though they don't explicitly encourage it?
B: Yeah, they don't encourage it. All they're trying to do is do well on continuous learning, and so the meta-learning process must have picked weights in the representation network in such a way that you always get very sparse outputs.
B: It could have just learned weights such that — they're not using k-winners-take-all or anything like that — they might have learned weights such that the average dot product is very close to zero, and for specific types of inputs, only some of them actually get above zero and some don't. It's not clear to me exactly what's going on. Like I said, it's not explicitly designed to be a sparse network, but the meta-learning process learned that sparsity is best for this continuous learning.
B: ANML tries to take this further — they say that work is built on top of OML. It's the exact same kind of training setup and so on, but their network architecture is different.
B: And so this modulatory network is gating the outputs to the classifier — creating the activity that eventually gets to the classifier. The way I kind of think of it is that this network is learning which activations are important for a given input: it's trying to pick out some characteristics of this input and saying, based on these characteristics, specific output units are going to be more important than others. So it's kind of like an attention system — it's gating the outputs here. During meta-learning, they learn all of these weights. The weird thing is that during continuous learning, only the classifier weights are updated.
B: I was not very satisfied with this. I thought: is this really continuous learning, in that case? Well, it is continuous learning, but is it really solving the catastrophic forgetting problem? Because during continuous learning, all of these weights are fixed. So I was left slightly unsatisfied with that. I would have been happier if at least some part of this basic network was also updated.
B: So what do you think is happening in the last layer? It's just memorizing everything, yeah. When it's learning class k, only the head for class k is being updated. I don't remember if they're learning a couple of categories at a time or just one category at a time, but I think it's essentially, like you're saying, memorizing which activations it should attach to that head.
B: So I thought this was a little odd, and not fully satisfying as a continuous learning system, or as solving the catastrophic forgetting problem. But they get really good results. Here they go up to 600 categories in a row instead of 200, and you can see that ANML, during testing, retains something like 63% accuracy on all 600 classes that it was trained on.
B: Just as a reminder again: during meta-training, it's trained on hundreds of classes, where each time it does continuous learning on some subset of those classes. Once meta-learning is finished, they don't change those weights at all. It gets a completely new set of classes, and now it learns 600 of them in a row. It has never seen those classes during training.
B: So it's a completely new set of classes, and it's able to learn 600 of them in a row, where learning involves just learning the output weights. So this is quite impressive — that they were able to learn 600 classes. One really interesting twist to all this, which I just learned recently: I was in touch with Khurram Javed, who did OML, and he just found a PyTorch bug in the scripts.
B: As we know, writing these continuous learning things in PyTorch is really tricky. I think he didn't accumulate the gradients quite correctly. So he just found a bug in the scripts, and after he fixed the bug, if he puts in a single output layer — which sort of mimics the OML and ANML setup — he gets the same results as ANML.
B: At the outputs — before and after the gating here — what they see is that, after the gating, the outputs are super sparse again. So in their case also, the classifier is seeing extremely sparse activations coming in. This is the output of the basic network, this is the output of the gating signal, and wherever there's a coincidence between those two, that's presumably where there's an activation coming through.
F: I asked a question on Slack about ANML using a remember set. I don't know if you had a chance to see it.
B: A little bit of replay — yeah, that's interesting. It's been a while since I read this paper, so that could be.
B
This
one
right,
yeah
line
four
yeah,
yeah
yeah,
exactly
yeah,
so
so
the
training
on
this
new
trajectory,
as
well
as
some
subset
of
the
previous
ones,
yeah
yeah.
So
that's
that's
a
little
weird!
That's
interesting!
You
know
that
that
means
you
know
they
must
have
put
that
in
because
they
needed
it
right
exactly.
F: It's two different ways of setting up the test and, of course, if you put the remember set in the meta-loss, it could be even better.
B: What I do remember is that in OML, instead of training on all thousand things in a row, they just pick five classes at a time, and just train on four or five instances of each class at a time.
B: So it's slightly different from replay, but that information is being used. So maybe that's what this thing here is.
B: Well, it's interesting — we can look at it. I didn't go into so much detail on each one. I don't know, Quran, if you had a chance to look through that at all, if you remember any of that.
B: Yeah — so what you're saying is that in computing the meta-learning loss function, they're incorporating a few random examples. So they might learn on five classes in sequence, but of course they're going to test on those — I'm sorry, they're going to compute the meta-loss on those — but they're also using some other previously learned classes from the full set of 200 or 600.
C: Yeah, in there, yeah. Okay, that's a good point.
E: You said the bug he had was that he was zeroing out the gradient each time — was that it?
B: [inaudible]
F: Oh, I had — well. So you said it's a bit disappointing, but it's not that different — ANML from OML — because in OML, theta is the convolutional part, the first four layers, and then W is the fully connected part, the last two. So, yeah?
F: The convolutional layers are also frozen there during the inner loop, and they're updated in the outer loop. And what they did in ANML, in order to keep the network the same size: they had the four convolutional and two FC layers, and they broke it into two networks, so that the NM network has two conv and one fully connected, and the prediction network has two conv and one fully connected. So what they do is freeze the conv and only train the fully connected — but that's the same thing they do in OML.
B: Yeah, yeah. The disappointing part to me — yes, I totally agree with everything you said — the disappointing part to me is that, to me, this is just a classifier.
B: Yes, it's a linear, fully connected network, but all you're doing is training a subset of the classifier here — you're doing multi-head training, right? So you know what task it is, you're only updating those weights, and the hard work is all done during meta-learning.
B: Yeah — you know, they could have done both. I don't know how well this would work if they had two hidden layers in here, right? Yeah. I guess, in some sense, in that case this could have learned to do nothing.
B: Okay. So there's the gating that was done by the ANML approach, and in both OML and ANML they get really sparse activations, so that's kind of nice. All right — then there was this other paper that just came out, I think just a month or two ago: Supermasks in Superposition.
B: This is slightly different — there's no meta-learning here. In fact, the learning is super simple. So here you have the standard case where you're outputting a probability over all the classes, given a set of weights and an input. This is the output of your neural network, and it's trained with a cross-entropy loss. So this is just the standard setup, and in continuous learning, you have a bunch of different L-way classification tasks.
B: Masks — okay. So that means that if you think about doing continuous learning: for task one, you have a particular subset of the network that you use as your network for task one — that's what they call a supermask — and you have another mask for task two, another mask for task three, and so on. All of these masks are subsets of the same underlying, highly connected network.
B: Okay. So a supermask is a sparse sub-network with essentially binary weights that are never changed. Each supermask is specific to each task, and they make the point that this is really efficient to store: you don't need to store all of the weights, you just need to store the random seed that generated the weights, and then you store each of these masks, which are binary. Okay. So the only question now is: how do you learn these supermasks for each particular task?
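The storage argument can be made concrete with a toy sketch. Since the underlying weights are random and never trained, they can be regenerated from the seed on demand, and the per-task state is just a binary mask. The masks below are random stand-ins (in the paper they are learned per task), and the ±c weight scheme follows the signed-constant idea discussed later in the meeting:

```python
import numpy as np

SEED, SHAPE, C = 42, (8, 8), 0.1  # illustrative sizes

def make_weights(seed=SEED, shape=SHAPE, c=C):
    # Weights are fixed and never trained, so we regenerate them from
    # the seed on demand instead of storing them.
    rng = np.random.default_rng(seed)
    return np.where(rng.random(shape) < 0.5, c, -c)  # signed constants

# Per-task storage is just a binary mask (hypothetical masks here;
# in the paper each is learned for its task).
masks = {
    "task1": np.random.default_rng(1).random(SHAPE) < 0.5,
    "task2": np.random.default_rng(2).random(SHAPE) < 0.5,
}

def task_network(task):
    # The supermask selects the sub-network used for this task.
    return make_weights() * masks[task]

w1 = task_network("task1")
```

So the total storage is one seed plus one bit per weight per task, rather than one float per weight per task.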
B: This part actually was not clear to me in the paper — there are a couple of different possibilities — but they do cite a previous paper by them called "What's Hidden in a Randomly Weighted Neural Network?", and this is how they learn the masks there. So I'll describe that. I'm not 100% sure how they actually learn it in this paper, but in their previous paper, what they do is, for each edge:
B: Each edge has a weight, but it also has this score s_uv. Then, in the forward pass, they only look at the weights which have the top k percent of the scores — so they pick some subset, where k determines how large that subset is. But in the backward pass, they update all the scores with a straight-through estimator. So they update the scores only, not the weight values.
B: So, during the backward pass, a score which was not in the top k percent could suddenly move into the top k percent, and this one could now drop out. So you can actually change the connectivity dynamically through learning, and to me this is very analogous to what Marcus showed last year as well, in terms of learning sparsity through the variational techniques.
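The edge-popup mechanism just described can be sketched for a single linear layer. This is a simplified toy (MSE loss instead of cross-entropy, one layer, hand-derived gradients), not the paper's implementation; the key points are that the forward pass uses only the top-k edges by score, while the straight-through backward pass updates every score:

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(0, 1, (4, 3))          # fixed random weights, never updated
scores = rng.uniform(0, 1, W.shape)   # learned "popup" scores
k, lr = 0.5, 0.1                      # keep the top 50% of edges

def top_k_mask(s, k):
    thresh = np.quantile(s, 1 - k)
    return (s >= thresh).astype(float)

x = rng.normal(0, 1, 4)
target = np.array([1.0, 0.0, 0.0])

for _ in range(100):
    mask = top_k_mask(scores, k)
    y = x @ (W * mask)                  # forward: only top-k edges are active
    grad_y = 2 * (y - target) / y.size  # MSE gradient at the output
    # Straight-through estimator: the gradient flows to EVERY score,
    # including edges that were masked out this step, so masked edges
    # can "pop" back into the top k.
    scores -= lr * np.outer(x, grad_y) * W

final_mask = top_k_mask(scores, k)
```

Because all scores keep moving, the set of active edges can change from step to step, which is the dynamic-connectivity behavior discussed above.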
B: I don't know if it's exactly equal or not — Marcus, maybe you can opine on that — but at least conceptually it's very similar. So you can think of this score as almost being like a permanence, except here they're taking the top k percent of the permanences instead of setting a threshold on them.
A: In this case, I wouldn't describe this as the variational stuff — it's more just... but the permanence idea, and tuning it using backprop with the straight-through estimator — yes, it's exactly the same.
E: Subutai, on the backward pass, since they're updating all the scores, that means on the forward pass they basically have to treat it like a dense network, right? Because you need all the activations.
B: You need all the scores in the forward pass during training. Once training is completely finished and you're just doing inference — you don't care about training anymore — at that point, they can just store the mask for each task.
B: So I think this is uniformly set in the beginning, and then they run a couple of steps of gradient descent to minimize the entropy of the output. They're trying to find, basically, a set of coefficients such that the output is very certain. If you think about it, in the beginning the output might look like this, on the left.
B: So here, what they're doing is: if they've run a couple of steps of this gradient descent and the network is still uncertain — say you're evaluating a new example, and even after doing this, your network is still somewhat uncertain, you don't get low enough entropy — then they just instantiate a new mask and train it on that task.
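The entropy criterion can be illustrated with a toy sketch. The paper optimizes mixing coefficients over the superposed masks by gradient descent; this simplified version skips that and just evaluates each stored mask directly, picking the one whose output distribution is most confident (lowest entropy). Weights, masks, and sizes here are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    # Shannon entropy of the output distribution: low = confident.
    return -np.sum(p * np.log(p + 1e-12))

W = rng.normal(0, 1, (10, 5))                          # fixed random weights
masks = [rng.random(W.shape) < 0.5 for _ in range(3)]  # hypothetical per-task masks
x = rng.normal(0, 1, 10)              # an input whose task identity is unknown

entropies = [entropy(softmax(x @ (W * m))) for m in masks]
best_task = int(np.argmin(entropies))  # most confident mask wins
```

As E points out below, low entropy means confident, not necessarily correct — the inferred task can be confidently wrong.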
B: So here they were able to go to 2500 classes in a row, which is super impressive — though they actually didn't do it on Omniglot; they did it on permuted MNIST, which I believe is potentially a simpler task.
B: I think the impressive thing is that there's no task identity during training or inference in this chart, so they're able to handle a wider set of continuous learning scenarios, and even in this case, where there's no identity during training or inference, they can go for 2500 categories in a row.
D: Well, if they're using cross-entropy here — after the numbers get really high, isn't that kind of a very weak signal?
D: If you have lots and lots and lots of these masks, and you're trying to see which one of them is dominating.
E: So, to infer the mask, they're trying to minimize the entropy of the output distribution — but that doesn't necessarily mean that it matches the correct output distribution, right? It could be confident but wrong.
B: Yeah, absolutely, and that could be a problem, but it looks like they're getting pretty good results on this test. So this is pretty impressive.
G: In that plot, do you know what "UB" means — what that refers to?
B: Yeah, I think their upper bound is probably if you train normally on all the classes in a typical setting — what's the maximum accuracy you can get. That would be the best that this particular network could do, assuming full-on training, no continuous learning. I'm assuming that's what that means. Yeah, so I think this is during inference.
B: Yeah, so this is pretty cool — I like this idea; it's very simple. Here they can get to about 73% accuracy on a wide version of ResNet-50. So, a couple of interesting things here that they mentioned — this is the previous paper.
B: I like this chart here. What they show is that, as you make the networks wider and wider, by doing these sparse subsets you can get closer and closer to the accuracy of the dense version of the model. So this is again a case where dimensionality matters — and here also, the way they're doing it, you have many more combinations possible with a wider network.
F: We've seen another example of this idea of learning masks, coming from the Piggyback paper, in the presentation that Shin gave last year.
B: But I thought I'd just close with the same thing again. I thought there were a lot of connections to HTM temporal memory, as you're seeing with the Supermasks one as well as the neuromodulatory network. They instantiate sub-networks specific to each task. In the case of the Supermasks it's not very sparse, but in the case of ANML, at least the output that's sent to the classifier is very sparse.
B: There's an element-wise multiplication choosing what to send out; in the case of Supermasks it's very dynamic, and they use this entropy measure to decide which ones to send out.
B: I think the dendrites option could be much more flexible and much more powerful here, and if we can get to extremely sparse sub-networks, I think that could be super powerful as well. In almost all of these cases, I think they have sparse representations to avoid significant overlap, and in the Supermask case, the weights are close to binary — it's just two different values. And in their case, instead of permanences, they have this popup score, which chooses which weights to make active during learning.
B: So I thought these were kind of interesting correlations. I think they're very much in the same spirit as what was in HTMs, but hopefully we can do things in a much more flexible way. Each of these still seems rigid in some particular way, and it's not fully satisfying yet. Hopefully, if we can incorporate a lot of these principles, we can get to something that's really nice and flexible and powerful.
B: So 50 — 50%, I think, is what they find. Here's the percentage of weights that are on — this is for ImageNet — and 50 down to 30 percent is the sweet spot for them.
F: But it says "layer-wise budget" from 34 — which one is 34?
F: Is thirty percent right, or even lower? I think it's five percent — no.
B: So I think, if the sparsity was actually something close to five percent, they would have mentioned it, because they make such a big deal in this previous paper that the optimal sparsity is around 30 to 70 — yeah, the optimal density is 30 to 70 percent.
B: Yeah, I feel like here, if it was that low, they would have mentioned it. So I'm guessing it's just how much they change every — well, how much is changed every iteration is based on the backprop. So I don't know if they have a budget. I'm confused — I don't know what this layer-wise budget is.
B: Now, interestingly, they didn't look at the activation sparsity — they're just talking about weight sparsity, whereas the other two, ANML and OML, looked at activation sparsity.
E: For some of the supermasks — I guess they were only working in the regime where the weights are binary, but was there a reason for that?
B: It's either minus c or c, randomly.
F: And I think the reason, Quran, is that if you look at the original supermask paper, they started with random weights and they tried to make it simpler and simpler and simpler, to see how far they could get by just learning the mask. And they got to the point where all you needed was to set a constant, and all that mattered was whether the weights were positive or negative. So that was kind of the endpoint of the original supermask paper.
B: And I think they mentioned that it was actually inspired by Hattie's paper.
E: So in Supermasks, I think they cited two papers for the way they actually learn the supermasks. One was the Ramanujan paper — Subutai, I think you also talked about that; that's the one that uses a straight-through estimator, right? Yeah — and then this one that Michelangelo just sent: is this the other method?
F: Yes. So, just to give an overall view: the idea of this original supermask paper was following the lottery ticket paper, and what they were trying to show here is that you didn't even have to reinitialize the weights. In the lottery ticket paper, they learn the mask, they reinitialize the weights to their original values, and then they retrain from there. What they're showing here is that you don't even have to reinitialize the weights.
F: You just learn the masks, and if you reinitialize all the weights to a constant value with the same sign as the original weight — if it was originally positive, it's a plus one; if it was originally negative, it's a minus one — then you get the same results. So this was following the lottery ticket work; that's the "large final, same sign" mask criterion at the end here. If you go through the blog post later, it's actually very interesting — Hattie actually presented it to us here.
B: Yeah, it was in the Brains@Bay meetup last year.
B: I think another nice thing with this particular paper is that they were actually able to get competitive or better-than-state-of-the-art results on this really long continuous learning task, whereas I think with some of the other ones, the results weren't quite state-of-the-art — they were just doing better than, well,
B: a lot better than chance, and better than what you might expect. Whereas here, I don't think there's any technique so far that comes close to their 2500 categories in a row — and the memory usage and speed are really good; it's such a simple technique. They make a point of explaining how they can do this really fast in PyTorch and on GPUs.
B: Yeah, I guess you still have to have it fully instantiated in memory at runtime, so it still has to fit in the memory of a GPU.
B: Yeah — I keep wondering: what if you had networks that were so wide that they just couldn't fit in memory, but you only ever needed a really sparse subset of them? Could you implement it in such a way that you could run that efficiently?