From YouTube: Selfless Sequential Learning
Description
A very interesting and recently published paper at ICLR 2019 studying the impact of sparsity in the context of continual learning:
https://openreview.net/forum?id=Bkxbrn0cYX
Related: Continual Learning via Neural Pruning
Siavash Golkar, Michael Kagan, Kyunghyun Cho https://arxiv.org/abs/1903.04476
A
Okay, folks, we're going to start in just a few minutes. We're transitioning from our stand-up meeting, which just ended, to the research meeting, so just give me a couple of minutes. I just wanted to come live and let you know that we're about to start. I don't think you can hear anything except me right now. I can see the office, but I'm not going to switch to it until they're ready, so you just get to stare at me in my office for a few minutes.
A
This is a series of paper reviews we're doing, a journal club. I hope you're enjoying them. If you are, please be sure to like the video and subscribe to our channel. We're doing this about once a week, or at least trying to. That's my agenda; you can ignore it, but I might be live-streaming some of this stuff later today on Twitch.
A
Just bear with me a minute here, we're almost there. We're just trying to get a version of the document that's marked up already, so it can be displayed for you and we can add emphasis to whatever we're talking about. Hello, Cody; hello, Chris; thanks for joining. I don't know why my chat only sometimes works; it seems to be working now, yeah.
E
Thank you, Marcus. Well, I hope you read it too. On this, just as a disclaimer, I'm not one of the authors of these papers. I do my best to try to summarize the basic ideas of this work, but feel free to interrupt and join the discussion; we'll see where it goes.
E
Essentially, this paper is studying the idea of sparsity, and the importance of sparsity, in the context of continual learning. This idea of sparsity of the connections, or of the activations, so that the representations are sparse, has been around deep learning and machine learning for many years now, and many people think it is important for several reasons, for generalization and to avoid overfitting.
E
But I think this is one of the first works, published at ICLR 2019, that tries to go more in depth about sparsity in the context of continual learning, that is, learning continuously. The basic idea here is that if you have a neural network with a fixed capacity, gradient descent optimization often leads to a saturation of the entire network when you learn from a specific training set or batch.
E
Looking at the contribution of each weight: it's not that you learn from a few examples, and use just a portion of the network; essentially you always use all that you can of the network. Another interesting idea that has been around is that if you increase the number of parameters, even though the problem is the same, you actually get better results, just because of this overparameterization of the network.
E
This is something that has been explored a bit, and it's also known that from a network that is overparameterized you can distill the knowledge into another, smaller network and still be able to reach almost the same level of accuracy. You can find this idea in the pruning works as well, etc. I think this overparameterization is inherent in the concept of gradient descent and the optimization that comes with it.
E
Okay, so essentially the paper is a comparison of how different regularizers and activation functions impact this process. They propose a new regularizer, a new sparsity regularizer, let's say, that is already designed for continual learning, and it's very loosely inspired by biology, in the sense that they account for neurons that activate at the same time in nearby locations of the activation space. So I thought it was a very good paper.
E
I think it's very easy to follow, and I really enjoyed the introduction. Some people, especially on the OpenReview platform (you can find the reviews online), said the introduction was maybe too long or something, but I think it's quite nice and interesting. You have a lot of references here pointing out different concepts in the sparsity literature, I would say. And something I would also like to discuss with you is the idea of sparsifying the representations.
E
I don't remember my own comments. "Sparsity is key for the ideal representation", yeah. That's also very interesting, the idea that with standard gradient descent you get very entangled representations. Essentially, if you look at the importance distribution of the weights, you can see that some weights are more important than others, but if you count them, you see that even for small tasks, small datasets, you use more than 50% of the weights; they are important for that particular task.
E
So this figure is just a cartoon image, I would say, of the ideas of sparsifying the parameters, so the connections between neurons, versus sparsifying the representations, and they argue that sparsity of the representation is more important for continual learning in this case, to disentangle the knowledge that we are encoding into the network.
D
They basically were using ray tracing and lowered the number of samples they had; you could use the sparsity to deduce what the rest of that structure would be. So, if you take it from a pure signal-processing point of view, it would be enough to ask: what is enough of a structure that you recognize it as legit?
E
So I think there are a lot of different proposals in this area, I mean sparsity in general for continual learning. The other paper we also suggested reading for today is using that concept as well, but I don't remember any other continual learning paper, at least to the best of my knowledge, working specifically on sparsity for continual learning, even though I guess it was something many people were thinking about, because the capacity saturation is well known.
F
I would say that in general, neural activations are sparse, for whatever reason, but there are also brain areas that specifically do organize sparse codes: for example, the hippocampus, which is really important for memory acquisition, particularly long-term memory acquisition, and then driving the consolidation process. It has an input area called the dentate gyrus, which has been studied a lot with respect to its input-output dynamics, and it's known that its function is really sort of a pattern orthogonalizer: it makes highly separable...
F
...representations of the input activations from entorhinal cortex, which might have a lot more overlap. Then what comes out of there for the actual memory-encoding process, if you think of the deep structures of the hippocampus, is a lot more pattern-separated, allowing you to learn independent episodes and for memories to not interfere with each other. But that is in part because it's a very small brain area, right; it's like 2.5 million pyramidal neurons in the main auto-associative field.
F
There you learn patterns through the auto-associative recurrent connectivity with highly plastic synapses, which then encode the pattern, and even then that's a transient process, right; the hippocampus does forget, and through the consolidation process it eventually becomes irrelevant. So there are some brain areas where people have looked at the input-output characteristics and concluded: yes, this brain area mostly does orthogonalization and pattern separation. Those two exist; I don't know how widespread that is.
C
Yeah, so Marcus worked on something called lateral pooling, which was taking a spatial pooler and giving it this extra property where the individual mini-columns try to decorrelate with each other. So he did work on that project for a while, assuming that would be better.
E
So f of theta n, where n here is the current task you want to solve. You have the standard loss function trying to minimize the prediction error, and then you add another term that is going to regularize your learning process. And again, as we have seen before, this is weighted with a static real value here.
E
The current weight is the same weight you're trying to modify, to move in the direction that reduces this loss, right. And the idea is that as you move far away from the optimal weights of the previous task, and this weight was very important for the previous task, this penalty is going to grow, to explode, right.
E
So that's the concept, and it was shown in 2016 by elastic weight consolidation from DeepMind, and by synaptic intelligence from Ganguli's group at Stanford, and our own work later on; many people have put some effort into this idea of regularizing learning over multiple tasks or batches.
E
So this is an easy fix, right. The lambda term here just says how much you want your weights to remember the past versus just learning the current task. So if this lambda-Omega term is high, you're saying: okay, I really want to remember stuff.
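As a rough sketch (not the authors' exact code), the importance-weighted quadratic penalty being described, in the style of EWC and synaptic intelligence, might look like this; all names here are illustrative:

```python
import numpy as np

def regularized_loss(task_loss, theta, theta_prev, omega, lam):
    """Importance-weighted quadratic penalty in the style of EWC /
    synaptic intelligence: parameters that were important for previous
    tasks (large omega) are pulled back toward their old values.

    task_loss  -- scalar loss of the current task
    theta      -- current parameter vector
    theta_prev -- parameters at the end of the previous task
    omega      -- per-parameter importance estimates (>= 0)
    lam        -- trade-off between remembering and learning
    """
    penalty = np.sum(omega * (theta - theta_prev) ** 2)
    return task_loss + lam * penalty
```

A large `lam` makes the network prioritize remembering; `lam = 0` recovers plain single-task training.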
E
Yeah, I guess it depends: if that importance term is taking into account all the previous tasks, then you need just one regularization term; otherwise you need multiple of them. I don't remember exactly how they do it, but the original proposal from DeepMind was to have a regularization term for each task.
E
Essentially, this is for the second task, let's say; then when you move on to the third task, you need to add another regularization term that is relative to it, yes. Or you can compress the importance values: as you said, you can say, okay, let's say I'm in a situation in which I consolidated the weights for all the previous tasks; then I can say, okay, now all these weights are preserved already.
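The compression being described, folding per-task importance estimates into one running estimate so a single regularization term stands in for all previous tasks, could be sketched like this (the function name and the choice of merge rule are assumptions for illustration):

```python
import numpy as np

def consolidate_importance(omega_running, omega_new, mode="sum"):
    """Merge the importance estimates of the task just finished into a
    single running estimate, so one regularization term can replace
    keeping one term per previous task."""
    if mode == "sum":
        # A weight important for any past task stays protected.
        return omega_running + omega_new
    if mode == "max":
        # Protect each weight at the level of its most demanding task.
        return np.maximum(omega_running, omega_new)
    raise ValueError(f"unknown mode: {mode}")
```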
E
You preserve as much as possible the weights that have been important in the past, and they cannot change; it's like freezing those parameters, right? So the idea is that if they have been important in at least a few of the previous tasks, then you want them to not change, right.
D
As opposed to identifying the structure of these tasks and preserving that, I'm going to assume it all kind of falls out that, on average, this thing is important. But that assumes these can be treated as independent attributes, as opposed to there being something distinctive about this particular combination, this set of parameters, such that if I start building things on it...
D
What I'm saying is that I think there's a road to examine that assumption and see how much you lose by making it. I don't know, maybe you partition the parameters in some arbitrary way, keep them separate, and then see how much loss you get when you repartition these things, yeah.
E
It's a trade-off, in a sense: how much structure you want to impose on the network versus how much you want it to find whatever structure by itself, yeah.
E
And this proposal itself, especially from a math background and perspective, is very nice, because you have just a loss function, end to end, to optimize, without imposing any structure.
E
Okay, so this was the basic idea of controlling forgetting through regularization, pure regularization. But then their proposal was to add an additional regularization term controlling, let's say, the sparsity of the representations at all possible layers; it was called SSL. And then they say: okay, now let's go deep into this, because it can be implemented in different ways. So the first proposal is this one.
E
Here, essentially, we are looking at a specific layer. You have several neurons; H are the activations. This term is just, well, the number of units we are considering, so we can ignore it, but the idea is that you have several activations and you look at all the possible pairwise combinations of activations in the layer.
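A minimal sketch of this kind of pairwise activation penalty (not the paper's exact formula; names are illustrative): each pair of distinct units is penalized by the product of their average activations, so units that tend to be active for the same inputs inhibit each other, pushing the layer toward a sparse, decorrelated code.

```python
import numpy as np

def neural_inhibition_penalty(H):
    """Sparsity penalty over all pairs of units in one layer.

    H has shape (batch, units): activations of a hidden layer over a
    mini-batch. Sums, over every pair of different units (i, j), the
    product of their batch-averaged activations.
    """
    mean_act = H.mean(axis=0)                    # expected activation per unit
    total = np.sum(np.outer(mean_act, mean_act)) # all (i, j) products
    diag = np.sum(mean_act ** 2)                 # i == j terms
    return total - diag                          # keep only i != j pairs
```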
E
I think Rob's approach is much more structural in doing this, right. Instead of finding compromises between constraints that you posed, the structure of constraints is already there, in the sense that it's not like you're putting in constraints and then saying: okay, now solve the problem globally with these constraints. It's much more that the constraints are already embedded into the local parts.
F
If the difference is zero, you're guaranteed to have the full penalty, right, and then it decays off with an exponential term. Which is obviously not exactly how a real neuron works, because at, like, a meter out, the connection probability is not some very small number, it's actually zero; there's a physical limit to how big a neuron can be. But it's a decent approximation of the axonal and dendritic overlap over space: the connection probability roughly decays exponentially in the brain.
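The locality weighting being described, full inhibition at distance zero, decaying with index distance, can be sketched as a mask for the pairwise penalty above; a Gaussian decay is one common choice, used here as an assumption rather than the paper's exact form:

```python
import numpy as np

def local_inhibition_weights(n_units, sigma):
    """Locality mask for a pairwise activation penalty: inhibition
    between units i and j is 1 at distance zero and decays with their
    index distance, loosely mimicking connection probability falling
    off with physical distance in cortex."""
    idx = np.arange(n_units)
    d = idx[:, None] - idx[None, :]              # pairwise index distances
    return np.exp(-(d ** 2) / (2.0 * sigma ** 2))
```

Multiplying this mask elementwise into the pairwise penalty makes only *nearby* co-active units inhibit each other, so distant units can form separate, non-competing groups.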
D
This analogy basically says: I'll include the sample if it is close, distance-wise, to the value that I started with. So you basically have this way of saying what the neighborhood is, but this is a neighborhood at an arbitrary cut along these things, as opposed to a two-dimensional one where you say: yeah, I've got this cluster here, and there is some reason to believe that these things behave similarly, right. So that's what bothers me.
E
It said somewhere, I didn't notice where, but maybe they are just doing that for simplicity; they only talk about hidden layers, so I don't know what they mean by hidden layers in this scheme, because I've seen that in their experiments on Tiny ImageNet they also had convolutions, and, well, I don't know.
E
And then, obviously the results, but essentially they want, of course, to reuse the knowledge that the network already formed in the past, and they also want to form new knowledge. So you possibly want new neurons recruited to recognize new patterns, right, and they've seen that with this particular implementation of the loss function, as it is for now...
E
...if you move on to new patterns you want to learn, essentially the weights that have already been trained, these neurons, tend to fire again and to somehow inhibit all the other weights. So you have the same problem, and you know this as well; I think your paper deals with this idea.
F
In your brain that doesn't happen, because the resources are finite: you have synaptic depression, which depletes transmitter, for example, and you have neural adaptation, which means neurons that are too active are going to be inactive for a while; you have the refractory period: a neuron that has just fired cannot immediately fire again. There are tons of mechanisms in the brain that make sure no resource can be overly important and overused, and there's a biological limit to how fast you can fire. Yeah, like I said.
E
To give a bit of introduction: I think the first version of the paper was not that focused on the empirical evaluation, and then, as the ICLR reviewers asked, it became much more substantial, and now we have here three different benchmarks, so at least CIFAR and Tiny ImageNet. So now, if you work with them... but I guess they are still considering a maximum...
E
Then you have a few classes you want to distinguish, and 100 means there are 100.
E
Because it's RGB, it has to be, right. So I guess I would prefer them to use something much more complex, to see if the idea really scales; you'd want to scale this experiment up a bit, because it's not really clear to me why they work that differently among each other. We can see that later, but essentially what you're seeing here are the results on MNIST of different sparsity regularizers, let's say, and activation functions, and their own proposal.
E
Here you see five different tasks on MNIST over time, where every task is to distinguish between two different digits of MNIST. You know the dataset, I'm sure you've seen it: handwritten digits, grayscale. In each task you want to distinguish two different digits, right, and you move from one to the next, I think.
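The split-MNIST protocol being described (10 digits turned into 5 sequential binary tasks) can be sketched like this; the function name and return format are illustrative, not from the paper:

```python
def make_split_tasks(labels, classes_per_task=2):
    """Group class labels into a sequence of tasks, e.g. the 10 MNIST
    digits -> 5 binary tasks (0 vs 1, 2 vs 3, ...). Returns, per task,
    the indices of the samples belonging to that task's classes."""
    all_classes = sorted(set(labels))
    tasks = []
    for start in range(0, len(all_classes), classes_per_task):
        task_classes = set(all_classes[start:start + classes_per_task])
        tasks.append([i for i, y in enumerate(labels) if y in task_classes])
    return tasks
```

The model is then trained on `tasks[0]`, then `tasks[1]`, and so on, never revisiting earlier data.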
F
What stands there is the accuracy for every strategy on every task at the end of training on the sequence of all these tasks, right. And here you can see... I don't know why they didn't put, let's say, a few extra bars here for the average; they put it in the legend instead. But essentially you can see in the legend the average accuracy over all the tasks.
E
Yeah, L1 is red. What's most interesting, at least to me, is to see the difference with respect to the basic ReLU activation. And then another thing the critics raised about the evaluation is that there's no comparison with static training on the dataset with their regularizers, so you can't really tell whether the improvement is just because your network is a better one in general.
E
It's okay that they tested their algorithms only in the continual learning setting, but what I am saying is that maybe these regularizers of theirs would bring advantages even in a standard, static training, so not in a continual learning setting. And then this is saying: okay, yeah, it's cool to use it because it's better, but it's not really helping me see the impact of these proposed algorithms specifically in the context of continual learning.
E
It's like, I don't know, let's say we didn't have convolutions; then I say: okay, you have this task...
E
...a task interesting for continual learning; then I say: okay, I now introduce convolutions for the first time, and they say: okay, it works much better on continual learning, it's a solution for continual learning. And you say: oh okay, that's cool, but if convolutions also work great on a static training set, it means convolutions by themselves are great for learning in general, not specifically for continual learning.
E
I mean, to see if maybe they are just better, so the accuracy in the end is better just because they are better in general. The reviewers pointed out some of this too; they said: yeah, well, the point is working on continual learning, okay, but still. One of the reviewers said that, and I mean...
F
The exact question to the field is, right: given that continual learning is an additional challenge to a system, like you don't know all of your data upfront, it's acquired along the way, right, is there an actual penalty for that? I mean, it's clear that many systems can't do that, but if you build a system that can, how much of a performance hit do you actually take for building such a system?
E
You can't always tell that. There's a later experiment in this paper in which they assess the performance with all the training sets trained together in a multitask fashion, joint multitask training, as they call it. But that's just one place where you can find this information, not for every possible experiment, and I think this is a...
E
It would have been very useful, at least for me. And yeah, another thing is that, frankly, I don't know: we're talking about a difference of three percentage points in accuracy, so it's not so easy to tell. It seems like on this scale that matters, but I don't see that they have done several runs; I don't think it's an average over different runs where you change the...
E
It would be interesting to see the standard deviation, to see if that three percent really has value or just doesn't mean that much. But yeah, of course. Many people discussed this at the NeurIPS 2018 continual learning workshop; people were saying: okay, just don't use MNIST for continual learning, because it's really deceptive, but even for deep learning in general, I mean, right.
E
Another problem here is in this particular setting of permuted MNIST, where you have this permutation. I don't know if anyone has shown it, it would be nice to show it in a paper, but we all know that essentially the network can change just the first layer to learn the permutation, and everything else is fine, right. So you don't have to learn a new task; you just learn the permutation.
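A permuted-MNIST task is built by applying one fixed random pixel permutation to every flattened image; this little sketch (names are illustrative) makes the preceding objection concrete, since undoing a fixed input permutation is exactly a first-layer weight reshuffle:

```python
import numpy as np

def make_permuted_task(images, seed):
    """Permuted-MNIST style task: apply one fixed random pixel
    permutation to every (flattened) image.

    images -- array of shape (n_samples, n_pixels)
    seed   -- identifies the task; same seed, same permutation
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(images.shape[1])
    return images[:, perm], perm
```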
E
Maybe, I don't know, for these plots... and then for the Tiny ImageNet split I don't really buy it. In any case, they say that essentially the improvement now is over no regularization at all. It goes from four to eight: four in the CIFAR experiment, I guess, and almost six or eight in the Tiny ImageNet benchmark. So I tried to look into the paper.
E
...that you test in between, yeah. You know, it's a matter of space. We do what we say in our papers: we like to see the trend of learning, right, and we have a fixed test set, so every time we learn a new thing we test the whole system on everything, so that you can actually see that trend going up. There are other people showing at least the accuracy based only on the things seen up to that particular point in time, so it's...
E
...different, let's say, continual learning strategies plus their own regularizer for sparsity, to see how it goes. There are a couple more pieces of information here, and then another, different task. People were complaining about the fact that they are actually using an oracle telling them, both during training and test, that this is a new task.
E
These images belong to this particular task; those belong to that new task. So you can essentially listen to the oracle during training, and they were using that. So the reviewers asked them to integrate a new experiment, what we call a new-instances scenario: the idea is that you have all the classes from the first batch, but then over time you encounter essentially new instances of the same classes.
E
On CIFAR-10 you essentially have a training set with ten classes, I think.
E
CIFAR-10 is not sequential; it gets turned into different sets learned sequentially, but they're totally independent, yeah. That was something I was complaining about the other day, in the sense that they're saying you need an oracle telling you: now you have to solve this task out of these ten. It's not a single task where you say: okay, whatever I give you, you have to classify among 100 classes; instead we say: now we are showing the second task, with these 10 classes, here are the images.
B
In this method, subsequent tasks are learned using the inactive neurons and features of this part of the network, without causing deterioration of the performance on previous tasks. The whole idea in those methods, elastic weight consolidation and the other regularization methods, is that you always have some overlap; in this case, you're looking for a method that has zero overlap, so you could learn a new task without causing deterioration.
B
Yeah, so this is interesting: he talks about forgetting; that's an interesting part, the concept of graceful forgetting. The idea is that it's preferable to forget in a controlled manner if it helps us regain network capacity and prevent an uncontrolled loss of performance. And he shows empirically that his method, continual learning via neural pruning, leads to significantly improved results over current weight-elasticity-based methods. It's a very good paper; it's a pity to do it in ten minutes.
B
So he explains the difference, I don't think in this figure, but he explains the difference between sparse weights and sparse activations. In this paper he's using sparse activations instead of sparse weights, yeah. And here's the method; actually this paragraph explains everything, so, alright...
B
He divides the weights into three parts. One part is the active weights: they connect active nodes to active nodes, and he's not going to touch those. Then he has free weights, which are weights that connect active nodes to inactive nodes, or inactive to inactive. These are free: if you change those, they're not going to impact the first task anyway.
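A sketch of the weight partition being described, in the spirit of Continual Learning via Neural Pruning; this is my reading of the three groups, not the paper's code, and the labels are illustrative:

```python
def partition_weights(in_active, out_active):
    """Classify each connection of a layer after training a task.

      'used'         -- active input -> active output: frozen, since
                        these weights encode the learned task
      'free'         -- output neuron inactive: retraining these cannot
                        change what the active part of the network computes
      'interference' -- inactive input -> active output: changing these
                        would alter the activations of used neurons

    in_active, out_active -- booleans per input / output neuron.
    Returns a label matrix of shape (n_out, n_in).
    """
    labels = []
    for out_a in out_active:
        row = []
        for in_a in in_active:
            if not out_a:
                row.append("free")
            elif in_a:
                row.append("used")
            else:
                row.append("interference")
        labels.append(row)
    return labels
```

Only the 'free' weights are then retrained on the next task, which is what gives the zero-overlap property discussed above.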
B
The active connections, I think he calls them... no, wait, let's see. These are the ones in blue; next you have weights which connect any node to inactive nodes, which he calls free, so any node to an inactive one goes in this group, and then he differentiates them here at some point, right.
E
I mean, I don't know if this is creating confusion, but I think "active" in this sense is not that they are firing all the time, just that their modulated activity can lead to the recognition of a particular pattern. And, let's say, the blue neurons don't have to fire together; they are just allocated to recognizing several patterns of the first task. Right now you don't want a new pattern to grab them; the idea there is to keep training the entire network.
E
The problem, as we'll see later when we reach that point in the experiments, is that you quickly saturate the network this way, because you essentially freeze it, and also because when we prune, compared with other sparsity techniques, you don't reach the same level of sparsity, unless you know...
E
I think it's nice: you get a very decent representation without forgetting. But, I mean, also consider that it is quite a small problem, and after the fourth batch I've seen that you reach the saturation of the network, so I don't know if this is really scalable, and I like more...
E
Yes, exactly; then they introduced this idea of graceful forgetting, and yeah, I don't know exactly what they change at that point, but essentially they start to change some connections, right, yeah.
B
So I'll move on to the text; there's a nice trick, kind of an aside, but we can go over this plot; having this plot makes it easier.
E
It's definitely interesting, but I think the way they use each neuron is not really efficient. This is what you do when you don't want to have interference: in this formulation you are actually freezing everything, and then you're not reusing the same neurons for...
D
...right, setting up the training for lower levels and basically borrowing from those only what it absolutely has to. So rather than, say, random numbers in general, you're saying: I'm training on something in order to train something else. By and large, after all these tasks, these things are really important, so we'll hold onto them; then you've got to dive down deeper, you know.
E
That's really cool; so it's exploring the weight space. Yes, imposing some priors, clearly, yeah. And there are other observations we have to keep in mind: when you do a multitask joint training of the same network on all these tasks, you can fit everything. So, theoretically, all this information can fit within these limits, and we shouldn't reach these saturation issues at all, well...
D
...showing how you kind of expand into these corners of this hypercube, you know, the space. Mostly the strategies are asking: how do we get to the point where these things are sufficiently disambiguated, so that we actually move smartly and efficiently in the space; these are search strategies, yeah. So if you have some principled way of rapidly expanding into the space in a way that preserves the ability to disambiguate things, yeah, yeah.
B
Thinking of it as an approach toward continual learning using extra capacity: in these cases the networks are small, but I think it's a useful way of learning. If the networks were a lot bigger, maybe you could extend the same thing a hundred or a thousand times, I thought. For me it was a good idea.
E
Maybe the interesting part is how they solve the problem that it's difficult to learn sparsely from the start. In this case they solve it by saying: you can use all of the network and then prune it later, correct. So you use the full network first, because it's really tough to learn already sparse.
B
It's a different approach: not using a regularizer. There used to be two approaches: either you use a regularizer, or you do some sort of architecture search, you add something to the architecture; some works even do both at the same time, really both architecture search and regularization. This case is different, because it's neither of the two: it's not that they add the regularization, and it doesn't work like architecture search in the sense of adding to the network. So it's like a third approach.
E
...allocated an internal network for each task, so it was very difficult and complex and not scalable, rather a collection over all the possible weights. It was exploding in terms of dimensions. But if you think about it, it's what they do: you could think of it as, every time you encounter a new task, you train the network, you do the routing again, and then that's a new column and you move on. So I see it as a similar...
D
I mean, there are certain games I play that, when I haven't played them in a while, it takes me a while to kind of get back to doing that, because presumably the capacity was harnessed for other things, but I haven't lost it: sufficient plasticity remains, like about 90% of what I was, yeah. So it's that kind of behavior I'm looking at, and I'd like to see it reflected at a higher order.
D
You know, but it should at least have this property, so that you don't get into a bind where you're literally at full capacity and nothing more; there's got to be some kind of trade-off in these things. Now, the schema where you say: hey, in reality we have this huge net where I can remember five... how do you want to represent that?
D
...like this thing, because so many of these approaches that use regularizers are using global priors that don't reflect the experience. There's an assumption that this thing somehow has a limited capability, but it is something systemic: what I've learned is what's wired into the network, and operating on that level means I'm truly affecting what was learned, you know, and using a fairly simple principle: what is useful, hold on to; yeah, yeah, repurpose the rest.
A
Oh, it's broken, whatever, I don't know. Anyway, thank you for watching. I think there will probably be another research meeting on Monday, which I'll live-stream; I think Marcus is going to do something. I think there will probably be a research meeting on Wednesday; I might even be talking about something then, and I'll plan that out; and then definitely something on Friday. So I hope you enjoy these meetings.
A
Please do me a favor: like I said, like and subscribe. That's the best way to support what we're doing right now. We're live-streaming all the research meetings, all of our journal club reviews, and the best way to help us is to like and subscribe, so please take a moment out of your time and do that right now, and I will leave you alone. Here come some chats, everybody all of a sudden, all saying the same thing; yeah, you're very welcome. I'm really happy to be doing this sort of thing; I love being transparent about scientific discoveries and research.
A
I think it's a wonderful way that we can just put all of this out there for anyone to consume and build upon and try to work with and understand. We want to understand how the brain works and share everything that we've learned so far; that's what we do here at Numenta. Take care. I'll see you guys on HTM Forum; otherwise, I'll see you live-streaming, probably Monday. I probably won't be streaming the rest of the day, so have a great weekend.