From YouTube: Review: Superposition of many models into one
Description
For this Friday's journal club, we will be looking at a recently published paper from Bruno Olshausen's group at the Redwood Center. It proposes an algorithm that is more plausible in terms of biological constraints. It also ties in well with the latest discussions on continual learning, providing an ingenious and elegant approach to the catastrophic-forgetting problem in multitask learning.
Here is the link to the paper: https://arxiv.org/abs/1902.05522
We'll spend maybe 10 or 15 minutes on this; it's a relatively straightforward paper. It's a very recent paper from Bruno's group at the Redwood Center on parameter superposition, and they're trying to solve a problem in continual learning called catastrophic forgetting. In the general continual-learning paradigm we have multiple tasks, or tasks changing over time: we're shown, say, a batch from task one, then task two, and so on. What typically happens with traditional neural networks is that you train on task one and get good accuracy, and then, as you move on to the next task, the accuracy on the first one collapses.
That's the problem in continual learning. I think I understand this, but I'll ask it anyway — the assumption here is that there's some sort of simulated-annealing-like process going on, where training has to balance out the examples across the different categories? Otherwise, you know... it didn't have to turn out this way.
The general assumption behind most machine-learning algorithms is that learning isn't a time-based thing — that the order of the data doesn't reflect anything, doesn't reflect a shift in the world's statistics. If I'm doing image classification, almost all formulations are not saying that more recent images matter more to the network than the older ones; at training time the examples are just assumed to be interchangeable.
They talk about two different ways that we get task shifts. One is that the inputs are changing distribution. This is the fashion-MNIST example, where you have, say, pants that are rotating: you're actually rotating the images slightly over time. So you might have zero degrees here, two degrees here, four degrees in task three. That's one type of task change, and the other is changing the labels of the dataset.
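The rotating-inputs setup can be sketched like this. This is my own illustration, not the paper's data pipeline — the function names and the nearest-neighbour rotation are assumptions, kept numpy-only for brevity:

```python
import numpy as np

def rotate_batch(images, angle):
    """Rotate square images about their centre by `angle` radians,
    using nearest-neighbour sampling (illustrative, not anti-aliased)."""
    n, h, w = images.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # inverse rotation: for each target pixel, find the source pixel
    sy = cy + (ys - cy) * np.cos(angle) - (xs - cx) * np.sin(angle)
    sx = cx + (ys - cy) * np.sin(angle) + (xs - cx) * np.cos(angle)
    sy = np.clip(np.round(sy).astype(int), 0, h - 1)
    sx = np.clip(np.round(sx).astype(int), 0, w - 1)
    return images[:, sy, sx]

def rotating_task_stream(images, degrees_per_task, n_tasks):
    """Yield one dataset per task, each rotated a bit more than the last:
    task 0 at 0 degrees, task 1 at `degrees_per_task`, and so on."""
    for t in range(n_tasks):
        yield rotate_batch(images, np.deg2rad(t * degrees_per_task))
```

Each yielded dataset plays the role of one "task" in the continual-learning stream: the input distribution drifts while the labels stay fixed.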
So in the CIFAR case you might have ten labels from CIFAR-10, and then the next task might be another ten labels chosen from CIFAR-100, and so on — there it's actually the outputs being produced that are changing. In both cases there's distributional shift, on the label side versus the input side, and we see this catastrophic-forgetting problem in continual learning in both cases. So that's the general background.
The premise behind this paper is that neural networks in general are over-parameterised: the parameter space is extremely high-dimensional, but the inputs of the things we're trying to model sit on a low-dimensional manifold — a much smaller dimensionality living within the larger parameter space. That means the networks potentially have room to pack multiple manifolds into the same large parameter space, one for each task, without any cross-interference. So at least the possibility is there.
So what they're going to do is choose a particular context matrix for every single task, and this is fixed — these aren't learned parameters — and they're going to use these both to store updates to the weights and to retrieve the weights for a particular task. So when we're on task one, we're going to have C1, and C1 is going to take every input and rotate it, and then the weight updates are going to be, sort of, addressed into the parameters based on this context matrix. And we don't have to infer this — it's given: we know we're on task one, so we're going to be using C1, and we know that as soon as task 2 starts we have a totally different matrix, C2 — a different address, essentially.

It's sort of like an address into a parameter space — that's how I've been thinking about it. We have a huge parameter space, and the context is transforming the parameters (or the inputs) into one section, one subspace of the parameter space, that's going to be used just for that task. We're trying to subdivide the parameter space so the tasks don't interfere.
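Here is a minimal numpy sketch of that store/retrieve idea, assuming the simple ±1 (binary) contexts discussed below. All names are mine, and summing pre-bound weight vectors stands in for the paper's gradient-based storage:

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, n_tasks = 20000, 5

# one fixed random +/-1 context per task -- chosen up front, never learned
contexts = rng.choice([-1.0, 1.0], size=(n_tasks, n_params))
task_weights = rng.standard_normal((n_tasks, n_params))

# "store": bind each task's weights with its context and sum everything
# into a single superposed parameter vector
superposed = (contexts * task_weights).sum(axis=0)

def retrieve(k):
    # "retrieve" task k: unbind with the same context (c * c = 1 elementwise);
    # the other tasks survive only as zero-mean noise that shrinks relative
    # to the signal as n_params grows
    return contexts[k] * superposed

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Retrieving with the right context yields a vector clearly correlated with that task's weights, while a mismatched task's weights are nearly orthogonal to it.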
How do you choose these context matrices? They're just transformations. A really simple version of this is a binary context, where you take every element and you either multiply it by 1 or by negative 1, and you randomly choose where it's 1 and where it's negative 1. What this is doing is a rotation on the inputs — an orthogonal rotation — so it's projecting the inputs through an M-by-M matrix that acts as an address as you're learning task 2.
Right — you might actually end up with sparse representations this way, because, if I understand it correctly, for task one, if you're sending in inputs, some set of dot products will be positive and those values will be active; if you're on task 2, it's an orthogonal rotation of the weights, and it'll be a different set of values that gets activated — a different set of ReLUs.
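That intuition can be checked with a tiny numpy sketch (my own illustration, not code from the paper): the same input bound with two different random ±1 contexts drives largely different sets of ReLUs.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 1024, 512

W = rng.standard_normal((n_hid, n_in)) / np.sqrt(n_in)
x = rng.standard_normal(n_in)

# the same input seen through two fixed +/-1 contexts: h = relu(W (c * x))
c1 = rng.choice([-1.0, 1.0], size=n_in)
c2 = rng.choice([-1.0, 1.0], size=n_in)
h1 = np.maximum(W @ (c1 * x), 0.0)
h2 = np.maximum(W @ (c2 * x), 0.0)

# about half the units fire under each context, but *which* half is nearly
# independent across contexts, so the two tasks use different unit subsets
active1, active2 = h1 > 0, h2 > 0
overlap = (active1 & active2).sum() / active1.sum()
```

With independent contexts the overlap between the two active sets sits near 50%, i.e. chance level, rather than near 100%.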
I've long struggled to understand the nature of high-dimensional spaces — they're really, really weird, right? It's really hard to visualize them or think about them, especially when they're binary or very low resolution. It was only when you actually laid out this sparsity argument — which of course you could reach other ways too — that I could draw the conclusion that this is really the only way to think about these spaces so they don't collapse.
That is, in fact — so in some sense I'm saying, if you're going to solve the problem, you're going to end up there. I'm testing my own understanding; I'm not saying I don't follow your argument — I'm just saying that, coming from a different direction, it feels like it has to end up doing this.
Well, we don't want that — we wouldn't do it that way — but it's mathematically possible, and you still get the same properties. It's just that now, instead of having twenty synapses, you'd have thousands of synapses on each connection, which we wouldn't want. And that's kind of what they're doing here.
Okay, so the merit always depends on this property of neural nets: that they learn a low-dimensional input space — that the input space is low-dimensional. I think this might be a good heuristic for finding the manifold, but I'm not sure it's the only one, right? This is a demonstration of a non-sparse way of actually identifying these manifolds and then rotating them away from each other.
So it's a point in a very high-dimensional space, and you've trained it on task one, and now you're training it on task two, and you have a choice of how to update it. If you look at task 1, it may be that there's a whole bunch of directions you can move it in which would not affect task-one performance, and a whole bunch of directions you can move it in which would.
Is it possible to achieve all these results using point neurons? Basically, if point neurons did not support this — is it just harder, or is it impossible? We know the brain solves it; we know there's an elegant solution, and it's how the brain works. So really the question is: can we achieve this that way? If you can achieve all the same results that way, then fine; if you can't, that tells you something. So it's worth asking these questions and trying to keep them in mind.
And then we forget things based on heuristics — things that have to do with how recently something was used, whether you reinforced it, that sort of thing. We model that with this permanence notion, which really just reflects — yeah, it provides a combination of recency and number of times reinforced. We understand the real biology is a little more complicated, but yes, there's something there. So we just keep learning, and we don't really forget until we kind of max out. Now, obviously the brain actually does forget things on a regular basis.
Some of them are properly rotation matrices — I guess that's a subset — but in general they're just transformations: we're doing some sort of warping or rotation of the space to get each task's parameters into distant parts of the parameter space. The problem with this one is that it's essentially a dense rotation matrix, so every one of these entries is populated, meaning that you have to add n-squared parameters for every additional task you learn — and so this was inefficient.
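The arithmetic behind that complaint, as a quick sketch (the layer size is an arbitrary example):

```python
# per-task memory cost of the context for one n x n weight layer
n = 1024
dense_rotation_entries = n * n   # a full orthogonal context matrix: O(n^2)
diagonal_context_entries = n     # binary/complex diagonal context:  O(n)
ratio = dense_rotation_entries // diagonal_context_entries
# the dense rotation needs n times more stored values per task
```

That n-fold gap per task is what motivates the diagonal (binary and complex) contexts discussed next.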
We were a little bit confused by why this is so problematic, since these aren't learned parameters — they're hyperparameters that are fixed before training starts — so it still seems like this could be potentially useful; but they only briefly reported results for this one. They tried the complex and the binary versions — the three kinds of context they talk about — and studied ways of producing contexts.
I don't have a great intuition for this. The way it works is that it's a diagonal matrix, like this, and every one of the diagonal entries is a different rotation on the unit circle, so it ends up being a big transformation that is really a set of single rotations, each a phase from negative pi to pi. And then the binary version is just the same thing where you're sampling uniformly at random from negative one and one, so you just get a binary value on each axis.
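A small numpy illustration of those two context types (my own sketch; the paper's notation differs): each complex component is a unit-circle phase, binding is elementwise multiplication, and unbinding uses the conjugate.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2048

# complex context: each diagonal entry is a point on the unit circle,
# with phase drawn uniformly from (-pi, pi]
phi = rng.uniform(-np.pi, np.pi, size=n)
c = np.exp(1j * phi)

# the binary context is the special case phi in {0, pi}, i.e. c in {+1, -1}
phi_bin = rng.choice([0.0, np.pi], size=n)
c_bin = np.exp(1j * phi_bin)

# bind an input with the context, then unbind with the conjugate:
# conj(c) * c = 1 on every component, so the input comes back exactly
x = rng.standard_normal(n).astype(complex)
recovered = np.conj(c) * (c * x)
```

Because every entry has unit magnitude, the transformation is norm-preserving, which is what makes it a rotation of each axis rather than a scaling.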
Yeah, I think this is it — they were seeing pretty good, state-of-the-art results with the complex and the binary context selections. As for limitations: to me, one of the big limitations — and this is outside the scope of this paper — was that these contexts have to be predefined, and in the real world, of course, we have no idea what the context is. We don't even want to need a label for it. So it would be nice to be able to just say we're never told the task. You know.
There's nothing of their work here, nothing to share — this was just me trying to wrap my mind around how these transformations are working. The thing about this is that, instead of thinking of it as doing different transformations on the parameter space, you can look at it as: every time a training sample x comes in at time t, you actually transform x itself, and it's the same thing.
The thing I liked most was this idea of context collapse: an input x, each task with its own context — and, for example, in the tasks we're training on, T1 and T2 are actually identical, yet we're still going to get these sparse representations in totally different subspaces; it's an artifact. Yeah — instead of giving it zero degrees and two degrees, I just gave it two degrees and zero degrees, so it's actually the same exact distribution, but I told it that it needs to choose a different context matrix per task.
Think about it in terms of some real-world stream: in some cases some of the inputs change, and only the memories of what happened — the temporal memories — would get altered; the other stuff would not get affected. So again, no catastrophic forgetting, and the learning is contextual to the stuff that actually changed. You don't need context matrices, you don't need anything — it just does it automatically. And if the context drifts, similarly, it's automatically going to be that only the learning of the related stuff gets updated.
I won't explain it myself — I'll just say: okay, you take A and mix it with B, and that's it. You could think of taking context A and context B and blending them together, and then I have this new context and I can start to learn using this new context. It's like a merge — I'm using that analogy, it looks like that. So I have a starting point to learn this new task.
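One way that blending could be sketched — purely my speculation, since the paper doesn't define a merge operation — is to sum two binary contexts and snap the result back onto ±1:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4096

cA = rng.choice([-1.0, 1.0], size=n)
cB = rng.choice([-1.0, 1.0], size=n)

# merge: add the parents and snap back onto {-1, +1}; where they disagree
# the sum is zero, so a tiny random tie-breaker picks a side at random
tie = 0.1 * rng.choice([-1.0, 1.0], size=n)
cNew = np.sign(cA + cB + tie)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
# cNew agrees with each parent on about 75% of components (cosine ~ 0.5),
# so a model starting from it sits partway between the two tasks' subspaces
```

The merged context stays a valid ±1 context while remaining correlated with both parents, which is the "starting point between two tasks" intuition above.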
How would it automatically figure out the context? That's not in this paper, right — but say we use heuristics like: this part of the distribution feels like it's in a totally new area that we haven't seen before, but that part of the distribution is very similar; so we say, let's take part of this context and part of that context. But how exactly that happens is the open question.