From YouTube: ICLR 2020 Conference Recap - May 20 2020
Description
Lucas Souza does a “trip report” on the ICLR 2020 conference, which was held remotely. He focuses on papers related to neuroscience, deep learning theory, pruning and sparsity.
B
Okay, so this is a little trip report for ICLR 2020, and I'm going to talk a little bit about the conference first. So these are some of the pictures I took at the conference, in Ethiopia. I'm just joking, but that's actually how it was: it was fully online, but it was supposed to be in Ethiopia, which was a really nice move from the organizers.
B
Due
to
the
recent
visa
issues,
we
had
so
a
lot
of
researchers
from
africa.
They
were,
they
couldn't
get
into
iclear
or
new
rips
because
of
visa,
so
they
decided
to
move
the
conference
there,
which
is
nice,
but
it
was
online
anyway,
but
we
had
89
countries.
You
can
see
in
this
small
picture
over
here,
89
countries,
1400
speakers,
a
lot
of
chat
methods
and
video
watches.
So
I
I
didn't
watch
the
conference
live
so
that
week
we
were
quite
busy.
B
My general feedback is that it's not the same as going to a conference, mainly because when you're at a conference, I mean, it's very tiring. You get there at eight o'clock in the morning, you watch presentation after presentation all day long, and then at night you go to posters, but you're there, in that environment, and you just don't want to stop. You're seeing things and you're just motivated to keep going; I could do that for an entire month. But when you're doing this at home, it's very different: after two or three hours it just gets very boring, very tiring. So it's definitely not the same experience. When you're watching live, like I did with Neuromatch, it was a little bit better. So if you're going to an online conference, I'd recommend watching it live, because when you're not watching live you can stop and do whatever. I wasn't very motivated to do so; I had more stuff going on, and I'd rather code and watch the lectures later.
B
Okay, so they did this little map, which I think is very nice, with this grouping by topics, clustering by topics, and they had search capabilities: you could search by keyword or by author, and every paper had this small video. My feedback here is that the videos were usually very short; even the main talks were only like six or seven minutes. There was not a lot of information in them, and I was getting a lot more information from the reviews, from the discussions in the reviews, than I was getting from the presentations. The reviews are also very interesting to see; I mean, they can be pretty brutal at some points. So it was an interesting experience. I have three topics: one is neuroscience, not a lot of neuroscience, just whatever was there; then deep learning theory; and pruning and sparsity.
B
So, on neuroscience, there was this one-day workshop on bridging AI and cognitive science, and there were also a few papers in the overall conference, but it was not a lot. It was a lot less than at NeurIPS, where, especially last year, you could see a bigger focus on neuroscience.
B
So in this workshop the main topics being addressed were concept learning, causal reasoning, language acquisition, and learning from few data, which are general topics from cognitive science, and they had these open questions, which were the questions the papers were supposed to answer: which inductive biases do humans or animals use to support rapid learning? How can we share concepts across multiple domains? How can we have models of the world that are approximate and useful? How do memory limitations facilitate learning? And how should we represent other people's goals and intentions? These were the main open questions in the field, let's say. It was a nice workshop, long, like eight hours long, but I picked out a few papers from there that I thought would interest you. So this one...
B
We
reviewed
last
year
that
the
carlo
paper
and
then
there
are
several
follow-ups
where
he
correlates
convolution
neural
network
activations
with
neural
activity,
and
this
paper
it's
kind
of
a
continuation
on
that,
but
they
went
full
scale,
and
so
they
compare
narrow
recordings
with
over
50
different
architectures.
So
they
got
everything
because
they're
in
pythort
model
zoo
and
they
use
a
two-photon
calcium,
imaging
data
set
from
30
000
euros
in
the
mouse
visual
cortex
and
instead
of
just
doing
regular
object
classification,
they
compare
with
21
computer
vision
tests.
C
Lucas, can you remind me: what are they basically doing to make these comparisons? They have this huge amount of two-photon calcium imaging and they've got a network; what is the method by which they make the comparison?
C
It seems like a very open-ended idea, right? I mean, it's not clear to me; I guess I just don't understand it. It seems like there are so many ways you could go about that. You know, what's the animal doing? How are they characterizing the behaviors of a neuron? I mean, I don't know, it's confusing to me.
D
I mean, just making sure you get the basic paradigm: they show images to a mouse, and they show those images to a neural network. Then they look at whether you can find mappings between the neurons of the neural network and the calcium imaging data of the mouse being shown the same images.
C
Okay, so they're literally showing the kind of image data that we work with all the time, just flashed in front of the mouse's retina. Is that what they're doing?
E
So they don't train; there's no loss function on the neural networks to approximate the real neurons? Are they just running a bunch of them and seeing how close they come to mouse cortex?
D
Like, I know their previous study, where they used primate data, quite well, and what they do is look for linear mappings, where they can say: this neuron in primate V4 can be somewhat well approximated by taking a linear mapping of these five or six artificial neural network neurons. They search for these mappings, and if they can find a mapping, specifically a linear one, then they...
D
They
say
that,
like
that
means
that
these
neurons
and
this
artificial
layer
are
using
a
similar
basis
in
a
sense,
okay,
they're
they're
encoding
the
input
in
a
similar
basis,
so
they're
trying
to
figure
out
sort
of
like
what
is
the
basis
for
it
and
v4
and
for
these
static
images.
I
think
they're
they're
they're,
humble
about
the
fact.
They
know
that
they're
doing
static,
flashed
images
and
that
that's
a
limitation,
but
that's
that's
kind
of
the
general
idea.
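To make the linear-mapping idea concrete, here is a minimal sketch with entirely synthetic data (the actual study's fitting, cross-validation, and datasets are far more involved): ridge-regress one recorded neuron's responses onto a handful of ANN unit activations and score how much variance the mapping explains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 6 ANN-unit activations and one recorded neuron's
# responses to the same 200 stimuli.
n_stimuli, n_units = 200, 6
ann_acts = rng.normal(size=(n_stimuli, n_units))
neuron = ann_acts @ rng.normal(size=n_units) + 0.1 * rng.normal(size=n_stimuli)

# Ridge-regularized linear mapping: w = (X^T X + lam*I)^-1 X^T y
lam = 1e-2
w = np.linalg.solve(ann_acts.T @ ann_acts + lam * np.eye(n_units),
                    ann_acts.T @ neuron)

# Fraction of variance explained by the mapping (on the fitting data).
pred = ann_acts @ w
r2 = 1 - np.sum((neuron - pred) ** 2) / np.sum((neuron - neuron.mean()) ** 2)
```

If such a mapping explains a large fraction of the neuron's variance, the claim is that the ANN layer and the brain area share a similar basis.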
C
With rodents, we don't even know if these images are, in any remote sense, meaningful to them, right? I mean, we know that now we can classify these things; we don't know that the rodent can do that either.
C
It seems like a very brute-force approach, I guess, but it could still be very interesting. So, yeah.
C
You know, what can I do? Yeah, but you know, Mark is right: if you want to do primate studies, it's really, really difficult. I mean, there are just so many other layers of problems and issues, and you have to find the lab.
E
Isn't this problem kind of degenerate? These networks have, like, many... I don't know how many units they have, like many thousands, right, these ImageNet networks, and then, especially if you're doing a linear fit, you could find several different linear combinations of units that would randomly look like some real unit. I'm sure they do some statistics on that; I just don't really see why this is interesting.
D
I can give defenses for these things, because I've talked about their previous work, where they were really careful about that, about having the number of layers line up with specific layers in cortex. So there are answers to these questions. This is going to become a long discussion, though, if I keep answering.
A
In the context of primates it makes a little more sense to me. Even that I think is somewhat dubious, but at least it makes a little bit more sense than in the mouse.
C
Just getting past that: I need a refresher on the r-squared value, the variance explained, to know how significant these results are, what they found. When you go through that, Lucas, maybe you can just remind me what these are.
B
Yeah, so just to answer that question first: the paper says 65,000 neurons collected across the visual cortex of 221 awake adult mice, and the neural sample includes six areas of visual cortex and four cortical layers.
C
Yeah, they're probably just doing optical imaging on the surface; it only goes so deep. They can't reach the deeper layers, because the technique doesn't allow that.
A
Did they try... you said they did 220 mice; did they try to relate one mouse, or a bunch of mice, to other mice? I'm just wondering what the best you can do is.
B
All the information I have is kind of what's in the paper we're talking about here; they might release something bigger later, like a full conference paper or something like that, but that information's not there. Yeah, I don't know.
G
And they say clearly that convolutions and depth alone are not enough to explain what they're trying to explain. I'm going to say that I think we sort of jumped the gun and assumed that they were going to conclude there would be a high correlation, but their result is that there isn't one.
C
Is it possible they already had this data somehow, and someone said: hey, we already have this data, let's just crank it with a few lines of code and see what we get?
H
Hey, let's show all these mice these image datasets we use in machine learning. Which is an interesting question, why they did that, but anyway. Okay, so yes, I think the conclusion here is that there isn't a lot of correlation, and this technique may not be very successful.
E
Just to clarify: the r-squared is the square of the correlation coefficient, I think, which is easier for many people to picture. So an r-squared of 0.10 is about a 0.3 correlation, which means that if you plot your variables, x versus y, it will have roughly a 0.3 slope, basically.
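That relationship is easy to check numerically. A small sketch with synthetic data: for standardized variables, the regression slope equals the correlation coefficient, and an r-squared of 0.10 corresponds to r ≈ 0.32.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

x = rng.normal(size=n)
# Construct y so that corr(x, y) is roughly 0.3.
y = 0.3 * x + np.sqrt(1 - 0.3**2) * rng.normal(size=n)

r = np.corrcoef(x, y)[0, 1]       # correlation coefficient, ~0.3
r_squared = r ** 2                 # ~0.09, i.e. roughly the 0.10 quoted
slope = np.polyfit(x, y, 1)[0]     # for standardized variables, slope ~= r
```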
B
I haven't seen that yet. Just a quick reminder: I haven't spent a lot of time on each of these papers, like five minutes per presentation, and then I skimmed the paper, so I might not know the answers to all your questions. This paper is from the main conference, actually, not from the workshop.
B
So this is about the head direction system of the fruit fly, and they use RNNs to estimate the head direction by integrating the angular velocity. What they show is that the activity of some of the trained network's neurons is correlated with the activity of compass neurons and shifting neurons in the fruit fly. So what they're implying is that, by training a recurrent neural network on this task, you can recover the same types of neurons you'd find in an actual fruit fly.
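As a rough illustration of the integration task such an RNN is trained on (a hedged sketch, not the paper's actual setup): the input is an angular-velocity sequence and the target is the integrated heading, here encoded as (cos, sin), a common trick to keep the circular variable continuous.

```python
import numpy as np

rng = np.random.default_rng(2)

T, dt = 500, 0.1
ang_vel = rng.normal(scale=0.5, size=T)           # input: angular velocity
heading = np.cumsum(ang_vel * dt) % (2 * np.pi)   # target: integrated heading

# Encode the circular target as (cos, sin) so it stays continuous
# for the network to regress onto.
targets = np.stack([np.cos(heading), np.sin(heading)], axis=1)
```

An RNN trained to map `ang_vel` to `targets` has to maintain an internal heading estimate, which is where compass-like units can emerge.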
C
You know, look at that; that seems to be less of a stretch than the previous one. So, okay: this is a very simple network, it's just head direction cells; you're just trying to recreate a certain property that might actually be implemented in very few neurons in the brain. So, just in case you thought we were going to jump all over this one: I'm not. Okay!
B
Okay. I didn't think... I'm opening the presentation on another screen, because I had notes and the notes are not showing. Okay, cool. So this paper is a big paper; there are a lot of results in there. So this other one was in the workshop. I really liked it, just because I worked on robotics a little bit in the past. This was by Leslie Pack Kaelbling from MIT; she's a roboticist, and this was a very open and frank discussion. She approached the conference as a roboticist:
B
Like
I
don't
know
anything
about
neuroscience.
I
don't
know
anything
about
cognitive
science,
but
these
are
the
things
we've
learned
in
robotics
over
the
last
30
years,
and
that
was
a
really
nice
overall
presentation,
and
these
are
the
things
that
we
as
roboticists
want
to
learn
from
cognitive
scientists
that
can
help
us,
so
she
talks.
C
In the little diagram, it's interesting that she puts... where is it... she puts intelligent systems as sort of the outside of the box, like it's the least common thing. You know, I'm not sure what intelligent systems are in that diagram if they're not embedded in bodies or animals; it's like, what qualifies as intelligence?
C
You
know
some
people
would
say:
intelligence
is
sort
of
the
peak
of
of
something
you
know
peak
of
cognitive
ability,
and
this
sort
of
suggesting
intelligence
is
sort
of
the
opposite
of
that.
Am
I
interpreting
that
correctly.
C
Okay,
so
that's
an
important
it's
a
very
simple
idea:
yeah,
okay,
you
know
it's
interesting
because
I'm
dealing
with
this
issue
in
the
book-
and
you
know
trying
to
define
what
intelligence
is
and
there's
there's
various
people
with
different
ideas
about
this,
and
it's
interesting
to
see
this.
This
is
sort
of
almost
the
opposite
of
you
know
what
I
would
yeah,
what
most
people
might.
A
I think, going back to what Lucas said, the way to maybe interpret intelligent systems here might be as sort of passive stuff that's just getting data in and processing it; that's the most general thing. And then embedded systems are maybe embedded in the environment and maybe getting streaming data coming in.
C
Yeah,
it's
just
it's
just
interesting
that
I
think
most
lay
people
would
not.
They
could
understand
everything,
except
maybe
the
intelligent
part
being
outside.
Of
that.
You
know
it's
like
just
an
interesting
observation
that
I
think
it
grows
a
bit
counter
to
what
the
lay
person
thinks
about
what
intelligence.
H
I know. Okay, anyway, this is interesting.
B
Alright,
so
so
that
the
kind
of
questions
she
had
for
the
current
design,
so
this
was
her
workshop
presentation.
She
had
a
larger
presentation,
the
main
conference,
but
this
was
supposed
to
be.
Like
a
conversation,
I
mean
it
didn't
happen
as
it's
supposed
to
happen
because
of
you
know
remote,
but
the
question
she
she
asked
was:
what
kind
of
knowledge
are
innate,
so
what
sort
of
things
we
can
assume
and
use
as
inductive
files
in
our
models
in
robotics?
B
What corners can we safely cut? I mean, what can we ignore, what do we not need to worry about? Then, things we can learn from the brain, right: what kinds of modularity do we see in the brain that would be useful to replicate in robots? How do brains encode spatial information? That's the million-dollar question everyone is working on right now. What are the multiple scales and mechanisms of learning that we have in the brain? What are the mechanisms that animals use to stop repeating the same unsuccessful actions?
B
How
can
we
model
other
agents?
So
all
these
questions
are
questions
that
we
can
learn
from
neuroscience.
That
would
help
robotics.
So
there
are
the
questions
that
are
being
asked
at
the
same
time
in
this
book
in
these
two
disciplines
so
yeah
there
is
a
huge
gap
right
now
between
robotics
and
cognitive
science,
you
could
even
say
between
robotics
and
deep
learning
and
they're,
not
the
same
field
and
I've
done
robotics
in
the
past,
and
it's
very
different
for
machine
learning,
but
she's
trying
she's
trying
to
close
the
gap
right
just.
C
Or just... I don't know, I can share it, just share it on Slack or...
B
Okay, okay, cool. So the whole point of the trip report is showing these interesting papers, and then I put in the links and you can take a closer look. This last one I'm actually not going to talk about; I just thought it was interesting because, you know, you had that Friston paper, I think 2009 maybe, correct me, which is reinforcement learning, or active inference, and then a lot of people are working now on using active inference principles in reinforcement learning. So this...
B
This
paper
was
like
based
on
that
idea,
but
I
really
don't
want
to
go
down
the
active
inference
road
right
now,
it's
going
to
be
at
least
one
hour
discussion,
okay,
so
deep
learning
theory,
I
pick
up
three
papers
which
I
think
are
really
nice.
So
one
is
is
also
like
a
huge
effort
like
a
huge
work.
It's
called
fantastic
generalization
measures
and
where
to
find
them,
it's
a
reference
to
the
movie
fantastic
beasts
or
something
like
that,
like
the
harry
potter
movie.
B
So
what
they
did
is
they
evaluated
for
40
different
generalization
measures
over
more
than
2000
models,
simple
models
in
two
data
sets
and
the
idea
was
to
uncover
causal
relationships
between
each
generalization
measure
and
generalization
process,
so
they
wanted
to
see
which
ones
actually
had
some
relationship
with
generalization
and
they
had
some
very
interesting
findings
there.
B
One
of
them,
which
was
quite
surprising,
is
that
norm
based
measures.
They
failed
to
correlate
12
generalization.
Some
of
them
were
even
negatively
correlated
and
the
reason
why
it's
interesting
is
because
we,
the
way
we
force
our
networks
to
generalize
is
we
use
l2
regularization
right,
it's
essentially
a
norm
based
measure
and
what
they
are
showing.
Is
it's
not
actually
a
very
good.
B
It
doesn't
correlate
with
generalization
at
also
there
there's
better
things
we
should
be
using,
and
it
also
doesn't
mean
that,
just
because
one
measure
correlates
well
with
generalization
that
we
should
be
optimizing
for
it,
because
when
you
you
add
it
in
the
loss
function,
you
change,
you
completely
change
the
the
load
surface.
So
it's
gonna.
B
So
the
the
measures
that
were
better
at
correlating
with
generalization
were
sharpness
based,
so
sharpness
is
the
sensitivity
of
the
loss
over
the
entire
training
set
to
perturbations
and
model
parameters
and
optimization
based
ones
which
are
like
basically,
the
gradient
noise
and
the
speed
of
optimization.
So
this
is
a
sharpness
based
measures
are
more
difficult
to
calculate,
but
optimization
based
are
very
easy
to
calculate,
especially
when
you
have
access
to
to
how
gradients
are
being
calculated,
and
they
are
both
very
good
predictors
of
generalization,
and
I
think
this
this
this
was
a
massive
work.
B
Yeah, gradient noise is the variance, you know, the variance of the gradient between batches or between epochs: how much the gradient is...
B
...oscillating. And speed of optimization is how fast you're moving towards your local minimum, you could say. These are both measures you can take while you're optimizing, right. For the sharpness-based ones you can even take a static snapshot, and there are PAC-Bayesian bounds, but they are a little bit more difficult to calculate, because you have to apply a small perturbation and see how the loss changes. So yeah, sharpness.
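A toy sketch of the two optimization-based quantities just described, on a linear-regression stand-in for a network. The paper's exact definitions differ (its sharpness is a worst-case perturbation; random perturbations are used here only as a crude proxy).

```python
import numpy as np

rng = np.random.default_rng(3)

# Linear regression standing in for a network.
X = rng.normal(size=(512, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=512)
w = rng.normal(size=10)          # current parameters

def loss(w):
    return np.mean((X @ w - y) ** 2)

def minibatch_grad(w, idx):
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

# Gradient noise: spread of mini-batch gradients around their mean.
batches = np.array_split(rng.permutation(512), 16)
grads = np.stack([minibatch_grad(w, b) for b in batches])
grad_noise = grads.var(axis=0).sum()

# Crude sharpness proxy: how the loss changes under small random
# parameter perturbations (the paper uses a worst-case version).
eps = 0.01
deltas = rng.normal(size=(100, 10))
sharpness = np.mean([loss(w + eps * d) - loss(w) for d in deltas])
```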
B
Yeah, so that's the second point: the goal of the paper is just to measure the correlation of these measures with generalization, and it's the groundwork for other things, like: how can I use this to improve my training? It's not necessarily the case that just adding it to the loss function, like we do with the L2 norm, is going to improve things. Maybe it won't; maybe it will just change the loss surface in a way that makes it worse. But we can try to use it in other ways.
G
I was just going to ask, to bounce off the earlier questions: I imagine that for the gradient noise one there's an optimal range, sort of not too much and not too little. I don't know if you remember whether they mention anything like that?
B
No,
you
remember
that
I
don't
remember
they
mentioning,
because
they
actually
just
measure
correlation
right.
They
didn't
measure,
you
know
like
what.
What
is
it
going
to
improve
it
just
better
if
it
correlates
well
with
with
generalization
or
not,
and
it
does
correlate
well.
B
Oh, okay, did they quantify it? I would have to check, but I don't know. I've assumed that small gradient noise means better generalization, but I might be wrong; I think that's the fair assumption. Let's see.
B
So all the frameworks these days calculate the backward pass, but they make it very hard to calculate extra measures; you can do it yourself, but it's not going to be efficient. So what these guys did is build this package, and they did it in an efficient way, and it's very easy to use as well.
B
So
if
you
want
like
the
variance
of
the
gradient,
for
example,
you
can
just
this
is
how
you
use
it
and
so
this
package,
let
me
see
what
they
have,
so
they
what
they
have
right
now,
it's
individual
gradients
of
the
mini
batch
estimates
of
the
variance
or
the
second
moment,
and
they
also
have
approximates
of
second
order.
Information
like
here
the
fischer
matrix
and
yeah
so
field
stuff.
It
can
be
very
useful.
For
example,
you
usually
use
this
approximates
of
the
second
order
in
continuous
learning
about
the.
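The quantities mentioned (individual per-sample gradients, their variance, the second moment) can be illustrated by brute force on a toy linear model. This is not the package's actual API, just the arithmetic such a package computes efficiently inside one backward pass.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy linear model with squared loss; one gradient per sample.
X = rng.normal(size=(32, 5))
y = rng.normal(size=32)
w = rng.normal(size=5)

# g_i = 2 * x_i * (x_i . w - y_i), stacked into a (32, 5) array.
per_sample_grads = 2 * X * (X @ w - y)[:, None]

mean_grad = per_sample_grads.mean(axis=0)        # the usual batch gradient
second_moment = (per_sample_grads ** 2).mean(axis=0)
variance = second_moment - mean_grad ** 2        # per-parameter variance
```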
B
Okay, okay, it's not Facebook, okay. And this third one is actually quite interesting; there's a finding here that's kind of contradictory to what we think. What they show is that the early phase of training neural networks is crucial for final performance, and they show that there is a break-even point.
B
So
if
you
use
a
very
large
learning
rate
or
if
you
use
versus
using
a
small
learning
rate,
you're
going
to
go
in
different
directions
in
the
low
surface
and
what's
going
to
change,
is
that
if
you
go
in
one
direction,
you're
going
to
sgd
is
going
to
implicitly
regularize
your
network,
while
you're
going
the
other
direction
you're
going
to
go
to
a
a
bad
bad
loss
surface,
you
could
say
that
so
what
they're
showing
here
so
here
is
the
spectral.
I
think
it's
the
spectral
norm
of
the
hessian
which
shown
here.
B
If
you
take
a
smaller
numerator
you're
going
to
go
to
the
left
right,
so
here
are
two
different
examples,
and
so
the
contradictory
finding
is
that
if
you
just
increase
your
bet
size
for
example,
right
so
you're,
actually
reducing
your
effective
learning
rate,
you
get
a
larger
variance
of
the
grade.
So
you
would
expect
that
if
you
increase
the
bad
size,
you
get
the
smaller
variance
of
the.
B
You generate the adversarial examples using gradients that you get through backpropagation, and one simple defense is to obfuscate the gradient somehow. Then, to counter that, you can use what's called Backward Pass Differentiable Approximation: you find an approximation g(x) of f(x), a different function which is differentiable, so you get around the obfuscated gradient. And k-winners lets you counter...
C
Which one are we talking about here? Is this assuming that the attacker knows this or not?
C
You
have
access
yeah
yeah,
just
as
a
general
question.
Is
that
considered
an
important
problem
these
days,
given
that
you
can
easily
obscure
the
internals
of
a
network
or
is
it?
Is
it
considered
still
a
significant
problem.
B
Well,
it
is,
it
is
important,
probably
because
you
can,
I
mean
the
networks.
Everyone
is
using.
It's
almost
the
same
they're
training,
the
same
data
set.
So
it's
not
it's
not
hard
to
replicate
a
work
at
all
right.
So
there's
no
such
thing
as
a
true
black
box
attack
unless
you're
using
like
a
very,
very
widely
different
network
uranium,
like
some.
C
Topic
here,
but
wouldn't
it
be
possible
to
just
train
your
network
a
little
differently,
maybe
in
a
different
order
or
using
slightly
different
data
set
or
something
like
that
and
with
that
then
make
it
a
black
box.
B
Well,
it
will
make
a
black
box
per
se,
would
make
it
harder,
but
but
still,
but
still
you
can.
B
There are different classes of problems there, right: black box and white box. What you're asking is whether white box is relevant; I think it still is, because somehow you can get closer to it. Even if you don't get all the way, you could still easily get a model which is really close, and then it just becomes a white box, I think. Okay.
B
So the idea behind what they call k-winners is that it's a non-differentiable function that can hardly be approximated by smooth functions. They show the loss surface here; the epsilon is just small perturbations, and you can see how the loss surface changes. They show that the k-winners-take-all network has this very weird, jagged loss surface, and the discontinuities in the k-winners-take-all network can prevent gradient-based search for adversarial samples, while at the same time not hurting training.
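A minimal sketch of a k-winners-take-all activation (the paper's exact variant may differ, for example in how k is chosen per layer):

```python
import numpy as np

def k_winners_take_all(x, k):
    """Keep the k largest activations in each row and zero the rest.
    The output jumps when the k-th and (k+1)-th activations swap rank,
    which is the discontinuity that frustrates gradient-based attacks."""
    out = np.zeros_like(x)
    idx = np.argpartition(x, -k, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(x, idx, axis=-1), axis=-1)
    return out

acts = np.array([[0.1, 0.9, -0.3, 0.5],
                 [1.2, -0.7, 0.4, 0.0]])
sparse_acts = k_winners_take_all(acts, k=2)
# Each row keeps only its two largest values.
```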
B
That's
that's
a
claim
in
the
paper,
and
they
show
that
the
robustness
of
these
networks
they
outperform
other
traditional
networks
under
white
box
attacks.
B
I
thought
this
was
a
very
interesting
addition
to
the
work
we've
already
done
on
robustness.
We
mainly
used
white
knives,
but
we
thought
about
it
for
zero
attacks.
A
lot
and
this
paper
actually
goes
goes
ahead
and
shows
it
and
also
shows
it.
Theoretically.
So
there's
all
the
the
map
is
there.
C
Doesn't this go back to the idea that I first heard from Surya Ganguli, which is that it looks like there might be a lot of local minima, but in high-dimensional spaces there really aren't? It can look as bad as you want, but there's always going to be a gradient that will take you where you want to go.
A
Yeah
yeah,
I
think,
it'll
look.
Potentially
it
could
look
much
better
in
high
dimensions
than.
B
I
think
maybe
we
can
even
do
like.
Maybe
a
journal
club
on
that
one
day
might
be
useful.
B
Okay,
so
this
next
paper
we
actually
did
the
journal
club
on
it
last
october,
at
the
time,
was
archive
and
then
open
review.
Now
it
was
an
iclear
and
I
included
here
because
it's
something
I
think
we
can
actually
use
even
in
the
current
dynamic
sparsity
models,
we
we
have
so
what
they
do
here
is
they
have
this
dynamic,
sparse
model
which
is
like
magnitude-based
pruning?
B
It's
very
similar
to
the
one
ching
did
you
know
ching
from
cerebral
we
talk
about,
but
he
adds
this
extra
extra
thing
here
is
that
he
keeps
true
network
right.
He
he
keeps
the
the
prune
weights
the
mask
and
the
regular
weights
and
what
he
does
at
every
step
at
every
epoch.
He
computes
the
mask
based
on
the
actual
weights,
and
then
he
computes
the
gradients
based
on
the
prune
network
and
updates
just
the
way
it
baits
on
the
prune
network.
B
So
the
next
time
he
prunes
he's
gonna
prune
on
the
full
weight.
So
if
there
is
any
weights
that
has
like
a
large
change
and
then
they
they're
going
to
be
included
back
in
the
mask,
so
he
always
keeps
a
copy
of
the
full
weights
and
they
can
go
back
and
forth.
They
can
be
removed
from
the
mask
or
be
included
in
the
mask
right,
so
the
gradients
are
always
calculated
only
based
on
the
prune
mask,
but
the
updates
are
done
to
the
full
weights
and
the
mask
is
always
calculated
on
the
full
weights.
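A hedged sketch of that loop on a toy problem (the toy task and names are mine, not the paper's): the mask is recomputed from the full weights, the gradient is taken at the pruned weights, and the update goes to the full weights, so pruned-away weights can re-enter the mask later.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy regression task with a sparse ground truth.
X = rng.normal(size=(256, 20))
w_true = rng.normal(size=20) * (rng.random(20) < 0.3)
y = X @ w_true

w_full = 0.1 * rng.normal(size=20)   # dense weights, always kept around
sparsity, lr = 0.5, 0.01

def magnitude_mask(w, sparsity):
    # Keep the largest-magnitude half of the weights.
    k = int(sparsity * w.size)
    return np.abs(w) >= np.sort(np.abs(w))[k]

mask = magnitude_mask(w_full, sparsity)
init_loss = np.mean((X @ (w_full * mask) - y) ** 2)

for step in range(300):
    # 1. Mask is recomputed from the *full* weights every step.
    mask = magnitude_mask(w_full, sparsity)
    # 2. Gradient of the loss evaluated at the *pruned* weights.
    w_pruned = w_full * mask
    grad = 2 * X.T @ (X @ w_pruned - y) / len(y)
    # 3. Update the *full* weights, so pruned-away weights keep
    #    receiving gradient and can re-enter the mask later.
    w_full -= lr * grad

final_loss = np.mean((X @ (w_full * mask) - y) ** 2)
```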
B
So
this
is
a
technique
that
could
be
included
in
any
dynamics.
Varsity
work,
we
do,
it
will
be
additive
to
it
and
they
actually
showed
really
good
results.
So
they
showed
improvements
over
other
techniques.
For
example,
he
used
the
same
technique
that
chin
used,
but
using
this
these
two
like
keeping
a
copy
of
the
regular
weights
and
he
can
get
improvement
over
that.
So.
I
Just... what does the hysteresis look like for that? I mean, how much retention actually goes on when they use this scheme, compared to weights going out and being removed again on the next pass?
I
Okay, hysteresis basically means: if the system were highly reactive, it would immediately flip, but with hysteresis you hang on to the previous values for some period of time before it actually switches over.
B
Okay, I don't know; it's a good question. I'd definitely like to know, but...
B
Don't
know
if
they
did
this
kind
of
analysis
in
the
paper.
I
don't
remember
that,
but
yeah
it
would
be
good
to
see
right.
So
maybe
we
can
even
incorporate
this
in
regard.
I
don't
know,
michael
andre,
you
can
look
at
it
later.
G
Yeah, that's what I was thinking about while you were talking about it. I think it should be orthogonal to RigL, but I just want to be sure about that. Well, maybe.
B
Yeah, okay, all right. So, it's eleven o'clock; okay, so the next one, maybe something we can also use. There is this paper from 2019 called SNIP; I don't know if we discussed it. The idea is that you can do the pruning prior to training: you can adjust your network in such a way...
B
So
it's
going
to
be
easier
to
train,
so
what
this
work
does
in
addition
to
that,
so
it's
need
to
something
that's
been
out
there
we
can
use.
So
the
idea
here
is
that
you
could.
B
Okay,
so
what
he's
doing
here
is
that
for
all
layers
he's
trying
to
minimize
so
here's
the
forbidden's
norm-
and
here
the
c
is
the
mask:
w-
is
the
weight
so
there's
an
inner
product
minus
the
thing
the
forbidden
norm
of
this,
and
the
idea
is
that
you
make
a
you.
Try
to
minimize
this
in
order
to
improve
yeah,
improve
the
your
topology
and
make
it
more
trainable.
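The objective as described can be sketched like this (the paper's exact formulation may differ in details such as normalization): an orthogonal layer has zero penalty, and pruning it breaks the isometry.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 64

def isometry_penalty(C, W):
    # || (C * W)(C * W)^T - I ||_F for mask C and weights W,
    # where * is the elementwise product.
    M = C * W
    return np.linalg.norm(M @ M.T - np.eye(M.shape[0]), ord='fro')

# An orthogonal weight matrix is a perfect isometry: penalty ~ 0.
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))
full_penalty = isometry_penalty(np.ones((n, n)), Q)

# Pruning half of its entries breaks the isometry: penalty grows.
C = (rng.random((n, n)) < 0.5).astype(float)
pruned_penalty = isometry_penalty(C, Q)
```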
B
I
I
had
this
on
my
head
in
monday,
but
I
actually
forgot
to
do
this,
but
what
this
paper
is
pointing
to
is
that
there
is
future
work
where
he
wants
to
incorporate
this
into
the
training
itself.
So
the
idea
is
that
every
step
when
you're,
using
sparse
networks
or
proof
networks,
you
could
at
the
same
time
be
trying
to
improve
your
signal
propagation
because,
as
you
prune,
it
you're
you're,
going
to
make
it
first
right,
you're
going
to
make
you're
going
to
break
this
dynamic
zombie
tree.
B
Even
if
you
ensure
that
initialization,
as
you
prune
you're,
going
to
remove
some
weights
you're
going
to
add
others
you're
going
to
break
this
and
you
could
keep
trying
to
restore
signal
propagation
at
every
apple.
So
this
is
the
idea
he's
pointing
to
it's,
not
where
this
paper,
what
this
paper
did.
Yet
he
what
he
did
is
just
did
this
at
initialization,
but
he
includes
there
as
future
work
and
next
step
as
to
use
this
during
training
itself.
B
So
the
reason
I
included
this
here
is
because
of
it's
something
we
can
do
right.
We
can
use
some
sort
of.
We
can
do
this
analysis
at
initialization
and
even
if
we're
dynamically
pruning,
we
can
make
sure
that
when
we
start
training
we
already
have
the
best
network
we
could
have
in
the
perspective
of
signal
propagation
and
then
at
some
steps,
some
bad
books.
We
could
recheck
this
and
we
could
adjust
our
weight
somehow
our
connection,
somehow
as
to
ensure
that
we
still
have
a
network
in
which
the
signal
propagation
works
well,.
A
I thought they do that, though.
G
Like, I thought they sort of... I think they end up with some term that's layer to layer, but I think they derive it considering the whole network.
I
Because I was thinking: you basically help the signal propagation between two layers, but for whatever reason you prune things so that it gets blocked at the next layer. So it seemed like there would be some disconnect there, unless you had some overall guidance as to where the salient paths are through the whole network.
B
Didn't
maybe
I
didn't
walk
through
the
full
derivation
until
then
and,
like
michelangelo
said,
they're,
just
compounding
these
things
and
I'm
doing
it
end-to-end,
I'm
not
sure.
Actually,
because
I
didn't
went
that
deep
into
this
paper.
It's
like
last
minute
edition.
B
So the idea is that at some point you're just going to find a ticket, and it's not going to change much even if you keep training, so you can just stop there. And it's a very simple technique, actually: they apply the pruning at each step. So they calculate the mask at each step, and then they keep it in a queue, and they calculate the mask distance between the current network and the last networks, and if, after a few steps, that mask didn't change a lot, then you just return it.
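A minimal sketch of that early-stopping criterion (names and thresholds are my own; the actual Early-Bird implementation differs in details): compute a magnitude-based mask each epoch, keep the last few masks in a queue, and stop once the normalized Hamming distance to all of them falls below a small epsilon.

```python
from collections import deque
import numpy as np

def magnitude_mask(w, keep_ratio):
    """Binary mask keeping the top `keep_ratio` fraction of weights by |w|."""
    k = int(w.size * (1.0 - keep_ratio))
    thresh = np.sort(np.abs(w), axis=None)[k - 1] if k > 0 else -np.inf
    return np.abs(w) > thresh

def mask_distance(m1, m2):
    """Normalized Hamming distance between two binary masks."""
    return float(np.mean(m1 != m2))

def found_ticket(mask_queue, new_mask, eps=0.02):
    """Declare the ticket found when the new mask is within `eps` of every
    mask in the (full) queue of recent masks."""
    return len(mask_queue) == mask_queue.maxlen and all(
        mask_distance(m, new_mask) < eps for m in mask_queue
    )

# Toy run: weight updates shrink over "epochs", so the mask stabilizes
# and the loop stops well before the full 50 epochs.
rng = np.random.default_rng(1)
w = rng.normal(size=1000)
queue = deque(maxlen=5)
stop_epoch = None
for epoch in range(50):
    w += rng.normal(scale=1.0 / (epoch + 1) ** 2, size=w.shape)
    m = magnitude_mask(w, keep_ratio=0.3)
    if found_ticket(queue, m):
        stop_epoch = epoch
        break
    queue.append(m)
```

Stopping as soon as the mask stabilizes is exactly what lets you skip the rest of the training run, which is where the compute savings mentioned next come from.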
B
That
means
it's
not
going
to
change
a
lot,
even
if
you
keep
on
training.
So
it's
a
simple
idea,
but
just
by
doing
that
it
could
replicate
the
results
of
the
lottery
ticket
using
six
to
ten
percent
of
of
the
calculation
they
use
in
the
original
paper
with
the
simple
idea,
and
they
also
showed
that
this
works
even
under
low
cost
training
scheme.
So
if
you
use
low
precision,
precision
training,
for
example,
just
eight
bits-
training
which
are
even
faster,
you
can
still
find
the
same
mass,
so
you
can
make
it
even
faster.
B
So
it's
very
empirical
work,
but
they
got
good
results
and
the
next
one
is
by
the
same
motors,
actually
frank
on
carbon
plus
render
who's
the
first
author
and
what
they
did.
They
they
extended
the
idea
of
the
lottery
tickets.
The
lottery
ticket
says
you
could
rewind
the
ways
to
their
original
values
and
what
they
investigated
here
is
they
tried
rewinding
their
weights
to
intermediate
values
as
well.
They
showed
that
it's
better,
you
don't
have
to
rewind
it
all
the
way
you
can
rewind
it
just
a
little
bit
so
they
introduced.
B
this notion of, you know, not resetting the weights but just rewinding them a little bit. And then they also tested just rewinding the learning rate instead of the weights, so you just go back in your learning rate schedule a few steps. And they showed in the end (the orange curves are learning rate rewinding and the blue ones are weight rewinding) that if you just rewind the learning rate, it's enough, so you don't even have to rewind the weights.
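One way to see the three retraining strategies side by side (a schematic sketch; the function names and the schedule are mine, not from the paper): each strategy is defined by which weights you keep after pruning and which part of the learning-rate schedule you replay.

```python
def finetune(final_w, ckpt_w, lr_schedule, k):
    """Keep the final weights; keep training at the last, small learning rate."""
    steps = len(lr_schedule) - k
    return final_w, [lr_schedule[-1]] * steps

def weight_rewind(final_w, ckpt_w, lr_schedule, k):
    """Lottery-ticket style: reset surviving weights to the step-k checkpoint
    and replay the schedule from step k onward."""
    return ckpt_w, lr_schedule[k:]

def lr_rewind(final_w, ckpt_w, lr_schedule, k):
    """Keep the final weights, but replay the learning-rate schedule from
    step k anyway; no weight checkpoint is needed."""
    return final_w, lr_schedule[k:]

# A 9-step schedule with two decays; "rewind" to step 3 brings back the
# higher 0.01 learning rate that fine-tuning never revisits.
schedule = [0.1] * 3 + [0.01] * 3 + [0.001] * 3
```

The contrast the speakers draw is visible here: fine-tuning stays at the final small rate, while both rewinding variants bring the higher learning rate back, and learning rate rewinding does so without storing any earlier weights.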
B
Is different from the fine-tuning. So the worst one they're comparing against is fine-tuning, because what they did in pruning before is that after you prune, you keep training, but you keep training with smaller learning rates. I mean, the idea is that you just fine-tune the network: you had a good network, but you made some changes, and then you fine-tune it. But they
A
They show here, actually, that if you increase the learning rate again and go back down...
G
Sorry, does it show that you effectively need a smaller number of epochs in the retraining when you just rewind the learning rate, as opposed to rewinding the weights? I presume that would be the case. Like, let's say before, if you rewind the weights, maybe you would take like 100 epochs to get back to that original accuracy; but if you rewind the learning rate, does it take fewer epochs?
B
I'm not sure. Michelangelo, maybe?
B
I actually don't think so. If you look at the pruning algorithm here, they train for the original training time. So: train to completion, prune the 20% lowest-magnitude weights, retrain using learning rate rewinding for the original training time, and they just keep doing this until you get to the sparsity that you want. But...
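The loop he reads out can be sketched roughly like this (a toy NumPy sketch; `toy_train` is a stand-in for a real training run, and all names are mine):

```python
import numpy as np

def toy_train(w, mask, lr_schedule, rng):
    """Stand-in for a real training run: nudge surviving weights each step."""
    for lr in lr_schedule:
        w = (w + lr * rng.normal(size=w.shape)) * mask
    return w

def prune_lowest(w, mask, frac=0.2):
    """Zero out the `frac` fraction of smallest-magnitude surviving weights."""
    alive = np.flatnonzero(mask)
    drop = alive[np.argsort(np.abs(w[alive]))[: int(len(alive) * frac)]]
    new_mask = mask.copy()
    new_mask[drop] = 0.0
    return new_mask

def iterative_lr_rewinding(w, lr_schedule, target_sparsity, rng, rewind_step=0):
    """Train to completion, prune 20% of the remaining weights, retrain with
    the learning-rate schedule replayed from `rewind_step`, and repeat until
    the target sparsity is reached."""
    mask = np.ones_like(w)
    w = toy_train(w, mask, lr_schedule, rng)  # original training run
    while mask.mean() > 1.0 - target_sparsity:
        mask = prune_lowest(w, mask, frac=0.2)
        w = toy_train(w * mask, mask, lr_schedule[rewind_step:], rng)
    return w, mask

rng = np.random.default_rng(2)
schedule = [0.1] * 5 + [0.01] * 5
w_final, m_final = iterative_lr_rewinding(
    rng.normal(size=500), schedule, target_sparsity=0.8, rng=rng, rewind_step=2
)
```

Each retraining pass runs for the full (replayed) schedule, which matches the point being made: per round it is not cheaper in epochs, it just removes the need to store and restore early weight checkpoints.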
B
All right. So I thought I had to mention the lottery ticket just because, you know, it was the best paper last year. It had a lot of issues, but it also brought some new perspective on the pruning
B
approach. I had a "cool random stuff" section, which, yeah, I thought I wouldn't have time for, and I was right, so I'll just talk very quickly about it. There are some papers on using deep learning for mathematics, and especially, I think, the one I liked was Deep Learning for Symbolic Mathematics; they were using it for things like symbolic integration.
B
They
had
this
bold
paper
called,
can
quentin
algorithms
for
deep
conclusion
or
networks
where
they
showed
a
possible
approach
that
could
be
used.
I
say
this
with
care
because
we
know
that
you
know
there
is
no
algorithm
right
now
that
we
know
of
that
would
work
better
in
quantum
computers
than
regular
architecture
architectures,
but
they
they
show,
they
show
a
step
towards
it,
and
it
was
interesting
to
see
that
someone
is
working
on
the
problem
and
yeah
there's
some
other
papers
that
won't
go
there,
so
yeah
that's!
B
Maybe from the number of authors per paper; I'd guess like 300 to 400 papers. That's my guess!
C
I mean, it's interesting what you say. Yeah, I understand what you're saying: when you're in a conference and you're physically there, you're sort of forced to keep going at it. But on the other hand, I find, when there's so much data, you don't know how to sort through it.
D
I think another factor is that when you go to a conference, you put it on your calendar, you are out of the office, you are mentally engaged in that conference. No matter what search process you're using, having that blocked-off time, where you really are mentally there, is so valuable.
C
Yeah, I agree. It's like: okay, I'm dedicating 100% of my time, versus the time I'm spending might be more fruitful in an offline search; there's a balance between those two, I agree with that. And sometimes you sit through a presentation and you just don't think it's got any relevance at all, and then all of a sudden, you know, sort of at the end of it, it's like: holy crap, look at that, that was a great idea, I'm glad I heard that. It's just interesting how difficult it is.
C
I'm just going through... so, you saw it: I sent out that paper this last week, or this weekend or something, about the layer six paper. Remember that? Yeah, and it was like a layer six anatomy paper, and I was going through it and there were like dozens and dozens of references on layer six. My head was swimming; it's like, oh my god, these are all new references, they're all new papers, so many different things, and it just makes...
C
Sometimes it's just overwhelming and you just have to sort of give up, in some sense: at the moment I don't know where to begin here; I'll have to come back another day. It's just an interesting problem we have: there's so much data, both on the neuroscience side and on the machine learning side, that it's difficult to figure out the big picture of something.
B
Anyway, that's a good summary, thank you. I think there's a third component there, which is: I used to run, I used to run half marathons, and when I was training for it, I was training every day, and I could run like max five miles, six miles, and I was just exhausted, because I was training by myself. But when I was at the marathon, you know, I could just keep running past the finish.
A
The question I have for, you know, these online things is: can they reproduce the social aspects of it? And, you know, all the different ways... What I find going to these conferences is that the talks that I have with people, those are as valuable as listening to the papers themselves. And, you know, can they really effectively reproduce these kinds of social and, you know, in-person communication aspects of it?
C
Well, I thought you said earlier they had like separate chat sessions for each poster. Is that right? So...