From YouTube: 2019-04-12 - Gerald Friedland - Sizing Neural Network Experiments

Description

NERSC Data Seminars: https://github.com/NERSC/data-seminars

Abstract: Most contemporary machine learning experiments treat the underlying algorithms as a black box. This approach, however, fails when trying to budget large-scale experiments or when machine learning is used as part of scientific discovery and uncertainty needs to be quantifiable. Using the example of neural networks, this talk presents a line of research enabling the measurement and prediction of the capabilities of machine learners, allowing a more rigorous experimental design process. The main idea is to take the viewpoint that memorization is worst-case generalization. My presentation consists of three parts. Based on MacKay's information-theoretic model of supervised machine learning (MacKay, 2003), I first derive four easily applicable engineering principles to analytically determine the upper-limit memory capacity of neural network architectures. This allows the efficiency of different architectures to be compared independently of the task. Second, I introduce and experimentally validate a heuristic method to estimate the neural network memory capacity requirement for a given learning task. Third, I outline a generalization process that successively reduces capacity, starting from the memorization estimate. I conclude with a discussion of the consequences of sizing a machine learner incorrectly, which include a potentially increased number of adversarial examples.
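
The two quantities the abstract contrasts, an architecture's upper-limit memory capacity and a task's memorization requirement, lend themselves to a back-of-the-envelope comparison. The sketch below is not the method from the talk; it is a minimal illustration under two stated assumptions: (a) the capacity of a fully connected network is approximated by counting trainable parameters per neuron in bits (fan-in weights plus one bias), and (b) the worst-case (memorization) requirement of a labeled dataset is the number of examples times the entropy of the label distribution in bits. The function names and the exact accounting rules are illustrative assumptions, not Friedland's published formulas.

```python
import math
from collections import Counter


def mlp_capacity_bits(layer_sizes):
    """Crude upper bound on the memory capacity of a fully connected net.

    Assumption (not the talk's exact rule set): each neuron can store at
    most as many bits as it has trainable parameters (fan-in weights plus
    one bias), and neuron capacities add up across the network.
    """
    capacity = 0
    for fan_in, width in zip(layer_sizes[:-1], layer_sizes[1:]):
        capacity += width * (fan_in + 1)  # one bit per parameter
    return capacity


def memorization_requirement_bits(labels):
    """Worst-case capacity a task demands, assuming the network must be
    able to reproduce every training label: roughly
    n_samples * entropy of the label distribution, in bits.
    """
    counts = Counter(labels)
    n = len(labels)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return n * entropy


if __name__ == "__main__":
    # A 784-100-10 classifier vs. a balanced 10-class, 60,000-example task
    # (hypothetical numbers chosen only to make the comparison concrete).
    capacity = mlp_capacity_bits([784, 100, 10])
    labels = [i % 10 for i in range(60_000)]
    need = memorization_requirement_bits(labels)
    print(f"architecture capacity      ~ {capacity:,} bits")
    print(f"memorization requirement   ~ {need:,.0f} bits")
    print("can memorize (under these assumptions):", capacity >= need)
```

Under these simplified rules the example network falls well short of the requirement, which is the kind of mismatch the talk's sizing principles are meant to expose before an experiment is budgeted.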