Description
I just present some observations that will be helpful if you want to dive in deeper someday. Most networks / objective functions can be translated into the language of variational inference, and doing so often provides useful insights. I'll show an example: how Gaussian dropout can be described in this language, and how this tells us something interesting about quantization. (This observation comes from the variational dropout paper: http://papers.nips.cc/paper/5666-variational-dropout-and-the-local-reparameterization-trick)
Oct 18, 2019
What I'm talking about here isn't new to researchers; plenty of people know what I'm going to point out, but it's something that a lot of people just know quietly, with an asterisk. It's a neat fact about variational inference, and it's linked to a lot of the ways we train deep networks. The first half of this isn't my own material: the first part comes from the paper Practical Variational Inference for Neural Networks, by Alex Graves (2011).
Variational inference as a taxonomy for training deep networks. I don't use the word taxonomy very often, but it's useful here. When somebody talks about a taxonomy, the example they'll often use is that the periodic table is a taxonomy for the elements, and the discovery of the periodic table arranged everything in a way where we were able to start looking for the gaps in between.
So you're given a set of inputs and a set of labels, and you're learning from that. You can think of training a neural network as being given a set of inputs and labels, where X is the set of inputs and Y is the labels for them, and inferring a probability distribution over what the weights for that model might be.
If your model outputs probabilities, if what it does is take inputs and say this is a coffee cup with probability a, it's a marker with probability b, and so on, then you can invert this probability distribution to figure out what weights would have been best at giving those classifications. In doing that, you either implicitly or explicitly have some prior over the weights: you have some initial estimate of them, or some notion of which weights are more likely to be correct than others.

So this is where the word regularization comes in, this notion of having the idea that, say, large weights are less probable than small weights, or that networks are sparse; all sorts of priors can be encoded into this term. But the point is, if you can explicitly say what your prior on the weights is, and if your model outputs a probability distribution like this, then you can combine the two into a distribution over the weights given the data. That's the idea.
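As a rough sketch of that relationship, in my own notation rather than anything taken from the slides: the model's output probabilities play the role of the likelihood, the prior over the weights is the regularizer, and the thing being inferred is a posterior over the weights.

```latex
% Posterior over weights w given inputs X and labels Y (notation assumed, not from the talk):
p(w \mid X, Y) \;=\; \frac{p(Y \mid X, w)\, p(w)}{p(Y \mid X)}
% p(Y | X, w) is what the classifier outputs, p(w) is the prior over the weights,
% and the denominator is intractable, which is what motivates variational inference.
```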
That joint distribution is this big thing: it assigns a certain probability to one weight matrix as a whole, and then it assigns a different probability to a different matrix as a whole. It's a big, complicated object that you can't visualize. In variational inference you take some much simpler probability distribution, one that you can visualize, and you tune it. That's the core thing you're doing here, and I just wanted to make this seem simple.
With a neural network, the log likelihood is your error function, so this is very much like training a neural network. And if you also want to minimize the divergence from your prior over the weights, that's a regularization term: how different are the weights I arrived at from my prior distribution over the weights?
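Written out, the objective being described looks like the standard variational free energy (again my notation, not a formula quoted from the talk): an expected error term plus a divergence-from-the-prior term.

```latex
% Variational objective over an approximate posterior q(w) (assumed notation):
\mathcal{L}(q)
  \;=\; \underbrace{-\,\mathbb{E}_{q(w)}\!\big[\log p(Y \mid X, w)\big]}_{\text{expected error (negative log likelihood)}}
  \;+\; \underbrace{D_{\mathrm{KL}}\!\big(q(w)\,\|\,p(w)\big)}_{\text{divergence from the prior (regularizer)}}
```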
It's not like you have actually solved how to classify these objects, because your prior might not be right, or some other part of how you've set up the problem might not be right. So that brings me to the next point: you have two decisions to make when you perform variational inference, two high-level decisions.
I'll define this, since I'm sure a lot of people have seen it: a delta distribution is sort of a cheeky way of saying, hey, you want a probability distribution, I'm going to give you a delta. It's a way of providing a non-distribution; a way of saying, okay, you want a probability distribution, so I'm going to put all of the probability mass on one specific set of weights. It's not a proper distribution.
That's like performing gradient descent on your entire data set; it is how we normally train networks, the basic way of doing it. If you're using negative log likelihood as your loss function, which we usually do, if you're using softmax, that's typically what you do. That's basically saying, well, there's no uncertainty about the weights at all.
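Informally (the KL term to an exact delta isn't finite, so treat this as the usual hand-wavy limit, in my own notation): plugging a delta posterior into the objective above collapses the expected log likelihood to the ordinary loss, and what remains of the prior term is a plain regularizer, which is exactly standard (MAP-style) training.

```latex
% With q(w) = \delta(w - \hat{w}), the objective reduces (up to constants) to
\mathcal{L}(\hat{w}) \;\approx\; -\log p(Y \mid X, \hat{w}) \;-\; \log p(\hat{w}),
% i.e. negative log likelihood plus a weight penalty
% (for a Gaussian prior, the familiar L2 / weight-decay term).
```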
But even in those cases, people tend to initialize the weights with a uniform distribution, so you're initializing your weights with something that doesn't match your prior on the weights, which is kind of bizarre. People still do this by default: whatever the standard initialization scheme is, it's uniform, and that seems kind of bizarre.
Now, briefly, I'll point out that there's a whole other set of methods available to us: you can now put a more complicated distribution in here, and a whole set of other possibilities opens up. The Graves paper about variational inference is kind of where I got all of these ideas; it boiled all of this down onto things we've already been doing.
One of the ways I had thought of dropout was that, by removing certain inputs, it basically made the whole network more robust against failures, in a kind of hard, discrete fashion. I'm not sure I can relate that to this. Yes, and here you achieve that by making the noise distribution Gaussian.
That paper came from one of those authors who does a lot of these papers on variational inference. The second thing it pointed out is that, if you do something like this, where every weight has a mean and a variance, one trick for computing this is to take those variances and, rather than having the weights be noisy, have the units themselves be noisy.
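That is the local reparameterization trick from the variational dropout paper. Here is a minimal sketch of it in NumPy (my own code and naming, not anything from the talk): for independent Gaussian weights, each pre-activation is itself Gaussian, so you can sample the noise on the activations instead of on the weight matrix.

```python
import numpy as np

def noisy_layer(x, w_mean, w_var, rng):
    """Local reparameterization trick (sketch).

    Instead of sampling a noisy weight matrix and multiplying, sample noise
    directly on the pre-activations: with independent Gaussian weights, each
    pre-activation has mean x @ w_mean and variance (x**2) @ w_var.
    """
    act_mean = x @ w_mean            # mean of each pre-activation
    act_var = (x ** 2) @ w_var       # variance of each pre-activation
    eps = rng.standard_normal(act_mean.shape)
    return act_mean + np.sqrt(act_var) * eps

# Tiny usage example with made-up shapes.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32))         # a batch of 8 inputs
w_mean = rng.standard_normal((32, 16))   # per-weight means
w_var = 0.1 * np.ones((32, 16))          # per-weight variances
h = noisy_layer(x, w_mean, w_var, rng)   # (8, 16) noisy pre-activations
```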
Your standard way of training a neural network with dropout added is, in a sense, performing variational inference where every weight has a mean, but it also has a variance, and the variance is held fixed. It's fixed in a special way, though: the weight's variance increases as the weight itself increases.
If the weights are small, they have a lower variance; if they're larger, they can vary over a larger range. I'll just go in order here. So the implication (I'll talk about the prior in about twenty seconds) is this: networks with dropout are roughly equivalent to networks with Gaussian weights, where the standard deviation, or the variance, is proportional to the weight itself, to the mean of the weight.
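To make that concrete, here is a small sketch (my own code, not the speaker's) showing the two views side by side: multiplicative Gaussian dropout on a weight, versus a Gaussian weight whose standard deviation is proportional to its mean. For the same alpha they are the same distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, -2.0, 3.0])   # weight means (made-up values)
alpha = 0.25                          # noise level; for Bernoulli drop rate p, alpha = p / (1 - p)
n = 100_000

# View 1: Gaussian dropout -- multiply each weight by 1 + sqrt(alpha) * eps.
eps = rng.standard_normal((n, theta.size))
w_dropout = theta * (1.0 + np.sqrt(alpha) * eps)

# View 2: Gaussian posterior -- mean theta, standard deviation sqrt(alpha) * |theta|.
w_gauss = theta + np.abs(theta) * np.sqrt(alpha) * rng.standard_normal((n, theta.size))

# Both have mean ~theta and standard deviation ~sqrt(alpha) * |theta|.
print(w_dropout.std(axis=0))
print(w_gauss.std(axis=0))
```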
You're holding fixed one noise constant that is shared across everything. If you choose to do an optimization where you hold this alpha completely constant, then this term here, the divergence term, never changes while you're optimizing; you're only optimizing the other term. And your alpha is essentially choosing how precisely you're encoding your weights.
Because this is set up in this way, where the variance just automatically scales with the mean, that is how you get this property: changing the means, changing the weights, does impact this term, but it doesn't impact that one. That property holds when you have the log-uniform prior, and that's the only distribution for which the property holds.
The conclusion, or I guess what this kind of points at (I don't know the right way to say this), is that if this method that works empirically implicitly assumes this prior, that implies an optimal encoding scheme for the weights. Here I'm talking about how you encode weights, using integers or floating point or whatever: such a scheme is going to distribute its binary codes uniformly across this line, which is a log scale. And, conveniently, floating point does exactly that. That's what floating point does.
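One quick way to check that (my example, not the speaker's): the gap between adjacent floating-point values grows in proportion to the magnitude, so the representable values are spaced roughly evenly on a log scale.

```python
import numpy as np

# Spacing between adjacent float32 values near different magnitudes.
for x in [1e-3, 1.0, 1e3, 1e6]:
    gap = np.spacing(np.float32(x))   # distance to the next representable float32
    print(f"x = {x:>9}: spacing = {gap:.3e}, relative spacing = {gap / x:.3e}")

# The relative spacing stays between about 2**-24 and 2**-23 at every magnitude:
# roughly constant relative precision, i.e. codes spread roughly uniformly in log space.
```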
What floating point is capturing is that you can train networks with different alphas to get different precisions. And, just very briefly, I'll point out that this paper then goes on to full variational dropout, where you're allowed to train all of these parameters: you can learn them all, and you can go even further. You can have a different alpha for every synapse, for every weight, and that's all possible. And the second, later paper, I forgot to...
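As a tiny extension of the earlier sketch (again my own code and naming, not anything from the paper or talk): giving each weight its own learnable noise level just means carrying a per-weight alpha alongside the per-weight mean, with the variance tied to the mean as alpha * theta^2.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal((32, 16))        # per-weight means (learnable)
log_alpha = np.full((32, 16), np.log(0.1))   # per-weight noise levels (also learnable)

def sample_weights(theta, log_alpha, rng):
    """One posterior sample in the variational-dropout style: w ~ N(theta, alpha * theta**2)."""
    alpha = np.exp(log_alpha)
    return theta + np.abs(theta) * np.sqrt(alpha) * rng.standard_normal(theta.shape)

w = sample_weights(theta, log_alpha, rng)    # a sampled weight matrix
```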
I'm just wondering how you would transform this, because what this is arguing is that floating point is essential in training, at least in these classification settings and these applications of quantization. I'm just kind of curious; I'm not claiming anything advanced here, it was just an observation that there might be some transformation of the representation where you change them.
So it does decay to zero at a finite distance, and there are bumps: if you look at these distributions, strong synapses acquire a specific weight and cannot grow beyond that because of homeostatic effects in exactly how that is defined. Inhibitory weights, though, tend to include a lot more of these very strong weights that we can throw away, so I'm thinking there is actually probably a gap between that and the distribution here.
It seems to me that if you were to pick the bits in a floating-point representation randomly, you would get this property. Yeah, but that's not what we do in typical floating-point math, and typically, the way it's used in a deep learning system, it's not exploiting this prior. Oh, so it's not like, just by magically using floating point, we're going to be in this regime; you still have to explicitly take that into account.
Don't you, though? Because we're drawing uniformly between zero and one or whatever it is, we're not actually picking bits in the floating-point representation randomly, and that's what you would need to get this property, right? And when we do the math, we're fighting against this property. Right, so the fact that we're using a floating-point representation does not, on its own, make this any easier. It's...