From YouTube: Quantization in Neural Networks - May 27, 2020
Description
Subutai gives a basic overview of Quantization in Neural Networks, and then reviews the paper “And the Bit Goes Down: Revisiting the Quantization of Neural Networks” by Stock et al., 2020.
http://arxiv.org/abs/1907.05686
I have a couple of really basic things on the background of quantization and why it's important, a little bit on how you even think about quantizing a floating-point number, and then a couple of neural-net-specific things: how do we think about quantizing a neural network, and what are some of the approaches? That'll be a very high-level overview, and then I'll go into one example paper, a state-of-the-art paper that literally just came out very recently. It's a nice paper because it incorporates a lot of different techniques. The idea here is not to give a comprehensive review of lots of different papers; I'll just use that paper as an example. The presentation shouldn't last too long.
Okay, so what is quantization? The dictionary definition is that it's the division of some quantity into a discrete number of small parts. You can think about this example of a sine wave, taken from Wikipedia. It's got lots of real-valued numbers, and you could imagine quantizing it into eight different values, as shown here. The red line is the full-precision real-valued representation, and the blue staircase is what you get if you quantize it into eight different values. Since there are eight different values, you can index into them using three bits, so now you can quantize every point in the sine wave using a three-bit quantity. Okay, so that's an example of quantization.
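To make the sine-wave example concrete, here is a minimal sketch in Python/NumPy of quantizing a signal into eight levels (three bits); the level placement and variable names are my own, not the slide's.

```python
# Quantize a sine wave into 8 levels (3 bits), as in the Wikipedia figure.
import numpy as np

t = np.linspace(0, 2 * np.pi, 1000)
x = np.sin(t)                      # full-precision signal in [-1, 1]

n_bits = 3
n_levels = 2 ** n_bits             # 8 representable values
delta = 2.0 / n_levels             # bucket width over the range [-1, 1]

# Map each sample to an integer index 0..7, then back to a real value.
indices = np.clip(np.floor((x + 1.0) / delta), 0, n_levels - 1).astype(int)
x_quantized = -1.0 + (indices + 0.5) * delta   # center of each bucket

print(indices[:5])      # the 3-bit codes
print(x_quantized[:5])  # the blue "staircase" values
```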
As should be somewhat obvious, accuracy increases exponentially with the number of bits you have. Typically with neural network training we use 32-bit floating point, which has something like four billion possible values, and many hardware implementations actually focus on eight bits for inference, so you only have 256 values in an int8 representation. As you chop off bits, the problem of really representing the full precision of the values gets exponentially harder.
Okay, so why is quantization important? Floating-point operations are expensive and slow on many chips, and integer and binary operations tend to be much faster. Another reason has to do with the overall size of the system and memory usage: quantizing from FP32 to int8 improves the size and speed by a factor of four, and sometimes more, depending on how you quantize. So with quantized weights you can get to much smaller and faster networks. This table is from a nice review paper by Guo.
A
It
shows
kind
of
the
number
of
parameters
and
kind
of
modeled
modern,
neural
networks,
and
these
things
are
increasing
really
really
rapidly
and
Reza
has
sixty
point.
Two
million
weights,
remember
parameters
and
the
number
of
floating-point
operations
that
you
might
use
for
doing
a
single
inference
pass
here
would
be
you
know
about
eleven
billion
in
there,
and
you
can
see
you
know
if
you
look
at
image
net
on
the
right
as
you've
added
more
and
more
parameters.
Okay, some other things: energy usage is also lower for int8 than for floating point, which goes along with being smaller and faster and using fewer resources. Quantized networks can also be a little bit more robust to noise, because you've bucketed the values, so small perturbations are less likely to change the output. That's another reason people have sometimes quantized, even for adversarial robustness: some people have used quantization as a technique to defeat adversarial methods as well.
The issue is that deep networks really rely on high precision for their training, and often for their inference, and so the question is: how do you best quantize all these millions of numbers in a deep network to 8 bits, or sometimes lower, with minimal impact on the error rate?
The most obvious approach is uniform quantization: you take the number line and chop it up into a discrete set of buckets, and with each bucket you associate a real-valued number. Here delta is the size of the bucket; our scalar encoder used to do this kind of quantization. You can go back to the real number from any bucket: you take the index associated with the bucket, multiply it by some scaling factor, and add a bias, where the bias here would be half the bucket width, and that gives you the number associated with the bucket. Of course, you can't tell the difference between numbers within a single bucket. Rounding is one example of a way to do this, but there are many different ways.
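Here is a minimal sketch of uniform quantization as just described, with an explicit scale factor (the bucket width delta) and a half-bucket bias; the fixed range and the function names are illustrative assumptions.

```python
# Uniform quantization over a fixed range [x_min, x_max].
import numpy as np

def quantize_uniform(x, x_min, x_max, n_bits=8):
    n_levels = 2 ** n_bits
    delta = (x_max - x_min) / n_levels            # bucket width
    q = np.clip(np.floor((x - x_min) / delta), 0, n_levels - 1)
    return q.astype(np.int32), delta

def dequantize_uniform(q, x_min, delta):
    # Index times the scale factor (delta), plus a bias of half a bucket
    # width, recovers the representative value of each bucket.
    return x_min + (q + 0.5) * delta

q, delta = quantize_uniform(np.array([0.03, -0.41, 0.99]), -1.0, 1.0)
x_hat = dequantize_uniform(q, -1.0, delta)
```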
A more powerful method is non-uniform quantization. The problem with uniform quantization is that the numbers you care about may not be uniform on the number line: there may be many more numbers near zero, say, and many fewer away from zero. Non-uniform quantization lets you have buckets of different sizes, and with each bucket you associate a canonical number, so it's like a clustering technique. To do this you usually have a codebook that maps some index k_i to the actual real value v_i associated with it. Clustering is one approach, chosen so that each region has roughly the same number of values: if you had a lot of values near zero, you would have more buckets near zero and fewer outside. So that's one approach to this.
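A minimal sketch of that clustering idea, assuming scikit-learn's k-means as the clustering algorithm and a Laplacian-like distribution as stand-in data; the codebook then holds the canonical value for each bucket.

```python
# Non-uniform (clustered) quantization of scalar values with k-means.
import numpy as np
from sklearn.cluster import KMeans

values = np.random.laplace(loc=0.0, scale=0.1, size=10_000)  # weight-like data

k = 8                                             # a 3-bit codebook
km = KMeans(n_clusters=k, n_init=10).fit(values.reshape(-1, 1))

codebook = km.cluster_centers_.ravel()            # canonical value per bucket
indices = km.predict(values.reshape(-1, 1))       # 3-bit index per value
values_hat = codebook[indices]                    # dequantized values
```

Because k-means places more centers where the data is dense, values near zero get more, smaller buckets, exactly the behavior described above.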
A pretty powerful version of this is to not think about one number in isolation, but to look at a vector of numbers. Weight matrices are not single numbers; they're multi-dimensional. If you look at a vector of weights, you can do vector quantization, which is clustering in a high-dimensional space.
You still have an integer number of codebook entries, but each integer now points to a vector of values in some high-dimensional space, and this is the much more common technique used in deep learning. In these approaches, the overall size of the codebook determines the number of bits you need: if your codebook has 256 entries, you need 8 bits for each code, each quantized number, and you can choose the size of the codebook depending on how many bits you have.
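As a sketch of vector quantization on weights, assuming scikit-learn's k-means and random stand-in data: with a 256-entry codebook, each d-dimensional weight vector is stored as a single 8-bit index.

```python
# Vector quantization: cluster d-dimensional weight vectors, not scalars.
import numpy as np
from sklearn.cluster import KMeans

d = 9                                  # e.g. a flattened 3x3 kernel
weights = np.random.randn(5000, d)     # stand-in for a layer's weight vectors

k = 256                                # 2^8 entries -> 8 bits per vector
km = KMeans(n_clusters=k, n_init=4).fit(weights)

codebook = km.cluster_centers_         # (256, d) table of vectors
codes = km.predict(weights)            # one 8-bit index per weight vector
weights_hat = codebook[codes]          # reconstructed weights
```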
Okay, so how do you quantize a neural network? There are a bunch of things you could imagine quantizing. There's the input data coming in; that has to be quantized. There are all the weights and parameters in the network; those have to be quantized. There are the activation values, the actual dynamic values flowing through the network.
Those have to be quantized, and if you're going to do training, then the backpropagation error gradients also have to be quantized. In the literature, weight quantization is by far the most common, and most papers actually ignore the other pieces. I think they're just concerned with compression, with how you compress the network into the smallest network, and they don't care about some of these other things. Like I said, if you're interested in training, then the gradients have to be quantized as well.
One minor detail: we use batch norm a lot, and for inference, batch norm is usually folded back into the weights. It's just a linear operation on the layer's output, so you can fold it back into the weights, and you don't have to worry about quantizing the batch norm for inference.
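For reference, here is a minimal sketch of that folding step, using the standard batch-norm statistics; the function name and the linear-layer shapes are my own assumptions (the conv case folds per output channel the same way).

```python
# Fold batch norm into the preceding layer's weights for inference.
import numpy as np

def fold_batchnorm(W, b, gamma, beta, running_mean, running_var, eps=1e-5):
    # BN(y) = gamma * (y - mean) / sqrt(var + eps) + beta, with y = W x + b.
    # Since BN is linear, it can be absorbed into W and b directly.
    scale = gamma / np.sqrt(running_var + eps)     # one factor per output unit
    W_folded = W * scale[:, None]
    b_folded = (b - running_mean) * scale + beta
    return W_folded, b_folded
```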
Okay, so there are many, many ways to quantize a neural network.
Architectures now support mixed-precision quantization, so different layers can have completely different precisions, and this can be exploited. A lot of these techniques do pruning first: if you can set some weights to zero, that's the easiest form of quantization, and it reduces the size of the network. This is something we've done quite a bit of as well. There's clustering weights using vector quantization, and a lot of techniques interleave training and fine-tuning with quantization; it's not a one-step thing.
You treat it as part of the core loop. There are a couple of papers that have done things similar to what Marcus has discussed before, adding or simulating noise during training, and there are variational techniques, although I don't think this is a very well-explored space. There are some really sophisticated ones; I saw one that uses reinforcement learning.
Basically, it proposes some quantization, runs it on an actual hardware platform, looks at the result, and tries to build a predictive model of how a particular quantization technique will actually work on a given hardware architecture. Different hardware architectures have very different limitations and characteristics, so this was a pretty powerful technique. You can see there are tons of different things you can do, and this table gives a nice breakdown.
This is from the Guo paper. It splits things up into deterministic versus stochastic quantization, and in the deterministic category, rounding is the most basic way; it's one form of uniform quantization. There's vector quantization, and an interesting one is quantization as optimization, where you treat quantization as part of the overall optimization problem.
Okay, this slide I thought was kind of fun; this is a nice picture from work on something called deep compression. It puts into a single picture a lot of the different techniques that have been tried, and most papers seem to follow some aspect of this. The idea would be the following.
You either start with an untrained network or a fully trained network, and then you go through some sort of pruning step. This could be an iterative step that involves training as part of it, and you prune the connections down to some much smaller set. Then you do the hardcore quantization step, which is the middle piece here. The basic idea, and this is very common, is that you cluster the weights using some clustering algorithm, generate a codebook, and then go through an iterative process.
It's a k-means process where you constantly revise: once you've generated a codebook, you re-quantize the weights with the codebook, then you update the codebook itself, and you do this as a loop, so you're updating the codebook and the weights simultaneously. What they don't show here is that often there's retraining of the entire network done as part of this too.
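A minimal sketch of one iteration of that loop for a scalar codebook; this is the generic k-means alternation, not the deep-compression code itself.

```python
# One alternation of the codebook/weights loop described above.
import numpy as np

def codebook_iteration(weights, codebook):
    # Assignment step: re-quantize each weight to its nearest entry.
    dists = np.abs(weights[:, None] - codebook[None, :])
    assign = dists.argmin(axis=1)
    # Update step: move each entry to the mean of its assigned weights.
    for j in range(len(codebook)):
        members = weights[assign == j]
        if len(members) > 0:
            codebook[j] = members.mean()
    return assign, codebook

# In deep compression this loop is interleaved with retraining the
# shared weights, which the slide does not show.
```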
So the middle piece is the core quantization piece, and then, if you're interested in compression, there's some sort of encoding process at the end. The idea throughout is to retain as much of the original accuracy as possible. Don't worry about the reduction factors they show here; this is for a different network scheme.
A lot of the literature is just on compression, which is a little surprising to me, because that's only one of the tasks. They almost all make the point that running on hardware architectures is also super important, of course, but very few do anything more than quantize the weights, which is kind of interesting.
Yeah, so the technique I'll talk about doesn't have that, but a lot of the techniques actually go to binary or even ternary values, maybe minus one, zero, and one. That's the extreme end of it, and in our HTM work in the past we've always used binary synapses, and binary activations as well.
Yeah, that's right. Michele Covell talked about that in her paper, and they spent quite a bit of effort figuring out the right quantization for the activation function. If you're dealing with ReLU it doesn't matter too much, but if you're dealing with tanh or some other activation functions, it's important.
I just want to quickly mention a very quick anecdote: three bits is not terrible. It used to be that the CDC 172 supercomputer at Oregon State University had a console with a three-bit DAC, and so the test program to boot the machine included a three-bit recording of Merle Haggard, and it sounded like Merle Haggard.
Yeah, it's interesting. So much of deep learning treats the system as a black box: you just use PyTorch or TensorFlow and run on these big GPUs, and this whole aspect is quite ignored and forgotten. But for getting things running in practice, it's pretty important.
Okay, these are some figures from Michele Covell and Baluja's paper. This shows the histogram of weight values; I think this case is MNIST, but this applies to a lot of different networks. The x-axis is different buckets of weight values and the y-axis is the frequency, and you can see it's pretty non-uniform, and the distribution changes as a function of training. In these pictures it looks Gaussian, but I think they said it's closer to a Laplacian distribution. So this is a typical distribution.
Unfortunately, non-uniform quantization methods, as best as I can tell, are not well supported in hardware, and some of you may know better than I do. You need to store this codebook, which can clearly be different for different layers and different parts, and hardware architectures don't always have good support for that.
Okay, so I just want to focus on one example paper, and then I'll be done. This came out pretty recently and I thought it was a nice paper; there are some nice ideas in there. If you read the paper, there are also a lot of different techniques they use along the way to get their results, so I think it's a nice paper that encompasses a lot of different techniques, and they have the PyTorch code available as well.
This came out of Facebook. The core idea is the following: they're going to do non-uniform clustering of weights, but the clustering method is going to emphasize classification error, not closeness to the original weights. The clustering technique I mentioned earlier with vector quantization just looks to have a roughly uniform distribution of weight values assigned to the clusters, and it's trying to minimize the reconstruction error of the weights.
So if you were to go through the clustering, figure out what the quantized weight values are, and compare them against the original weight values, you'd be trying to minimize that difference. They say: well, that's not really what we care about. We care about the end accuracy of the network, the performance of the network. This figure walks through the idea nicely.
What it shows is in-domain inputs and out-of-domain inputs. Say you're trying to classify dogs versus cats: this big gray region here contains all the possible dogs and cats. Of course, it's a very high-dimensional space and it won't actually look like this, but this is conceptual. Outside of the dotted region are all the other possible images, most of which will be completely random noise.
But all we really care about is the subspace of cats versus dogs. So look at the original network, the side here with the gray boundary. The gray boundary classifies cats versus dogs; it's a complex nonlinear thing, and below it are all the dogs and above it are all the cats, but it also takes values outside of this dotted region.
Standard quantization tries to preserve the accuracy of the weights themselves, so most of what it's trying to do is match what's happening outside of this region, because most of the volume is outside of the region, and it effectively underweights what's happening inside. It might end up that if you just try to match the shape of this curve, you match it very closely outside, but inside the region you don't match it well, and because of that you'll misclassify.
Say you misclassify this husky dog or this cat, even though, if you step back and look at the entire space, it's modeling the decision region pretty well; the in-domain region is actually a very small part of it. So instead, what they're trying to do is what this green boundary shows: they don't care what's happening outside of the region; they're just trying to model the decision boundary within the in-domain region, only within the subset that you care about.
So they're going to specifically look at the reconstruction error for in-domain inputs, and that's the essence of their idea. Then they're going to fine-tune the weights and the codebook after quantization, and it turns out they also use knowledge distillation, which Lucas has been looking into, and this ends up being pretty critical in the fine-tuning step. So that's the basic idea of their technique.
You could, and it would help this problem, but it's really hard to properly cover the out-of-domain stuff, because remember, these are million-dimensional spaces, so the volume of the space is huge. It's really hard to exhaustively characterize the stuff that's not in the domain. Okay, so here's their clustering method. They have a couple of tricks that they use.
If you look at the convolutional filters here, you have your input features, and each input feature has a K-by-K kernel, say 3 by 3, and you have C_out of these filters. What they do is split each weight vector into smaller sub-vectors to make it easier to cluster. Typical vector quantization might take this entire volume as a single vector and cluster that, but they split it into smaller K-by-K-sized sub-vectors.
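A minimal sketch of that splitting step, with made-up layer sizes; a conv weight of shape (C_out, C_in, K, K) becomes C_out * C_in sub-vectors of length K*K, which are then clustered.

```python
# Split a conv weight volume into K*K sub-vectors for clustering.
import numpy as np

C_out, C_in, K = 64, 32, 3
W = np.random.randn(C_out, C_in, K, K)        # stand-in for a conv layer

subvectors = W.reshape(C_out * C_in, K * K)   # one 9-dim vector per kernel
print(subvectors.shape)                       # (2048, 9)
```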
I should say they're going to do this layer by layer. For a given layer, they take the weight volume and create these little sub-vectors, and then they do this clustering technique. The first step is to assign each of these weight sub-vectors to a cluster, and the clusters might be randomly initialized at the beginning, or initialized using some random subset of the weights themselves.
Then what they do is pick some subset of the training set, X. They look at the difference between the true weight sub-vector and each of the codebook entries, they project that difference using the entries of that training set, and they try to minimize that error. So they try to pick the codebook entries that minimize that projected error, and I should mention what this X here is.
Yeah, they start with the lowest layer and move to successive layers. So imagine this is some layer in the middle of the network: you've already quantized the stuff below you, and now you're looking at the activations coming in from some subset of the training set, and you're trying to reproduce those input samples as faithfully as possible in this projection. Okay, so that's how they assign a codebook entry to each weight sub-vector through this projection. And then they do this other step.
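A minimal sketch of that assignment step as I read it: each codebook entry is scored by how well it reproduces the sub-vector's output on the sampled input activations, rather than by its plain distance to the sub-vector. The shapes and names here are illustrative assumptions.

```python
# Assign a weight sub-vector to the codebook entry that minimizes the
# in-domain reconstruction error ||x (w - c)||^2, not ||w - c||^2.
import numpy as np

def assign_entry(w, codebook, x):
    # w: (d,) weight sub-vector; codebook: (k, d) entries; x: (n, d)
    # activations sampled from the (already quantized) layers below.
    errors = ((x @ (w[None, :] - codebook).T) ** 2).sum(axis=0)
    return errors.argmin()        # index of the best entry in-domain
```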
They do it in the following way: each layer is basically retrained after quantization. Here they keep the assignment of the weight sub-vectors to the clusters fixed, and they fine-tune the codebook entries. They run the network through the training set all the way to the top, compute the error, backpropagate it, compute the average gradient coming into each cluster, and update the clusters using SGD.
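A minimal sketch of that update, assuming the per-weight gradients have already been computed by backpropagation; names and shapes are illustrative, not the paper's code.

```python
# Update each codebook entry with the average gradient of the
# sub-vectors assigned to it, keeping the assignments fixed.
import numpy as np

def update_codebook(codebook, weight_grads, assign, lr=1e-3):
    # weight_grads: (n, d) gradients of the quantized sub-vectors;
    # assign: (n,) codebook index of each sub-vector.
    for j in range(len(codebook)):
        mask = assign == j
        if mask.any():
            codebook[j] -= lr * weight_grads[mask].mean(axis=0)
    return codebook
```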
I think they do this for an epoch or two after the previous quantization step, and they found that if they did this using knowledge distillation, with the uncompressed network as the teacher, it gave them significantly better results than if they just used the training set to do the fine-tuning. So they do this for each layer.
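A minimal sketch of a standard distillation loss of the kind described, in PyTorch, with the uncompressed network as teacher; the temperature and the exact form the paper uses are assumptions on my part.

```python
# Hinton-style distillation loss: the quantized (student) network is
# trained to match the softened outputs of the uncompressed teacher.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=1.0):
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
```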
The way they show their results is based on compression factor, how much compression they get, and they have different block sizes for the size of the codebook entries. They show a bunch of different things, and basically their technique, at least in this compression range, seems to do better than any other published technique by a pretty large margin. As an example, you can see here that K is the number of codebook entries.
If you look at K = 1024, you would need ten bits to represent each codebook index, plus the 1024 codebook entries themselves. Using that, they can compute the overall size of the network, which gives them a compression factor, and at that compression factor they are several percentage points above some of the best existing work. I think when they get to smaller compression factors they seem to be close to the state of the art.
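As a back-of-the-envelope version of that size computation, under assumptions I'm making explicit (fp32 original weights, fp16 codebook storage, log2(K) bits per index):

```python
# Rough compression factor for a vector-quantized layer.
import math

def compression_factor(n_subvectors, d, K, codebook_bits=16):
    original = n_subvectors * d * 32          # fp32 weights, in bits
    indices = n_subvectors * math.log2(K)     # e.g. 10 bits each if K = 1024
    codebook = K * d * codebook_bits          # the codebook must be stored too
    return original / (indices + codebook)

print(compression_factor(n_subvectors=2048 * 32, d=9, K=1024))
```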
Maybe they're a little bit worse; it's hard to tell from this graph. But at higher compression factors they hold up the accuracy quite a bit better. This particular plot is without knowledge distillation; here's another one with knowledge distillation, and there they're showing that for ResNet-50 they can maintain accuracies up to about 76 or 77 percent (77.8 percent) with pretty decent compression.
Here are two other papers that do a nice review of quantization methods. The Guo one is more sweeping, and I haven't really looked at the Krishnamoorthi one in detail yet, but it has a lot of very specific techniques. It's a little more concrete, so that might be a nice one to look at as well.