Description
GPU Kernels for Block-Sparse Weights
https://openai.com/blog/block-sparse-gpu-kernels/
https://d4mucfpksywv.cloudfront.net/blocksparse/blocksparsepaper.pdf
Discussion at https://discourse.numenta.org/t/openai-paper-review-gpu-kernels-for-block-sparse-weights/6440
A
Yeah, so today we'll go over this paper. Briefly, it's "GPU Kernels for Block-Sparse Weights," written by Scott Gray, Alec Radford, and Diederik Kingma, and it came out of OpenAI. They also have an associated blog post, but here I'm just going to focus on the paper; I think the blog post covers the same material as the paper.
A
But the basic idea here is very related to what we're interested in: just how do you get sparse networks running fast? And their whole take is, I think, a really good discussion of block sparsity, which I think is sort of the limit of what you can do efficiently with sparsity on today's GPUs. So I put down a bunch of high-level questions that we can go over.
A
Basically, the idea is this: think about two layers of neurons, so there's a layer of neurons here and a layer of neurons here. The typical scenario is that you have completely dense connectivity: every neuron here gets input from all of the neurons there, with various weight values, and you can display that as a matrix. So with N input units and N units in this layer you have an N by N matrix, and the problem is that the weights scale quadratically with N. Block-sparse weights are shown here, and basically you have blocks of the matrix that are either entirely zero or kept fully dense.
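To make the block-sparse layout concrete, here is a minimal NumPy sketch (not the paper's CUDA kernels, and the sizes and block density are made up for illustration): only the nonzero blocks of the N by N weight matrix are stored, and the matmul loops over just those blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 256, 32                                # layer width and block size (illustrative)
nb = N // B
layout = rng.random((nb, nb)) < 0.1           # boolean block mask, ~10% of blocks kept

# Only the "on" blocks are materialized, so memory scales with the number of
# active blocks rather than with N^2.
blocks = {(int(i), int(j)): rng.standard_normal((B, B)).astype(np.float32)
          for i, j in np.argwhere(layout)}

def block_sparse_matmul(x, blocks):
    """y = x @ W for a block-sparse W, looping only over the nonzero blocks."""
    y = np.zeros((x.shape[0], N), dtype=np.float32)
    for (i, j), w in blocks.items():
        y[:, j * B:(j + 1) * B] += x[:, i * B:(i + 1) * B] @ w
    return y

x = rng.standard_normal((4, N)).astype(np.float32)
y = block_sparse_matmul(x, blocks)            # shape (4, 256)
```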
C
My assumption is that once you pick these blocks, you've got to stay with them, right? Or can you reconfigure them as well? Okay, yeah. So let me ask this: on our side, on the left side there, we have this sort of bit-by-bit sparsity. Can you translate that into 8x8 or 32x32 blocks?
C
Yeah, yeah. My question, going back, is about doing this sort of transformation. We have some randomly sparse connectivity here that we're talking about; it looks kind of random, it's not block-structured. Can't you just take that and remap it, map those connections or bits into a block, do all your processing, and then map them back again?
B
For some problems you can do row and column permutations to kind of get the nonzeros all clustered, but only some sparse matrices lend themselves to that. When you do the row permutation here, it upsets the order over here. The simple example is: if you have one, zero, zero, one, like an identity pattern, it doesn't matter how you permute it; it's going to be the same thing. Yeah.
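A small illustration of that point, with assumed sizes: a pattern with one nonzero per row and per column (like the identity) can never be packed into dense blocks by permutations, since any B by B block can hold at most B of its nonzeros.

```python
import numpy as np

n, B = 8, 4
eye = np.eye(n, dtype=bool)                          # one nonzero per row and per column
permuted = eye[np.random.permutation(n)][:, np.random.permutation(n)]

def block_occupancy(mask, B):
    """Fraction of nonzeros inside each B x B block."""
    nb = mask.shape[0] // B
    return mask.reshape(nb, B, nb, B).sum(axis=(1, 3)) / (B * B)

print(block_occupancy(eye, B))        # diagonal blocks are only B/B^2 = 25% full
print(block_occupancy(permuted, B))   # still at most 25% per block, however you permute
```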
C
Part of it is you're just kind of pushing the problem around. So I guess maybe the question is: how often are you doing this? If these transformations of the network that give you the blocks, and the block mappings and block sizes, don't change very rapidly, then is it possible to set this up so that the system is very efficient? You know, once you get it going, inference could be fast. Something like that; my reading may be wrong, I think.
A
That's really where the power of this is. You don't typically see too many fully connected layers in networks today, so you have to ask: when is this actually helpful in a real network? One case is a fully connected feed-forward network. But the most common architecture is CNNs, and there you've already kind of lost that: each neuron doesn't connect to everything else, it connects to a small subset of the neurons, and then you kind of copy that pattern across positions. So that's that case.
A
Because they're recurrent, all of these units project back onto themselves, yeah, and with N units here you have N squared weights, and for gated recurrent networks it's actually even more than that. So recurrent networks are really limited in how large you can make them. And I don't know whether people have looked at convolutional structures for these recurrent weights, but it makes less sense there, and so block sparsity lets you add many more units.
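Rough back-of-the-envelope arithmetic for that point, with made-up numbers rather than anything from the paper: at a fixed recurrent-weight budget, block sparsity lets the hidden state be several times wider.

```python
def dense_params(n):
    # dense recurrent weight matrix: N x N
    return n * n

def block_sparse_params(n, block=32, block_density=0.05):
    # only a fraction of the B x B blocks are kept
    active_blocks = int((n // block) ** 2 * block_density)
    return active_blocks * block * block

print(dense_params(4096))                    # 16,777,216 recurrent weights
print(block_sparse_params(18000, 32, 0.05))  # ~16.2M weights at roughly 4x the hidden width
```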
A
Essentially you're stuck with a fairly low-dimensional hidden state, yeah, so this has been a real limiting factor for recurrent nets like LSTMs. We also have a recurrent structure like this, but our connections are extremely sparse. Okay, so in our default network there are 65,000 cells here, but each active dendrite might only have 20 connections, so.
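For reference, the sparsity level implied by those numbers (assuming every one of the 65,000 cells is a potential presynaptic partner):

```python
cells = 65_000
connections_per_dendrite = 20
density = connections_per_dendrite / cells
print(f"{density:.4%} dense, i.e. ~{1 - density:.2%} sparse")   # ~0.03% dense
```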
A
The number of times you have to pass information through to go from any random unit to any other, so the path length, increases. So there's a trade-off between how many steps information has to take to get across versus the sparsity, and in a small-world network you can balance those two. I think the path length goes like log n.
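As a hedged illustration of that trade-off, a Watts-Strogatz small-world graph keeps the average path length near the classic ln(n)/ln(k) estimate even though each node keeps only about k links (n, k, and p below are arbitrary choices, not values from the discussion):

```python
import math
import networkx as nx

n, k, p = 1024, 20, 0.1                         # nodes, neighbors per node, rewiring prob.
g = nx.connected_watts_strogatz_graph(n, k, p)
print(nx.average_shortest_path_length(g))       # a few hops, despite only ~k links per node
print(math.log(n, k))                           # ln(n)/ln(k) ~ 2.3, the small-world estimate
```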
E
The way I'm visualizing this is: you have essentially a window over the input units and the output units, and then you have a dense or sparse contribution within it. A block is sort of like a chunk of units that all talk to each other through weights, so they're all connected to each other, which would be like a clique in a small-world network. So is that actually what this is, a small-world network?
A
So here they're using a pretty large matrix, twelve thousand by twelve thousand, and a block size of 32, which I think is their sweet spot, and they show that as you go to higher sparsity you get a speed-up. This kind of curve is what you would expect for a perfect speed-up, but it's a little hard to read.
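The "perfect" speed-up referred to here would be runtime scaling exactly with the fraction of nonzero blocks, i.e. 1/(1 - sparsity); the numbers below are just that ideal curve, not measurements from the paper.

```python
# Ideal speed-up if runtime is proportional to the fraction of nonzero blocks.
for sparsity in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"{sparsity:.0%} sparse -> ideal {1 / (1 - sparsity):.0f}x speed-up")
```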
A
I understand, but I think it still has an impact. If you have multiple layers and you want any unit to be able to communicate information to any other node, then it's not one small-world network; you need multiple small-world connections, I think, to increase the probability of that. There could be a cascade effect: if it's really sparse, there might not be any connection from here up to here, but if it's small-world, where the path length is, let's say, three, then you could still get the information across, I think.
F
For cuSPARSE I don't have an exact number, but for us cuSPARSE has been valuable because we're modeling with matrices that are just larger than what can fit in GPU memory in dense form, maybe 50,000 by 50,000 with like 98 or 99 percent sparsity, and that just can't be done with the standard dense format. It's just a memory issue.
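Rough memory arithmetic for the case described, with the shapes treated as assumptions: a dense 50,000 x 50,000 float32 matrix versus storing only the ~2% nonzero entries in a CSR-like format (value plus column index per nonzero, plus row pointers).

```python
n = 50_000
dense_bytes = n * n * 4                            # float32
nnz = int(n * n * 0.02)                            # 98% sparse
csr_bytes = nnz * (4 + 4) + (n + 1) * 8            # 4B value + 4B column index, 8B row pointers
print(f"dense: {dense_bytes / 1e9:.1f} GB")        # ~10 GB
print(f"CSR:   {csr_bytes / 1e9:.1f} GB")          # ~0.4 GB
```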