From YouTube: Introduction to the Roofline Model
Description
Samuel Williams of LBNL presents a talk on Introduction to the Roofline Model. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Due to some data loss, this recording is missing the start of the talk. Session Chair: Yan Zhang.
A
This forms the basis of what we call a DRAM Roofline model, which is kind of the simple version covering compute and DRAM. It basically boils down to answering the question: which takes longer? Does it take longer to move your working set, your vectors, your matrices, from DRAM to the processor or from DRAM to the GPU, or does it take longer to actually compute on them?

We can come up with a very simple equation that says the actual run time for your loop nest is going to be the maximum of either how long it takes to compute, which is the number of floating-point operations you execute divided by the peak flop rate, or the time required to move the data, which is how big your data sets are and how fast you can actually move them: what is your peak memory bandwidth?
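As a sketch, this bound can be written directly; the machine numbers below are hypothetical, not from the talk:

```python
def roofline_runtime(flops, bytes_moved, peak_flops, peak_bw):
    """Lower bound on loop-nest run time: the max of compute time
    (flops / peak flop rate) and data-movement time (bytes / peak BW)."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

# Hypothetical machine: 10 TFLOP/s peak compute, 1 TB/s DRAM bandwidth.
# For this kernel, data movement (0.024 s) dominates compute (0.0002 s).
t = roofline_runtime(flops=2e9, bytes_moved=24e9,
                     peak_flops=10e12, peak_bw=1e12)
```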
A
Of course, unfortunately, arithmetic intensity came about in the early 2000s, around 2007, as I was writing my thesis, and it has nothing to do with artificial intelligence. Now, I apologize if I use the term AI, and it can be confusing, but when I use it in this context it will always mean arithmetic intensity. What it really is is a measure of data locality: the total flops that you perform divided by the total bytes that you move for your loop nest. For the DRAM Roofline, this is the total number of bytes that you move to and from DRAM, and thus it will include any kind of cache effects and any kind of prefetcher effects, and it will almost invariably be different from the total number of loads and stores that you request from the memory subsystem.
A
A
So
how
do
we
visualize
this
rootline
model?
Well
we're
going
to
take
this
equation,
this
flop,
being
the
minimum
of
peak
flops
and
the
product
of
arithmetic,
intensity
and
peak
bandwidth,
and
we
want
to
plot
it
on
a
log
log
scale.
Why
do
we
do
that
for
one?
It
makes
it
extremely
easy
to
doodle
things
on
a
whiteboard
and
two.
It
makes
it
very
easy
to
make
extrapolations
for
moore's
law
where
performance
used
to
double
every
few
years.
A
So how do we do this? Well, the vertical axis is the attainable flop rate for our loop nest, and the horizontal axis is the arithmetic intensity for our loop nest. What we see is that one term in this equation is the peak flop rate of the machine. The other term, on a log-log scale, is this product of arithmetic intensity and, in this case, GPU HBM bandwidth: the roofline. The minimum function says that we must be on or below this curve.
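A minimal sketch of that minimum, with hypothetical peak numbers standing in for a real GPU:

```python
def attainable_gflops(ai, peak_gflops, peak_gbs):
    """Roofline ceiling: the minimum of the peak flop rate and
    arithmetic intensity times peak bandwidth."""
    return min(peak_gflops, ai * peak_gbs)

# Hypothetical GPU: 7000 GFLOP/s peak, 900 GB/s HBM bandwidth.
# Below the machine balance (7000/900 ~ 7.8 flops/byte) the bandwidth
# diagonal applies; above it, the flat peak-flops ceiling applies.
low = attainable_gflops(0.083, 7000.0, 900.0)    # on the diagonal
high = attainable_gflops(50.0, 7000.0, 900.0)    # on the flat ceiling
```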
A
So it's very important to keep this in mind, because each machine will have a different machine balance. If you simply increase the flop rate, it may not actually improve your application performance; if you increase bandwidth, it may not increase your application performance. It all boils down to where your loop nest's arithmetic intensity lies with respect to this inflection point.
A
You have this other region of being less than the machine balance but greater than the product of AI and peak bandwidth. Once again, this is unattainable performance; you could basically think of it as having to move data faster than the speed of light. It's not going to happen.

You also have this region here, where you are less than the machine balance but you're also getting close to the machine bandwidth. In this case I'm showing about 50% of machine bandwidth. In that region you're basically memory bandwidth bound; you have about a factor of two to go to improve performance, but that's it.

Correspondingly, there is a compute-bound region where you're getting at least half of your peak flop rate but you're also greater than the machine balance. And then, finally, there's this other region where you are below both half of peak flops and half of peak bandwidth, which in this case we might deem just poor performance. We want to get ourselves out of this region and into either the blue or pink region, but we can never be above the solid blue or solid pink line.
A
So let's think about an example today. A typical machine balance today on a GPU or CPU might be 5 to 10 flops per byte. That corresponds to about 50 floating-point operations per double-precision word. That's really, fundamentally, an artifact of technology and money, and it's unlikely to improve in the future, so just get used to that.
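As a quick sanity check on that conversion (the balance value below is an illustrative point in the 5-10 range, not a measured number):

```python
# A machine balance in the 5-10 flops/byte range, times 8 bytes per
# double-precision word, gives roughly 40-80 flops per word, i.e. ~50.
balance = 6.25                          # flops per byte (hypothetical)
bytes_per_double = 8
flops_per_word = balance * bytes_per_double
```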
A
So that means we can set our inflection point, our machine balance, at about five. We can consider a trivial vector-vector operation, in this case kind of a DAXPY, where we read two vectors x and y, scale one of them, add them together, and write to a third vector z.
A
This
loop
nest
does
two
floating
point
operations
per
iteration
and
it
transfers
24
bytes
per
iteration.
That
is
read,
read
and
write.
That
gives
us
an
arithmetic
intensity
of
2,
divided
by
24
1
12
of
0.083
flops
per
byte.
Where
does
0.083
lie
on
this
graph?
Well,
it's
very
very
far
to
the
left
of
the
machine
balance.
That
means
that,
no
matter
what
we
do,
we
are
ultimately
memory
bandwidth
bound
on
this
curve
and
we
will
get
a
fairly
low
flop
rate.
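A sketch of the kernel being analyzed; the function name is illustrative, and the byte count assumes double-precision data with no cache reuse:

```python
def scaled_vector_add(a, x, y):
    """z[i] = a * x[i] + y[i]: per iteration, 2 flops (one multiply,
    one add) and 24 bytes of DRAM traffic (read x, read y, write z)."""
    return [a * xi + yi for xi, yi in zip(x, y)]

flops_per_iter = 2
bytes_per_iter = 3 * 8                    # three 8-byte doubles
ai = flops_per_iter / bytes_per_iter      # 1/12 ~ 0.083 flops/byte
```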
A
So,
let's
consider
a
more
interesting
example
where
we
have
some
degree
of
data
reuse.
So
in
this
case
we
have
a
pde,
that's
descritized
into
a
seven
point.
Stencil.
A
We
can
observe
that
we're
going
to
read
seven
points
here
right,
one
point
here,
but
if
we
do
this
simple
calculation
of
saying
seven
flops
divided
by
64
bytes,
we
get
the
wrong
arithmetic
intensity.
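To make the distinction concrete, a sketch contrasting the naive byte count with a cache-filtered count; the ideal-reuse assumption (one read plus one write per point) is mine, not a number from the talk:

```python
bytes_per_double = 8

# Naive count: 7 reads + 1 write = 8 words touched per grid point.
naive_ai = 7 / (8 * bytes_per_double)     # ~ 0.109 flops/byte

# With ideal cache reuse, each input point is fetched from DRAM once
# and shared across neighboring stencils: ~1 read + 1 write per point.
cached_ai = 7 / (2 * bytes_per_double)    # ~ 0.44 flops/byte
```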
A
A
So,
let's
think
back
to
this
question
of
what
is
good
perform,
and
so,
if
we
think
back
to
that
initial
set
of
loop
nests
where
performance
is
completely
random,
what
can
we
do?
Well,
we
can
take
that
completely
random
set
of
loop
nests
and
rather
than
ordering
them
by
just
the
order
in
which
we
happen
to
run
that
loop
nest.
We
can
order
them
based
on
arithmetic
intensity.
A
We
can
then
plot
those
performance
numbers
relative
to
the
mach,
the
roofline
model,
the
the
flop
and
the
bandwidth
capability
of
our
machine
and
highlight
which
of
those
kernels
actually
lie
in
our
bandwidth
bound,
which
lie
in
the
compute
bound
and
which
lie
in
the
poor
performance
regions
of
our
code
of
our
application
machine.
A
This
means
that
we
can
get
some
kernels
that
actually
have
very
low
performance,
but
are
making
good
use
of
the
machine
because
they're
making
high
use
of
memory
bandwidth,
we
have
other
kernels
which
are
getting
high
performance
like
this
red
one,
but
it's
actually
making
poor
use
of
the
machine
because
it's
not
actually
getting
more
than
50
percent
of
stream
or
50
of
peak
bandwidth.
A
Thus,
we
can
focus
our
optimization
efforts
on
those
three
red
dots
to
try
to
improve
their
performance.
Broadly
speaking,
roofline
is
going
to
be
made
of
two
components
and
that's
kind
of
your
high
level.
Take
away
from
this,
that
is,
you
have
the
machine
model
which
defines
the
diagonals
in
the
roof
line.
This
is
your
peak
bandwidth
and
your
peak
flops
and
the
obvious
transition
point
between
the
two
and
then
you
also
have
the
application
characteristics.
A
These
are
the
dots
that
are
defined
by
for
each
loop
nest,
how
many
flops
it
performed
and
how
many
bytes
it
moved
the
basically.
This
boils
down
into
two
activities:
machine
benchmarking,
to
give
us
the
lines
and
application
instrumentation
to
give
us
where
the
dots
actually
lie.
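A sketch of how the two activities combine; the peak numbers and the per-kernel counter values below are hypothetical:

```python
# Machine model, from benchmarking (hypothetical peaks).
peak_gflops, peak_gbs = 7000.0, 900.0
balance = peak_gflops / peak_gbs

# Application characterization: per-kernel (GFLOPs, DRAM GB) from
# instrumentation, e.g. hardware counters (illustrative values).
kernels = {"triad": (2.0, 24.0), "stencil": (7.0, 16.0)}

dots = {}
for name, (gflops, gbytes) in kernels.items():
    ai = gflops / gbytes                       # flops per byte
    ceiling = min(peak_gflops, ai * peak_gbs)  # Roofline bound
    region = "compute-bound" if ai >= balance else "bandwidth-bound"
    dots[name] = (ai, ceiling, region)
```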
A
So
when
we
go
to
actually
optimize
these
codes,
what
do
we
do?
Well,
we
think
about
how
we
first,
how
do
we
get
to
the
roof
line?
That
is,
for
these
red
dots?
We
want
to
take
that
red
dot
and
get
it
as
close
to
the
peak
flop
rate
of
the
machine.
That
is,
we
want
to
move
this
red
dot
vertically,
but
that's
not
the
only
case
we
may
have.
If
we're
down
here
at
this
green
dot,
we
can't
really
move
this
green
dot
vertically
too
easily.
A
What
we
really
need
to
do
is
increase
its
arithmetic
intensity
to
increase
its
arithmetic
intensity.
We
have
to
remove
data
or
move
less
data
by
moving
less
data,
the
ratio
of
flops
to
data
movement
increases,
and
thus
we
can
basically
take
this
green
dot
and
have
it
slide
along
the
diagonal
to
higher
performance.
A
So
this
raises
the
question:
how
can
performance
ever
be
below
the
roofline?
So
there
are
a
number
of
scenarios
that
this
actually
can
occur
in.
We
can
imagine
the
case
where
we
have
insufficient
cash
bandwidth
and
insufficient
data
locality
that
is
rather
than
having
dram
boundary
performance.
The
cash
bounds
are
performance.
A
A
Unless
we
have
completely
idle
sms,
we
can
have
something
else
where
we
have
an
interesting
instruction
mix
where
we're
not
using
the
fuse,
multiply,
add
on
a
gpu
or
we
have
mixed
precision
or
we're
not
really
using
tensor
cores
for
machine
learning
applications
or
finally,
we
may
be
running
integer
heavy
code
codes
where
we're
not
really
thinking
about
how
many
flops
we're
doing
we're
thinking
about
how
many
instructions
per
second
we're
doing
in
the
traditional
roof
line.
Arithmetic
intensity
is
flops
per
byte.
A
If
we
have
no
flops,
we
have
an
arithmetic
intensity
of
zero
which,
on
a
log,
log
scale
is
unplotable,
so
in
that
case
we
really
want
to
switch
to
some
thing
like
instructions
per
byte
or
instructions
for
transaction.
So
this
has
given
rise
to
a
number
of
research
activities
that
have
been
codified
into
papers
and
methodologies.
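A sketch of that substitution; the counter values are illustrative, and the 32-byte transaction size is an assumption about the memory subsystem:

```python
# For a kernel with no floating-point work, replace flops in the
# numerator with instructions retired (hypothetical counter values).
instructions = 4.0e9
dram_bytes = 8.0e9
instr_per_byte = instructions / dram_bytes              # 0.5 instr/byte

# Or express it per memory transaction (assuming 32-byte transactions).
instr_per_transaction = instructions / (dram_bytes / 32)
```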
A
In
this
case,
we
have
a
hierarchical
roofline
model
for
gpus.
I
point
you
to
the
paper
by
charlene
from
last
year:
there's.
A
Similarly,
a
roofline
scaling,
trajectories
methodology
that
khaled
ibrahim
developed,
which
allows
us
to
understand
thread
scalability
as
well
as
potentially
gpu
thread
block
scalability,
where
you
can
actually
understand.
As
you
increase
the
amount
of
parallelism,
how
does
performance
and
data
locality
change
that
is?
A
That is, did we actually drive up performance to the point where we became memory bandwidth bound? When it comes to the instruction mix, Charlene, as well as, more recently, Torsten, Yunsung, and Yan Zhang, have been developing methodologies to allow for the analysis of tensor-core-accelerated applications; a preview of that existed in the same paper I mentioned, on the left. And then, finally, Nan Ding developed the instruction Roofline model for GPUs, which allowed her to understand performance as a function of the number of warps per memory transaction. That tells us, for integer-heavy codes, how likely it is that we are limited by instruction throughput rather than flops.
So, in summary, we can think about why we use Roofline. First, we use it to determine when we're done optimizing code. That is, if you're right at the Roofline limit, you're done with simple optimization, but you may need to motivate algorithmic changes; that is, if you're at the Roofline in the bandwidth-bound regime.
A
It
allows
you
to
understand
why
cpu's
gpus
or
other
architectures
may
be
different
or
similarly
how
an
ampere
may
give
you
better
performance
than
a
volta,
and
thus
it
allows
you
to
predict
that
performance
on
future
machines,
so
takeaway
roofline
allows
you
to
understand
that
application
relative
to
the
machine
capability,
and
it
really
is
useful
for
helping
frame
the
conversation
between
application
developers
who
know
the
application
well,
computer
scientists
who
may
be
very
good
at
performance,
optimization,
applied
mathematicians
who
understand
the
understanding
underlying
algorithms
and
the
processor
vendors,
who
may
be
extremely
knowledgeable
about
the
mic
or
architecture
of
the
target
architecture,
but
may
not
understand
the
applications
as
well
as
other
groups
here,
and
thus
it
provides
this
common
mental
model
and
common
language
for
discussing
optimization.
B
Yeah, thank you, Sam. That was a very good presentation; I think both our panelists and our audience will learn a lot. I think the Roofline model is a must-have tool for HPC developers. So maybe we have just one question from the audience, or, if that's okay for you, you can also stay with us for several minutes in the chat.
C
Okay, so, Sam, thanks. I'd like to ask actually two questions. The first is: can you give some examples that have used the Roofline model from start to end, mainly relying on the data shown in your Roofline model outputs? And the second question: are there any alternatives? In the sense that I don't think this is a silver bullet, and if we go back to when you first published it with Waterman and Patterson, you might already have looked into the alternatives.
A
Right
so,
in
most
cases,
we've
actually
started
with
the
applications.
After
the
fact
that
is,
the
applications
in
many
cases
have
existed
for
years,
if
not
decades,
and
thus
we're
applying
roofline
to
that
existing
application.
A
There
are
a
few
cases
where
root
plane
has
been
very
useful
for
applied
mathematicians
to
think
about
where
their
performance
actually
lies
relative
to
the
roofline
for
a
given
discretization
of
a
pde.
When
you
end
up
in
that
regime,
you
may
see
that
now
and
forever,
your
existing
methods
will
be
bandwidth
limited
and
the
only
way
you're
going
to
get
better
performance
on
those
is
to
increase
memory
bandwidth.
Unfortunately,
in
that
case,
bandwidth
increases
very
slowly
relative
to
compute,
and
thus
they
are
in
a
bit
of
a
quandary.
A
They
rectify
that
by
reinventing
the
algorithm,
they
change
the
discretization
of
their
pde
so
that
they
actually
move
less
data
while
performing
more
total
flops.
The
result
is,
they
can
get
both
get
better
performance
and
they
do
less
work
because
they're,
basically
exploiting
the
excess
compute
capability
on
a
cpu
or
gpu.