From YouTube: Current State of CRAY GPU Compiler
Description
John Levesque (HPE)
Current State of CRAY GPU Compiler
What I want to do is show you the use of our tools and compilers to move an application to a GPU. Everyone has talked about putting in do concurrent, putting in target directives, or whatever, but the question is: where do I put them? So over the years, first with the Titan work at Oak Ridge and now with the Frontier work at Oak Ridge and El Capitan at Livermore, we have developed a suite of tools to really help users move to the GPUs.
Okay. So this, once again like the previous speaker's, is really a three-hour talk, and I'm just going to hit the highlights.
First off, we have an excellent performance analysis tool called perftools, and Apprentice is a GUI for looking at things like timelines, call chains, MPI communication, etc.
The other thing, which has recently been extended to generate OpenMP offload, is Reveal. Now, I would not use Reveal on a C++ code; I would use it on Fortran. It is really a scoping tool, and I'm going to show the use of Reveal in this presentation. Now, contrary to where NVIDIA is going these days, I love directive-based programming models.
The problem is you're always going to have to use some directives, and, you know, they're portable. There are a number of vendors that support both OpenMP and OpenACC. We support both, and we generate code for AMD and NVIDIA, so there really is portability.
So what I'm going to show is just some perftools output from a very simple code called Himeno; I don't want to do a complicated one, because we're really short on time. To use perftools, all you do is load a perftools-lite module, build the application, and run the application, and the results come out.
The -rm flag here is to get an annotated listing to go with the statistics. So here are my statistics: in Himeno there's one routine that uses all the time. It does a halo exchange, and that's where the real problem is going to hit you when you move to the GPU. It also breaks the time down into the maximum function times (this was run with two MPI tasks per node), load imbalance, etc.
The other thing is that it shows you, at the line level, where all the time is being spent. So if I now look at that annotated listing, I see that all the time is being spent right here: there is an iteration loop, and then a triple-nested loop with a reduction. This is the Jacobi relaxation.
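
For context, the loop nest perftools is pointing at has roughly this shape. This is a minimal, self-contained sketch of a Jacobi sweep with a residual reduction; the names kmax, jmax, imax, p, wrk2, and gosa follow the talk, but the stencil body is illustrative, not the actual Himeno source.

```fortran
subroutine jacobi_sweep(p, wrk2, imax, jmax, kmax, omega, gosa)
  implicit none
  integer, intent(in)  :: imax, jmax, kmax
  real,    intent(in)  :: p(imax, jmax, kmax), omega
  real,    intent(out) :: wrk2(imax, jmax, kmax), gosa
  integer :: i, j, k
  real    :: s0, ss

  gosa = 0.0
  do k = 2, kmax - 1
     do j = 2, jmax - 1
        do i = 2, imax - 1
           ! 7-point stencil: the i-1 / i+1 references are what let the
           ! compiler unroll this loop for cache reuse (noted below)
           s0 = ( p(i-1,j,k) + p(i+1,j,k) + p(i,j-1,k) + p(i,j+1,k) &
                + p(i,j,k-1) + p(i,j,k+1) ) / 6.0
           ss = s0 - p(i,j,k)
           gosa = gosa + ss*ss              ! the reduction
           wrk2(i,j,k) = p(i,j,k) + omega*ss
        end do
     end do
  end do
end subroutine jacobi_sweep
```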
Okay, so now the big question is: what are the loop iterations? Because in order to really figure out how best to put this on the accelerator, I want to know what kmax, jmax, and imax are.
So we have another thing, called perftools-lite-loops, and what this is going to do is get us the loop iteration counts. Now I'm displaying the call tree, where this is the main program that calls Jacobi. (I think that's an error; I don't think Jacobi uses over a hundred percent of the time.) Then here's our looping structure, and it shows us the average loop trip count. And then this is the initialization, and the MPI down here.
Okay, so in the annotated listing you also notice that it tells me the compiler vectorized this loop and unrolled it by three. The reason it unrolled it is all of these i, i-1, and i+1 references: you're really effectively utilizing cache that way.
Okay, and now what we're going to do is just use Reveal to generate an OpenMP code. So we bring up Reveal and select a loop. Not that outer loop on iterations; we don't want to do that. It's not parallelizable, because it's iterating, and it even calls an MPI routine here. So we drop down, choose the k loop, and bring it up.
Another window comes up, and it asks: do you want to scope for a GPU or scope for a CPU? I do CPU initially, and then it comes back and tells me all of the variables that are scoped private and shared. This one here means there's a reduction, but there are no inhibitors to this loop, and so I just insert directives. And look at that: here are the directives. You notice it always uses default(none); this is a good way to do it.
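
The directive Reveal inserts for the CPU case looks roughly like the following. This is a hedged sketch around the same loop nest as above, not Reveal's verbatim output; the point is the explicit scoping that default(none) forces.

```fortran
!$omp parallel do default(none)                   &
!$omp    shared(p, wrk2, imax, jmax, kmax, omega) &
!$omp    private(i, j, k, s0, ss)                 &
!$omp    reduction(+:gosa)
do k = 2, kmax - 1
   do j = 2, jmax - 1
      do i = 2, imax - 1
         s0 = ( p(i-1,j,k) + p(i+1,j,k) + p(i,j-1,k) + p(i,j+1,k) &
              + p(i,j,k-1) + p(i,j,k+1) ) / 6.0
         ss = s0 - p(i,j,k)
         gosa = gosa + ss*ss
         wrk2(i,j,k) = p(i,j,k) + omega*ss
      end do
   end do
end do
!$omp end parallel do
```

With default(none), any variable that isn't explicitly scoped becomes a compile-time error instead of a silent data race, which is why it's a good habit.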
So that's OpenMP, but now I want to generate GPU code. All right, so I go back and I choose GPU, and now there are several things going on here. Notice the G; the G is very important. Basically, what that's saying is that those arrays are in a module, so we're going to have to take special extra care to make sure the arrays are updated prior to returning, because those arrays may be used in other places. When you have a module or a common block, you're going to have problems like this with both OpenACC and OpenMP. Now, unified memory gets you away from this problem, so one thing we're doing is making sure we generate extremely good unified-memory code, so that you don't have to deal with this global-variable issue.
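
As a point of reference, OpenMP 5.x lets a program assert this directly. A minimal sketch, assuming a compiler and GPU that actually support unified memory:

```fortran
! Declared once in the specification part of a program unit; it tells
! the implementation that host and device share one address space, so
! module/global arrays no longer need explicit map/update traffic.
!$omp requires unified_shared_memory
```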
Okay, so basically these are the OpenMP offload directives that were generated. Now the problem is that there are some map(always, ...) clauses, and that is because of this global nature. In other words, it's always going to have to write some arrays back to the host, and copy one array to and from the host.
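
Put together, the generated offload construct looks something like the following. This is a sketch of the shape, with illustrative map clauses rather than Reveal's exact output; the point is how the module arrays force map(always, ...) traffic on every invocation.

```fortran
! p is copied to and from the host, and wrk2 is written back, every
! single time the construct is entered, because the compiler cannot
! prove the module arrays are untouched elsewhere.
!$omp target teams distribute parallel do collapse(3)   &
!$omp    map(always, tofrom: p) map(always, from: wrk2) &
!$omp    private(s0, ss) reduction(+:gosa)
do k = 2, kmax - 1
   do j = 2, jmax - 1
      do i = 2, imax - 1
         s0 = ( p(i-1,j,k) + p(i+1,j,k) + p(i,j-1,k) + p(i,j+1,k) &
              + p(i,j,k-1) + p(i,j,k+1) ) / 6.0
         ss = s0 - p(i,j,k)
         gosa = gosa + ss*ss
         wrk2(i,j,k) = p(i,j,k) + omega*ss
      end do
   end do
end do
```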
Okay, so now we're going to use perftools in order to profile the GPU. Now, perftools is not as automatic as the lite version: you have to first do a pat_build, then run the instrumented version, and then do a pat_report. But this gives you excellent information.
Well, there's a loop in between, so yes, p is set equal to wrk2, and then there's an allreduce. But we want to look at the halo exchange, because we cannot afford the message passing around this halo exchange. Okay, so what I'm going to move to (I'm sorry, I can't find the version of this code where I did the halo exchange in Himeno) is Leslie3D. Oh, before I go there, I want to tell you the other thing I did, which was to put all the kernels that use the variables accessed in that loop onto the accelerator: I want everything to be on the accelerator. This is the thing that really reduced the data movement. Okay. So when you are packing buffers on the accelerator, you typically have this.
This is kind of the north face: pulling that part out of the 3D array, and then sending the north to my neighbor. To put that on the accelerator is very easy: you just do a target, and then I'm updating the host with the buffer. Now, there are corresponding receives.
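
Concretely, the typical pattern looks something like this. A hedged sketch for the north face; the buffer, rank, and tag names are illustrative, and p and north_buf are assumed to be mapped in an enclosing target data region.

```fortran
! Pack the north face of the 3D array into a contiguous buffer on the
! accelerator, pull the buffer back to the host, then hand it to MPI.
!$omp target teams distribute parallel do collapse(2)
do k = 1, kmax
   do i = 1, imax
      north_buf(i,k) = p(i, jmax-1, k)
   end do
end do
!$omp target update from(north_buf)
call MPI_Send(north_buf, imax*kmax, MPI_REAL, north_rank, tag, &
              MPI_COMM_WORLD, ierr)
```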
And, excuse me, the receives come back to the host... I'm sorry, and then you do this; I'm sorry, this one is on the accelerator too. Well, anyway, you know what I mean. Okay, but there is something that's much, much nicer, and that is using device arrays or device pointers.
So what I've done is I've made all of the buffers device pointers, and now I do not have to do any updates. All I do is use those pack and unpack loops on the accelerator, and then I use the device pointer, and it turns out that now the data doesn't even go to the host. It goes directly from the accelerator to the NIC, and so this really improves your halo exchanges; and the same thing goes for the receive buffers.
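
With device-resident buffers and a GPU-aware MPI, the same exchange drops the host round trip entirely. A sketch, assuming north_buf is already mapped on the device and the MPI library accepts device addresses:

```fortran
! Pack on the accelerator as before...
!$omp target teams distribute parallel do collapse(2)
do k = 1, kmax
   do i = 1, imax
      north_buf(i,k) = p(i, jmax-1, k)
   end do
end do
! ...but with no target update: pass the device address straight to
! MPI, so the transfer goes accelerator -> NIC without the host copy.
!$omp target data use_device_addr(north_buf)
call MPI_Send(north_buf, imax*kmax, MPI_REAL, north_rank, tag, &
              MPI_COMM_WORLD, ierr)
!$omp end target data
```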
Okay, now, I actually ran this on Piz Daint in Lugano, and using device buffers doesn't really help you that much on small numbers of nodes. But when you get to large numbers of nodes, you really get a significant increase in speed, because you're cutting down all of that overhead of transferring the data back and forth to the host.
Now, the other thing I wanted to show you is that we have this debug environment. One of the developers, who is an awesome, awesome coder, put in his own debug. This was way back in the days of Titan: I was porting S3D with OpenACC to Titan, I needed something to help me debug the code, and he said: hey, I have this CRAY_ACC_DEBUG.
There are three levels, one, two, and three, selected with the CRAY_ACC_DEBUG environment variable. Level one shows you all of the transfers and all of the kernel executions: where they come from, which routine and which line number, and then the syncs. So this is just, you know, transferring 29 items, etc. And now notice that it says it transfers that, but there are zero bytes transferred. That is because the analysis said to do the transfer, but the compiler recognized that those arrays were already resident where they should be.
This is level two, and now it gives you the actual array name and the size of the array that's being transferred. So you have this coming out on every rank, and if you hit a problem on the accelerator, you know exactly where it is, because the output stops, and you can see that it probably died in a kernel.
It shows you exactly where that is. Now, level three is the kitchen sink, which is extremely useful, because it really tells you everything: it tells you the size, it gives you the host and the accelerator pointers, and it gives you the striding.
The compiler keeps a present table, and in the present table is the region of data that represents each data element. This is why you can equivalence variables and the runtime will still recognize that the data might, in fact, still be on the accelerator. This is just so valuable; it's extremely useful.
Perftools is excellent for identifying issues in existing applications and for improving threading, vectorization, and scalar optimization. I've been using it for 20 years; I am the biggest user.
If you ever have any problems with perftools, ask me. Reveal can really help with scoping variables. In Himeno it didn't even need any help from the user, but on something more complicated it's going to come back and tell you: I don't know what to do with this variable. Then the user can say: I want to make that private, or it should be shared. Now, I don't recommend using Reveal on C++.
There are just too many issues with C++. Now, moving to the GPU is difficult; however, you can do it in steps that are more manageable, and perftools identifies the bottlenecks for you very quickly. And finally, GPU-direct is the best way to do message passing, and this is even getting better, because on Frontier the MPI can actually run on the accelerator, so you don't even have to have the host invoke the MPI call.
There are three cases in that which I don't have time to go into, because I'm out of time, and that is my last line.