From YouTube: 7. Hands on demo: StdPar and Nsight -- Max Katz
Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
There are directories for tomorrow, covering OpenACC and CUDA; for today we just have the directory on standard language parallelism. If you look at the README, it shows we've given you two source files to look at, one in Fortran and one in C++. I recommend doing both, even if you only use one of those two languages, because there aren't really any code exercises, just practice with compiling and running.
We've given you a brief README which explains what's going on, what the prerequisites are for getting it running, and then a set of little exercises just to get practice compiling and running the code for both the Fortran and C++ cases. So I recommend going through these. Whether you're running on Perlmutter or on Summit, we've hopefully given you enough instructions for compiling and running, and in a little bit I will come back and say a little more about these exercises.
The Fortran example solves a linear system, Ax = b. It creates a matrix A and a vector b, and it's going to solve the system using the standard LAPACK operations: first an LU factorization of the matrix, and then the solve using the factored matrix A. It first initializes the matrix A with some random numbers.
It then fills in a right-hand side b, does the factorization and then the solve, and at the end it checks to make sure that we actually got the right answer, which we should. We can know, because we know what A and b are and we know the relationship between them, so we should be able to do a simple sanity check to make sure we got the right thing.
So the call to random_number that fills in the matrix A, the do concurrent operation that modifies the matrix A and fills in b, and those LAPACK operations can all be done on the GPU, with the support that the NVIDIA compiler has.
In order to do that: I'm doing this example at NERSC right now, but it's similar to how you'd do it at Oak Ridge. You can see what the requirements are for the NERSC environment. We want you to have PrgEnv-nvidia loaded on Perlmutter, which is part of the default environment, and we also want you to do module load cudatoolkit, which is not part of the default environment but is just one easy line to do. So you can see what my module environment looks like here.
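A sketch of that environment setup on Perlmutter, with the module names as given here (versions and defaults may differ by system):

    module load cudatoolkit   # PrgEnv-nvidia is already in the default environment
    module list               # check what the module environment looks like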
So this is basically the default environment, but with the additional cudatoolkit module loaded. Now, the first exercise is just to compile and run the code. The only part of the code that is not immediately standard Fortran is the calls to the LAPACK APIs, dgetrf and dgetrs. In order to satisfy those, you'll just add -lblas; if you do that, you'll use the BLAS library that ships with the NVIDIA Fortran compiler.
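A sketch of that CPU build (the source file name is a stand-in; copy the exact command from the README, which may also link -llapack, since dgetrf and dgetrs are LAPACK routines):

    nvfortran -o test_dgetrf_cpu test_dgetrf.f90 -lblas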
However, you can of course use other BLAS libraries on the system. So if you wanted to use MKL or some other BLAS library, you could absolutely do that. Either way, you get this test_dgetrf_cpu executable. Now, we've given you a sample submit script that you can use to run it; you can take a look at it. We have one for both NERSC and Oak Ridge.
Hopefully that will run pretty quickly, because it's a quick job and also it's part of this reservation. As usual, we'll get a Slurm output file which will record the output, so I can cat that. All this code does is print out either "Test passed" or "Test failed"; "Test passed" indicates that we got the expected answer from the linear system solve.
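The run step, as a sketch (the submit script name is a stand-in for the samples in the exercise directory):

    sbatch submit.sh        # sample NERSC or Oak Ridge submit script
    cat slurm-<jobid>.out   # should print "Test passed"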
Now for the GPU build: first of all, give it a different name to indicate that we're now doing the GPU build, and then we're going to need to add a couple of things. First, we need to add -stdpar. -stdpar means we want to take all the standard language constructs, in this case do concurrent, and run them on the GPU.
We're also going to want to say that we want to pick up the linear algebra and run it on the GPU. Remember, Brent said that's nvlamath, so that looks like -gpu=nvlamath. That requires the CUDA 11.4 back end, so I'm doing -gpu=nvlamath,cuda11.4. And then the last thing we do is tell it we want the CUDA libraries, with -cudalib, so that we pull in any CUDA libraries that would be relevant for this GPU support. In particular,
we're going to want the math libraries as well as the random number generator, cuRAND, in order to do that random_number call on the GPU. So that should be enough. As you can see, we had to turn on a number of options, but that's all we had to do: modify the compiler flags in order to get this example to run, and hopefully now run on the GPU.
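Putting those flags together, the GPU build looks roughly like this (file names are stand-ins; the README has the exact command):

    nvfortran -stdpar -gpu=nvlamath,cuda11.4 -cudalib -o test_dgetrf_gpu test_dgetrf.f90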
Okay, so if I cat my new output file: great, test passed. So the GPU build worked. However, it would be really nice if we could verify that this thing actually ran on the GPU. How do I know that it wasn't just falling back to the CPU, that I didn't perhaps make some mistake when I compiled and ran it? How do I know it was actually using the GPU?
We can also give it a name, so I'll just call this one test_dgetrf_gpu, and it'll automatically append the right file extension, so that we know this is the profile corresponding to that particular executable. So that's what I would do in order to collect a profile of this application.
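As a sketch, the collection command being described is Nsight Systems' command-line profiler (the exact launcher prefix depends on your system; --stats=true is the option exercise 3 asks for, mentioned below):

    nsys profile -o test_dgetrf_gpu --stats=true ./test_dgetrf_gpu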
So even though we didn't write any CUDA in this example, CUDA is being used under the hood by the compiler in order to do all the work, and you can see all of the CUDA operations that had to occur on the CPU in order to support the GPU workload, if that's something you're interested in. Then we have the CUDA kernel statistics. Kernels are the actual GPU compute work that occurs, and the output is a little bit too long, as you can see.
So it wraps over to a second row, but in each row you get the percentage of total GPU time spent in each kernel, as well as the name of the kernel, and then some other statistics: how much time was spent in that kernel in total, how many calls were made to the kernel, and the average time spent in each call of the kernel.
The name of the kernel in this case is generated by the compiler or the library, and it won't always be super intuitive to you. So, for example, this test_dgetrf_17_gpu is actually fairly informative: it's telling me that it's happening in my program test_dgetrf, on line 17, and that this is the GPU code that's being generated.
If I look at line 17 of my file, that is this do concurrent loop. So it's telling me that it's generating a GPU kernel corresponding to that do concurrent, and then a whole bunch of other kernels, with names that are less informative to you, that are generated by the linear algebra calls. So you can see that actually a fair amount of work is being generated to support this linear system solve; you don't have to worry about all of that code generation.
Somebody asked in chat about the fact that sometimes the names of the kernels get cut off if they're very long. Unfortunately, there's really nothing I can tell you to do about that in this standard-out summary. However, you can get the full name of the kernel when you actually open this profile in the user interface, which I'll show you next.
It tells you what happened, but seeing it in a timeline is much more powerful than just getting a standard-out text summary of what goes on. Okay, so at the end of the output we hopefully got information about the name of the file that was saved on the file system. It has the name I was asking for, test_dgetrf_gpu, and it has this .qdrep file extension, which is the native report format of Nsight Systems.
As I mentioned, this was renamed in the very most recent release of the tool, but it works the same way. So I'm going to go ahead and copy the name of that file, then open up a terminal, and scp that file from Perlmutter to my local system. I can just use standard scp for that: I give perlmutter, a colon, and the path to the file, and it copies it to some location on my local computer.
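That copy, as a sketch ("perlmutter" stands for your login host or ssh alias; the remote path is whatever was printed at the end of the profiling run):

    scp perlmutter:<path-to-report>/test_dgetrf_gpu.qdrep .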
Steve asks: can I repeat the command to build to enable profiling? You don't have to change any flags in the build in order to turn on profiling; those are separate activities. The GPU compilation flags are in the README here, so you can copy them: -stdpar, -gpu=nvlamath,cuda11.4, -cudalib. These are all necessary for turning on GPU support. And then the text for exercise 3 says what you need to do, with --stats=true, in order to collect the profile.
Okay, so I have copied this report file from Perlmutter to my local system. I already have the Nsight Systems user interface up, because I was showing you an example report before. What I need to do is go to File and then Open, and locate this file that I just downloaded on my file system.
The GPU activities are in this CUDA hardware row, and then all of the runtime API calls, for example the CUDA calls that I was showing you that orchestrate all of this work, are here in this row. Additionally, you can see the load on all the CPU cores that are being used, if you want to; this can be useful for understanding when and where there was any load on the CPU cores.
What I really want you to pay attention to is the CUDA hardware row, because this shows you all of the actual compute and memory work that happened on the GPU. Everywhere the row is blank, nothing was happening on the GPU, and anywhere there's any color in the row is where things were happening.
The timeline runs from the beginning at zero seconds to the end at something after six seconds, so the GPU work is actually constrained to a fairly small chunk of the timeline.
This first bit here is going to be the call to do concurrent that initializes the data. If I zoom in really far, you can see that the kernel being run here is test_dgetrf_17_gpu, so this is that do concurrent loop on line 17 of the code. By the way, to zoom in and out you hold down Control (I think it might be Command on a Mac) and then use your mouse scroll wheel if you have one, or a pinch-and-zoom motion if you're using a touchpad. I have to zoom all the way out in order to see, or I can just right-click and do Reset Zoom. Zooming all the way back out, I can see that the GPU activity that actually does the linear system solve happens at the very end of the run.
So it's constrained to a fairly narrow chunk of the timeline, and I could see the names of these kernels if I wanted to. They're not going to be super useful to you, because these are the individual kernels run by the linear algebra library, so the exact names aren't relevant, but the fact that you can see names like getrf indicates that this is the work corresponding to the linear system solve.
So if I reset, what I see is that only a very small chunk of this timeline is actually using the GPU; almost all of the time is spent setting up data or handles. This is pretty characteristic of GPUs: setting up work on the GPU is fairly expensive, initializing the GPU is expensive, allocating memory is expensive, and so, if you don't have a lot of work to do, you may be killed by these initialization costs.
So the last exercise, and I won't show you this, but I recommend you do it on your own: make this a bigger problem and then see if a longer chunk of the timeline is spent on the GPU. You might even be able to ask yourself the question: can I make this problem big enough to effectively amortize out the cost of the initialization?
Now, this particular example wasn't set up to show you excellent performance, and there are ways to write code that mitigates this behavior, but it is worth knowing that, in general, initializing data and GPU state is expensive. So you want to reuse as much memory as you can, and typically that works out to something like running a code that launches a large number of iterations or time steps, or something like that.
That way you can amortize out that initialization cost. So yeah, try making the problem bigger and see if that affects the shape of the profile.
We've also got a C++ example that Matt set up, which does a std::transform and kind of resembles something he was showing you in his lecture. What it does is create two vectors x and y, which are just arrays of size n, in this case a million, initialize them to some data, and then do a*x + y in the context of the C++ parallel algorithms.
One way to implement that is with std::transform. With std::transform, you tell it the first vector: the first pointer or location in the array, and then how many pointer offsets later to stop. You give it the second vector as well, and then the last argument is the receptacle, the output for the data.
In this case we're basically doing an in-place update of y. Then you write a lambda which basically says: I want to return y + a*x, and that's the saxpy operation. And then there's just a check at the end. So you should also verify that you can compile and run the C++ example, and you can also practice collecting a profile with Nsight Systems to verify that the GPU work actually occurred on the GPU.
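As a rough sketch of what such a file could look like (a minimal version; the names, initial values, and final check here are my assumptions, not Matt's actual source):

    #include <algorithm>
    #include <cstddef>
    #include <execution>
    #include <vector>

    int main() {
        const std::size_t n = 1000000;   // "a million" elements
        const double a = 2.0;
        std::vector<double> x(n, 1.0), y(n, 2.0);

        // y = a*x + y, in place, via the C++ parallel algorithms
        std::transform(std::execution::par_unseq,
                       x.begin(), x.end(),   // first input range
                       y.begin(),            // second input range
                       y.begin(),            // output: in-place update of y
                       [=](double xl, double yl) { return yl + a * xl; });

        // sanity check at the end: 2.0*1.0 + 2.0 == 4.0 exactly
        return std::all_of(y.begin(), y.end(),
                           [](double v) { return v == 4.0; }) ? 0 : 1;
    }

Built with nvc++ -stdpar=gpu, a parallel algorithm like this can be offloaded to the GPU.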
Any other questions? Oh: is there an option to run some parts of the code on the CPU and some on the GPU? The short answer today is not really; we don't really support a mixed mode. Certainly, in the context of OpenMP, if you use non-target regions then you may be able to combine OpenMP host threading with GPU target regions running on the GPU. But in general,
we don't support something like having multiple C++ parallel regions, like std::transforms, with some of them running on the GPU and some of them running on the CPU. That's a little bit too challenging and tricky for us to implement, and also, honestly, there are very few circumstances where you would want to do that. So today we don't support that, but if you have a really compelling use case for why you'd like to mix those things, you can always reach out to us and we'll be happy to hear you out.
Yeah, there was a question in Slack: can I just go over one more time what this whole thing is? It may help to Google the API for std::transform to go along with it, but basically, let me go through these arguments one by one.
So the first argument is what we call the execution policy. The execution policy is basically telling the compiler some statement about the relationship between the iterations of the for loop; really, what it means is telling the compiler: how should I generate code to do this loop? That can either be done serially, so I run the iterations of the loop one by one.
If you think about the transform as really representing a for loop, like this for loop above from zero to n, then a serial execution policy would basically mean: generate code which looks like this for loop, so iteration zero is before iteration one, etc. But we can also give it parallel execution policies, and in particular the one that we're using here, par_unseq, means that it is both parallel and that there is no specified relationship between particular iterations, so it can do iteration one thousand before iteration zero, or after. We are explicitly telling the compiler that this is allowed.
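To make that concrete, here is a minimal illustration (my own sketch, not the exercise source); the call is identical except for the policy argument:

    #include <algorithm>
    #include <execution>
    #include <vector>

    // y = a*x + y with an explicit execution policy.
    void saxpy_seq(double a, const std::vector<double>& x, std::vector<double>& y) {
        // seq: behave like the plain for loop -- iteration 0 before iteration 1, ...
        std::transform(std::execution::seq,
                       x.begin(), x.end(), y.begin(), y.begin(),
                       [=](double xl, double yl) { return yl + a * xl; });
    }

    void saxpy_par(double a, const std::vector<double>& x, std::vector<double>& y) {
        // par_unseq: iterations may run in parallel, in any order; this is the
        // promise that lets the compiler offload the loop to the GPU.
        std::transform(std::execution::par_unseq,
                       x.begin(), x.end(), y.begin(), y.begin(),
                       [=](double xl, double yl) { return yl + a * xl; });
    }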
The second and third arguments are the beginning and end of a particular array or iterable. The next argument is a second input, because this form of std::transform is basically combining two pieces of information, and then the argument after that is the output for the data. What it's going to do is pick a pair of values from x and y, give them to you in scalar form, xl and yl, as read-only data, and then the return value of the lambda is what I want to do with that particular combination of data from x and y.
If I have some OpenMP directives in my code but I compile it with no OpenMP options, then no, the compiler should just ignore the OpenMP pragmas.