From YouTube: 12. Hands-on demo: OpenACC, OpenMP, and CUDA -- Max Katz
Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
This demo solves the Laplace equation from math or physics using a standard method called Jacobi relaxation, which involves iteratively solving the equation: we repeat the update some number of times until we reach a particular error tolerance threshold, and then we stop.
If we look at the Fortran version, the way this works is that we have some initial setup where we set the initial conditions on the array data.
We have an outer loop which says keep going until we reach some tolerance threshold or until we run out of maximum iterations, and then it does the main update for this algorithm: at every location in the two-dimensional array, it updates that value to one-quarter times the sum of the neighboring values in the 2D array, that is, (i+1, j), (i-1, j), (i, j-1), and (i, j+1).
Those are the four neighbors of (i, j), and the update sets that point equal to the average of the neighboring values. That is how you perform this particular algorithm, but it's okay if you aren't familiar with it; you can just look at it as code and ask the question: how would I accelerate this code?
It keeps doing that until we reach some error threshold. At every iteration of the loop, after we perform the update, we store the result in a separate array, so each iteration's update is based on the values from the previous iteration.
So we first do the update and store it in Anew, based on the old data in A, and then we swap A and Anew by doing a simple loop over the entire array that sets A equal to Anew. At the end we check that we reached the tolerance to within the threshold and that we get the right answer, which is a known quantity for this case, and then we stop.
Now, that is the pure serial Fortran code, and there's an equivalent bit of C code as well, here in laplace.c, that does the same thing. It initializes the data and then runs an outer while loop: while the error is greater than the tolerance, it performs the update, setting each new value equal to the average of the neighboring values in the array, and we stop when we reach the tolerance.
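For reference, here is a minimal C sketch of the serial scheme just described, in the spirit of laplace.c; the array names (A, Anew), the sizes (N, M), and the boundary/initial conditions are illustrative assumptions rather than the exact course code.

    #include <math.h>
    #include <stdio.h>

    #define N 512
    #define M 512

    static double A[N][M], Anew[N][M];

    int main(void) {
        const double tolerance = 0.01;   /* stop once the error drops below this */
        const int    iter_max  = 1000;   /* ... or after this many iterations */
        double error = 1.0;
        int iter = 0;

        for (int i = 0; i < N; i++) A[i][0] = 1.0;   /* illustrative boundary condition */

        while (error > tolerance && iter < iter_max) {
            error = 0.0;
            /* Jacobi update: each interior point becomes the average of its four
               neighbours, computed from the previous iteration's values in A. */
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < M - 1; j++) {
                    Anew[i][j] = 0.25 * (A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1]);
                    error = fmax(error, fabs(Anew[i][j] - A[i][j]));
                }
            /* Swap: copy the new values back into A for the next iteration. */
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < M - 1; j++)
                    A[i][j] = Anew[i][j];
            if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
            iter++;
        }
        return 0;
    }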
If I look around, I see I have my laplace.c in the class setup (alongside the Fortran laplace.f90). I can run nvc laplace.c, and that will compile the C version of the code; I can give it an executable name with -o laplace, and now I have a compiled version of the code. If I just execute that, you'll see that it performs this relaxation, this iterative algorithm.
After every 100 iterations of the algorithm, it prints out the current error, which is basically the difference between this iteration and the previous iteration, and we stop once we reach some tolerance threshold.
In this particular example we actually don't reach the tolerance before the end of the thousand-iteration maximum. So after a thousand iterations it has an error of about 0.0002, and that happens to be the correct answer, or at least the target answer, for this one.
That being said, of course, your result may differ within some floating-point rounding error, which will be pretty small but non-zero. So we are checking to within a certain tolerance, which is significantly larger than that floating-point round-off.
If I look at the code, porting to run on GPUs is often best thought of as an iterative process: you port a loop at a time, make sure that you get correct results, analyze the performance of that change, and keep going until as much of the calculation as possible is done on the GPU.
This is easiest to do in the directive-based models because they are incremental and non-destructive to your code. If I add something like omp target teams loop right here, the compiler will simply ignore it if it doesn't know how to deal with that directive, or if the compiler flag that turns on directive processing is not enabled, and I don't even have to change the structure of the loop in order to do this. So I can apply directives to the code one at a time.
I can add them without changing anything about the code itself or the way the algorithm works, and then see if I get the correct answer. That first loop is a little complicated, so I'm going to come back to it. Instead, I'm going to do it on this one, which is a two-dimensional loop (sketched below).
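As a sketch, reusing the illustrative names from the serial listing above, the directive sits right on top of that two-dimensional swap loop and the loop body itself is untouched:

    #pragma omp target teams loop
    for (int i = 1; i < N - 1; i++)        /* outer loop: parallelized across teams */
        for (int j = 1; j < M - 1; j++)    /* inner loop: parallelized across threads within a team */
            A[i][j] = Anew[i][j];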
I'm also going to want to turn on -Minfo=mp, which means the compiler should tell me about any places in the code where it generated OpenMP offload constructs corresponding to my directives, because I will often want to inspect whether the compiler actually did what I asked it to. The other thing I'll do is turn on -gpu=managed.
We talked about this a bit both today and yesterday. What it does is make all dynamic memory allocations, i.e. the ones that use allocate in the case of Fortran or malloc in the case of C, usable on either the CPU or the GPU (see the sketch below). I'll give it the executable name with -o laplace and then compile.
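As a rough illustration of what -gpu=managed buys you (a sketch with made-up names, not the course code): a plain malloc'd allocation can be touched inside a target region without any explicit data-movement clauses, because the allocation is placed in managed memory and migrated by the runtime.

    #include <stdlib.h>

    void zero_on_gpu(int n) {
        double *a = (double *)malloc(n * sizeof(double));
        /* With -mp=gpu -gpu=managed, this heap allocation can be accessed
           from both the host and the device. */
        #pragma omp target teams loop
        for (int i = 0; i < n; i++)
            a[i] = 0.0;
        free(a);
    }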
This -Minfo flag now gives us some information about what the compiler chose to do with my omp target teams loop statement. The first number in this line, 60, corresponds to the line of code where it found that directive. So let's double-check in the code: if I look at the bottom of my emacs window, you'll see that this does in fact correspond to line 60.
That's where my omp target teams loop is. Maybe there's a way to get this information out of vi, but good luck to those of you who use that. If I look at the information further down, it tells me even more about what the compiler chose to do based on the fact that the directive was in the code.
The first thing is that it says it is generating a GPU kernel. That's awesome: it means it was able to take the corresponding loop and turn it into a GPU kernel. It also gives me the compiler-generated name for this kernel. This will be useful later when we look at the Nsight Systems output, so we can verify that this kernel name corresponds to line 60 of the code, this particular for loop. "Generating Tesla code", by the way, just happens to be equivalent to saying we generated code targeting the NVIDIA GPU architecture.
Now, it also gives us even more nested information, which says how it chose to apply parallelism across the two levels of loops. It says that on line 61 it parallelized the loop across teams. Teams, we know, are the higher level of parallelism available to us in OpenMP. Then on line 62 it says it parallelized the loop across threads within a team, and it also tells us how many threads it chose to use per team, which is 128.
As Brent said, you don't have to think in this way; it's not required in order to generate code that runs and works on the GPU. But it is sometimes useful to at least understand that lower level of how things work, because that will be useful when you're looking at the performance of your code and thinking about whether it can be any better than it currently is.
It also tells you that it can generate multicore code because, as we've discussed, the NVIDIA compiler can do either GPU or multicore CPU parallelism for OpenMP target loops.
Finally, it is also telling you that it's generating an implicit map of A and Anew. What this is saying is that it recognizes that, for this code, A and Anew are not already present on the device. We talked about how both OpenMP and OpenACC have an explicit notion of GPU memory and CPU memory, and by default the way these models work, they expect you to map data (in the OpenMP context it's called a map) from the CPU to the GPU.
But if I use the -gpu=managed flag, then that's not necessary; one of the benefits of that flag is that all the memory management is handled for me. However, the semantics of OpenMP don't change: from OpenMP's perspective, if I'm accessing something on the device, there needs to be a map clause which says this data is available on the device. So the compiler is telling us that, in order to satisfy those semantics, it's generating a map statement.
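For comparison, here is a sketch of what the mapping would look like if you wrote it explicitly instead of relying on the implicit map (with statically sized arrays the whole array can simply be named in the clause; pointer-based arrays would need explicit array sections):

    #pragma omp target teams loop map(tofrom: A, Anew)
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < M - 1; j++)
            A[i][j] = Anew[i][j];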
What this does is record all the GPU activities, like we discussed yesterday, and print out any CUDA kernels that happened to run. In the CUDA kernel section you see that only a single kernel was used in this code, and its name is nvkernel_main__F1L60_1. This is exactly the same name that the compiler told us it was generating for that kernel.
There were a thousand invocations, and that makes sense, given that that code occurs in a loop that runs a thousand times. It also tells us the average time per call; it's listed in nanoseconds, so the numbers often tend to be a bit big, but you can see this ends up being about one millisecond per call.
I would find that it took about 3.7 seconds to run. Then, if I compile this without -mp=gpu, which basically turns off all the GPU support, I can rerun it and ask how much time that took.
It turns out the GPU code was arguably slower, or took about the same amount of time to run, as the CPU version. So that's not awesome; it suggests that we have a little bit of work to do. To see one way we could think about this, I'll go back and turn on the GPU flags now.
So here's the name of that .qdrep file, and I will advocate that there is nothing more informative when doing GPU porting work than actually looking at this profile in the user interface and seeing what happened; that really helps you understand what's going on. So what I'm going to do is copy that report file down to my local system.
Okay. Now, if I look at this profile, I see that the GPU was in use for the majority of the timeline. Let me make this a little bit bigger so it's hopefully easier to see: the GPU is in use for the majority of the timeline. This whole thing ran for about four seconds, which is pretty similar to what we said it took before, and for almost all of that time the GPU is being used.
We said yesterday that blue is when a kernel is running and red is when memory operations are occurring, so there are both memory operations and CUDA kernels happening the whole time. From a 10,000-foot level this is good: the GPU is being used. But let's now dig in a little deeper and understand this at a slightly finer level.
So now I've isolated a relatively small section of the timeline. It starts at about 1.7 seconds and ends about 15 milliseconds later, so it's a relatively short section of the timeline, and I can now see gaps in it. Blue is when CUDA kernels are running, which is GPU compute work, and when there's no blue on the timeline, that means no compute operations are happening at that time.
This is immediately a red flag to us. In almost every case, when you're running on GPUs, you're going to get the most performance out of the GPU if it is running essentially continuously; if there are big gaps in the timeline, you're not using the GPU, and that's likely to be an ineffective use of the platform that you're on. The bottom half of this row is memory operations; if I hover over any one of them, it tells me a little bit about that memory operation.
So if I look at what's going on here, I see a pattern repeated four times in this particular part of the timeline: a GPU kernel runs and, as part of that kernel operation, we transfer data from the CPU to the GPU; then there's a section of the timeline where we transfer data from the GPU back to the CPU and no GPU work occurs.
This is all part of the while loop, and this while loop has two components. It has the second loop, which we already moved to the GPU, which does the swap of the old and new arrays, and then it has this other loop, which is the one that performs the actual iterative update. Notice that I haven't yet ported this loop to the GPU, so the result is that it is going to run on the CPU. That has two consequences.
First, this loop is going to be a lot slower on the CPU than it would be on the GPU, because the CPU just doesn't have as many threads available to it; in fact, we're only running with a single thread on the CPU in this case. The other problem is that we're also transferring the data back and forth in order to support this operation.
That's going to make the CPU code slower, because it has to wait for the data to come back before it can operate on it. It's also indirectly making the GPU code slower, because that data constantly needs to come back from the CPU after this section of code in order to reach the GPU for the other section of code. So it turns out that it will actually be a lot faster overall to run both of these loops on the GPU.
This loop is a little bit more complicated than the previous one, the loop below it that we already did, and the main reason it's complicated is this second line. If I were to comment that line out and just look at the first line, this is pure compute, and every index (i, j) can be computed independently of every other index. It does depend on neighboring indices, but those are read from the old array, so each output element can still be written independently.
So I could fully parallelize over the i and j loops in this case and not have to worry about anything, and that would just be great; it would end up having the same kind of parallelism as the loop below and would be good. But it's actually a little bit more complicated, because I do have this line which computes a running tally of the error, which is the difference between the old array and the new array.
So we have to be cognizant of that and make sure that our loop construct knows that there is, in fact, a reduction operation occurring. In this case it is a max reduction, which calculates the maximum value of this error across the entire loop.
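A sketch of that update loop with the reduction spelled out, again using the illustrative names from earlier; the key addition is the reduction(max:error) clause on the directive:

    #pragma omp target teams loop reduction(max:error)
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < M - 1; j++) {
            Anew[i][j] = 0.25 * (A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1]);
            /* running tally of the largest change between the old and new values */
            error = fmax(error, fabs(Anew[i][j] - A[i][j]));
        }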
Now, if you were to forget this, the compiler might do something that you expect, or it might do something that you don't expect. So first, let's ask what happens if I leave this off. If I forgot that there was a reduction here, what kind of code would the compiler generate? We can answer that, because with the -Minfo flag we will actually get output from the compiler about what's going on.
Before I analyze this, there are a couple of questions in Slack that I should answer. One question was: can I explain the difference between copy and copyin, and why, in this solution code, do we use copy for one and copyin for the other? Copy means that I want to copy in the initial value of the data and also, at the very end of the data region, copy it out,
so I can reuse it after that. When would you use copy versus copyin? Well, you'd only use copyin if you don't need the value at the end of the data region, and you would use copy if you do want the data at the end of the region. So copyin might be useful for some data with an initial value which then goes to the GPU, stays there, and you never need the data back, while copy would be more appropriate for a case where you're bringing the data to the GPU, updating it, and bringing the result back.
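A small OpenACC sketch of that distinction (illustrative function and variable names, not the actual solution code): the read-only input only needs copyin, while the array that is updated and needed back on the host uses copy.

    void axpy(const double *x, double *y, int n, double alpha) {
        /* x: host -> device at region entry only; its device copy is discarded.
           y: host -> device at entry and device -> host at exit, so the host
              sees the updated values after the region ends. */
        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            #pragma acc parallel loop
            for (int i = 0; i < n; i++)
                y[i] += alpha * x[i];
        }
    }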
If I were to look at the section of the timeline which shows the CPU threads, and that's this row here: every so often Nsight Systems will periodically sample the CPU and ask it what's happening at that point, and if you had written code with a deep call stack, then at every sampling point it would show you what the call stack is, and that can be used indirectly to determine what the hotspots are in your application.
The other thing you can do, which I won't talk about today, is to use NVTX, the NVIDIA Tools Extension, which allows you to manually instrument the code with human-readable strings that call out sections of the source code. You would then see those timeline sections appear in the Nsight Systems profile. I will, in fact, often do that when I'm starting to port a code from the CPU for the first time.
Before I write a single line of GPU code, I often will write NVTX regions in the code, identify where the time is being spent, and then start there, because you're going to get the most bang for your time investment if you port the code where the most time is spent. In this example that's not super useful or relevant, because it's obvious just from looking at the code where the time would be spent, but I completely agree that in general you should use that kind of instrumentation to figure out where to invest your porting effort from the start.
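A minimal sketch of what that NVTX instrumentation looks like in C; the header path below is the NVTX v3 one shipped with recent CUDA toolkits and the HPC SDK, and the region names are of course made up:

    #include <nvtx3/nvToolsExt.h>

    void run_solver(void) {
        nvtxRangePushA("initialize");    /* label appears on the Nsight Systems timeline */
        /* ... set up initial conditions ... */
        nvtxRangePop();

        nvtxRangePushA("jacobi_solve");
        /* ... the iterative update loop ... */
        nvtxRangePop();
    }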
Okay, so getting back to the code: I was asking the question, what would happen if I forgot this reduction clause? It turns out that the compiler is actually pretty smart. Notice here, first of all, the omp target teams loop, and it says "generating implicit reduction(max:error)", so it actually recognizes that a reduction is occurring here. Does this actually give us the right thing? Well, let's double-check.
If I run my laplace code, it looks like it does indeed still get the correct answer, and the same one as before. Notice also that, subjectively, it took a lot less time. What I can do is collect a profile of this run and see how much time was actually spent, and it turns out that the kernel that is now on line 61 (it was on line 60 before, but we added a directive above it) now takes only six microseconds to run on each invocation.
The new kernel that we just added on line 49, which does the iterative update, takes 18 microseconds to run, but notice that that's strikingly smaller than the one millisecond it took to run the kernel before. So we have made this code a heck of a lot faster; even though we added more work to the GPU, we actually made it faster overall, because (a) we accelerated the first loop and (b) we hopefully got rid of the data transfer operations.
I'll then open that up in my Nsight Systems viewer (report 3). Okay, so now the whole thing runs in some 700 milliseconds, and if I look at just the GPU section of the timeline, it's condensed down to, how much time is this, 100 milliseconds? That's obviously a lot faster than we had before. If I zoom in to any particular section of the timeline, what I'm looking for is hopefully smaller gaps; we can see there are still gaps, so maybe this is not as optimal as it could be. How long are the gaps? Well, in the report...
Also, each of these kernel invocations is extremely fast: whereas before a kernel was taking on the order of a millisecond to run, now we're running kernels that run on the order of 10 microseconds, so we've made the code something like 100x faster in some sense. As you can see, though, the overall runtime of the application did not get 100x faster. It still took 0.7 seconds, whereas it took a total of about four seconds before.
So even though we made the compute section of the timeline vanishingly small, we didn't change the fact that there's actually still a fair amount of setup cost to running on a GPU; there's a whole half a second here before the GPU code even runs. We talked about this yesterday; that's just an unavoidable fact of life when using GPUs. But if you were to run a bigger problem, or run more iterations of this solver, for example, you would hopefully amortize that out, and the GPU would end up paying off.