National Energy Research Scientific Computing Center (NERSC) NVidia HPC SDK Training, Jan 2022, 12 Jan 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: 8. Introduction to OpenACC -- Brent Leback

Description

Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/

A

uh Again, my name is brent levac. If you were here yesterday, uh you had my introduction, so I'm just a member of the uh nvidia hpc sdk team. I do a lot of customer support, 10 hackathons and help out wherever I can so uh today we're going to talk about open, acc first and then I'm going to also talk about openmp. These are two directive based models and we'll talk about the similarities and the differences between the two models.

A

uh Open acc resources so there's a website. Openacc.Org has a lot of the resources like the specification uh where to find literature on it where to find maybe some blogs or success stories, upcoming events. So uh through nvidia we have a lot of hackathons over the course of the year. I'm uh currently involved with one at georgia, tech I've been involved with a lot of the hackathons over the years and the hackathons are open to people uh using, of course, open acc also openmp just cuda.

A

I don't know that we've had anybody come to the hackathon just to work on standard par yet, but uh uh that should happen this year. I would think so.

A

Let's continue on here basic syntactic concepts, so both openmp and openacc use directives. We, I guess in fortran the the correct name, is directive in c. Maybe the it's also called a directive, but it's found pragma. uh So you say: pound pregnant acc directive name or in in fortran. Most fortran that I've come across now is is free form.

A

So you say: bang dollar acc directive, so the directives in openacc are roughly divided into two groups: uh compute directives, we'll talk a lot about a lot about those and data management directives and there's also a runtime api, so that's typically used for things like device, selection and other less used functionality.

A

So the point of a directive based model is, you can add, directives to your source code and it uh will do the desired behavior if the compiler recognizes those directives.

A

Now, if you use a compiler that doesn't recognize those directives either it doesn't have that capability or you don't turn the the compiler options on to recognize those directives. You should still have your original sequential code, which will work the way it always so that's both powerful and uh has a few drawbacks. I have to warn people. I have been involved many times at hackathons in in my own work, where I have misspelled the directive.

A

I've said bang acc or bang bang acc that we've just you know, created a typo when we type things in, and you will probably do that yourself. If you use directives a lot and then it just becomes a comment and I've been at a hackathon where we've wasted like half a day, because we had a typo in the comment, and so the directive wasn't doing what we thought it should be doing. It was just a comment, so so you got to take the good with the bad a little bit.

A

uh The figure on the right, of course, is the typical uh view of accelerated code. So you know ninety percent of your application code deals with setup file, io some serial processing and things, but that the idea is usually that ten percent of your code is where you spend ninety percent of your time.

A

So you need to put the directives around those compute intensive functions to offload that bit of code to run. You know multiple times faster, hopefully on a gpu, and you still run the uh rest of the sequential code on the seat.

A

So I'm going to kind of uh keep with the uh the design of the two-day talks, we'll start with the highest level. First, so the highest level compute construct in open acc is the kernels construct and there have been many successful ports of applications that only use acc kernels and what they do. Is they just kind of find the the section of code and they put acc kernels at the beginning and the kernels at the end.

A

So what this does is expresses that this region of code may contain parallelism and it's up to the compiler to determine what can safely be parallelized and offloaded to the gpu. So, depending on the compiler, you may have, you know, begin kernels and kernels and you can have multiple sections or multiple loops inside of there and the compiler is free to identify the number of parallel loops in there and and generate multiple kernels for offload. In this case it could generate two kernels or it could.

A

You know, fuse the two loops and generate one kernel: it's not specified by the uh the open, acc spec, uh how many kernels to generate here. It's up to the compiler's capabilities and the compiler analysis to make that decision. For you and, like I said many people are comfortable with this, and it's worked in a lot of uh major applications.

A

The other uh high level uh compute construct is the parallel loop directly. So this is a little bit more designed after the opening mp parallel loop, and that is it's a little more specific and it does not give the compiler as much leeway in how it generates the kernels.

A

So you basically say pragma acc, parallel or pragma, acc parallel or fragment, acc loop, and so uh just like in openmp. Acc parallel just starts a kernel region, and then you uh put the work sharing loops inside or the upper example. Acc parallel loop is a combined directive, both parallel and loop inside the same directive, and you can do it either way they are equivalent, if you say, acc, parallel and you're in fortran, you need to have an end parallel and the same in c.

A

If you say pragma acc parallel by itself, then you use curly braces to delimit the parallel region.

A

So again. Actually this was, if you go back to the beginning of open acc, the parallel region was proposed by cray and that's how they were doing.

A

Gpu offload the colonel's region was proposed by pgi and that's how we were doing uh gpu offload, and so when we did, they designed openacc. They decided to use both forms and that's why we have two forms of the compute construct today.

A

So open acc exposes three levels of parallelization and the terms they use are gang worker and vector, so you can use those uh terms on loops and, as we'll show later, there are other uses for these.

A

uh If you are familiar with cuda, the gang level is basically the blocks in the grid and the vector level is basically the threads in a block.

A

Now there is a middle layer called the worker level of parallelism, it's not used very often, but it is there and what our compiler, how our compiler treats that is. If you have worker level parallelism.

A

Typically, the worker will be another dimension within the thread block. The vector dimension will be in cuda terms, a single warp.

A

So if you have worker and vector parallelism, uh it's almost assured that the vector length will be 32 and that will be 30 threads 32 threads that make up a cuda warp and the worker level parallelism will be multiple warps in a thread block, so you can specify loops that you want scheduled, particularly in a certain way, so I could have in the upper example I could have the outer loop loop gang.

A

I could have the middle loop loop worker, I could have the innermost loop loop vector and that's a common way to do it another way that I've ended up using more and more. The more experience I get is just the bottom block. There, acc parallel loop collapse, three and I let the compiler uh compute the the.

A

What am I trying to say, the decomposition of the three loops inside the kernel and I'll show a little bit more example of that later. So you can put gang and vector on that and it will use both uh the blocks in a grid and the threads in a block or running that, and you just kind of end up with the default configuration that the compiler chooses.

A

There are ways to adjust the gang supporters and vectors, and this you can use this when you, when you know a little bit more about the loop bounds than the compiler.

A

um So there are, you know, heuristics that our compiler uses that have been developed over the years to find you know, typically good numbers for a number of gangs based on the the trip loop count.

A

If it knows it and the vector length and the a funny anecdote is over the years, it just kind of turns out that if you have gangs and vectors the vector length of 128 is almost always the best so trade secret there, uh we've we've tried lots of different things and lots of flexibility, and it just always seems to be 128 is a good number.

A

So, uh but if you, if you know something like you know, my vector dimension of my arrays is not 128. It's only 30 or 32, or you know a short number. You can force the vector length to be smaller, so vector length. Q use a vector length of q for this parallel region. Of course you can change this for every different kernel, which is nice.

A

You don't have to have one setting over the whole program, so this gives you some flexibility and it's a little more like cuda, in that you can have different launch bounds for every different kernel. So we talked about yesterday a little bit about do concurrent and some of the limitations of do concurrent and one of the ones that I pointed out was you don't have any control over the launch configuration in do concurrent currently and so open, acc and openmp is, will show, give you uh some flexibility and control over the launch.

A

Configuration of the kernels.

A

You do not always want to work share a loop in your kernels, so sometimes you want every thread to run the entire loop uh you want well. This is often the case in you know multi-dimensional loop uh constructs. So if you have a four or five I've, even seen, seven deep nested do loops and fortran some of the loops.

A

You just want every thread to run sequential, so you can force that using uh acc parallel by saying acc loop seq for sequential, so you can still you know, work share outermost loops and then run sequentially an innermost loop or you can. You can actually run sequentially, not the innermost loop. You can. The acc loop sequential can occur anywhere in the loop structure or above any for loop.

A

One recommendation for you is that don't put acc loop sequential on the loops that have arrays with the index being the leading dimension of your array. So yesterday I talked about the most important thing for performance on a gpu was at the vector dimension, use the vector index as the leading dimension of most of your array accesses. So in this example, maybe it's not great, because the array accesses are pretty mixed up. There's I j I k and k j.

A

So so it's not a hundred percent clear which one should be the vector dimension, but in general that's not always the case lots of times every array is accessed the same way.

A

This last note, sometimes the m info messages will tell you that the compiler found a dependency in one of the loops and forced it to be sequ, and sometimes even if you mark the loop as seq, the compiler m info will still tell you that, which is kind of an annoyance to me. But uh again, we've stressed this many times. Look at the m info output and verify that what the compiler is doing, syncs with what you think should be happening.

A

Both open, acc and openmp as I'll show later have a collapse clause and I've found this more and more useful over the years.

A

So it's useful when the loop extents are short or there are more loops than levels. So you know I mentioned you could have a four or five six dimensional, you know do loops or four loops. You only have three levels of parallelism, but maybe some of the loops are short like you know, only one to four or zero to seven or something like that. So you know you don't want to dedicate a whole.

A

You know loop to one of your units of parallel is so you want to collapse a couple of those together. So if you have the code on the left, basically what the compiler does is takes the two loop bounds, both in in this case and creates a single loop from zero to n times n. And then, if you use I and j as indices into arrays, it does you know the correct arithmetic, either divide or mod or or both. I don't know. If I did that exactly right, but it's something like that.

A

To get the I and j values, and so you might think well, that's just a lot of overhead, but remember that these loops are loop shared. So every thread has to do this anyway right, so it doesn't create any more work than you would have to do anyway by the time that the the loops from zero to n minus one are mapped onto some number of threads or some number of blocks, so uh the work is done anyway. You shouldn't see a performance degradation for this little bit of extra arithmetic.

A

Calling user routines and device codes, so this uh is a nice feature of open acc. It's a little bit different than openmp and where you start here is in the lower right hand, corner acc parallel.

A

So I'm going to specify the number of gangs and the vector length here and I'm going to call a function or 17 g, and so you can see that when you compile program main the compiler doesn't know anything about.

A

What's in fg I mean it may, if it's in the same file and you're or you're doing some type of inter-procedural optimization, uh but I'm I'm basically just calling fg and so fg could be marked as a routine gang and routine gangs can have loop gangs within them. So fg calls fv.

A

So fv is a function, that's our routine, vector and so vector routines can have vector loops inside of them and then uh inside the vector loop. I call fs, which is another function, but it's a sequential function and it just returns some function of a so you can have pretty deeply nested call chains in open acc and the each call in the this is a fortran example. In the fortran sense, the call has to be explicit and you need to know the interface of the routine you're calling.

A

So maybe this example isn't quite right that f, g, f, fv and fs need to be either in an interface block or in a module. That program main can see so open acc makes it a little easier on the compiler vendor in that, when you compile and call routine fg, you know what level in the parallelism tree or hierarchy that each routine is in. Is it a gang level routine? Is it a vector level routine, or is it a sequential level there's also worker again we don't use worker very often.

A

I think I've talked a little bit already about reduction clauses, so I'm not going to spend too much time on this slide uh and we'll talk about it in openmp2 they're almost and in fact they may be identical between open, acc and openm, so uh takes many values and reduces them into a single value.

A

The way that the compiler and runtime do this under the hood is the subject of a lot of uh experimentation, research churn and we're fixing it all the time we almost have like a full-time compiler person that just works on reductions. We'll talk a little bit more about that later. But it's it's a very important part of the gpu code generator.

A

So each thread calculates its parts. uh The reduction doesn't have to just be a high level reduction over the whole kernel. You can have reductions uh within a gang over a vector or over a worker, or you know. If it's, if it's a sequential loop, then it's not really a reduction, because every thread is computing, its sum, but the compiler will insert code uh to produce the single result uh at whatever level you need and if you've ever written a reduction in cuda. You know just how nice of a feature this is.

A

This was a big step forward when we were, I was playing around with cuda fortran in the early days.

A

So these are the the reduction operators they're, just what you would expect and- and so I'm gonna go over- that quickly. Now we're gonna move on to data regions.

A

The data construct defines a region of code in which gpu arrays remain on the gpu and are shared so you can have. This is an example of a structured data region. So a structured data region exists all within the same function or subroutine in fortran, there's a definite beginning and end in the same program unit and within that uh the data within the data directives will remain on the gpu uh the entire time uh it.

A

The structured data region does give the compiler a little bit more information, because the data region is in the same function and sometimes we can take advantage of that and do things a little nicer and we'll talk about that later.

A

A

Data clauses so all copy and create clauses behave as their present or variants. A little bit of history open, acc 1.0. We did not have data regions at all.

A

At every kernel we copied the data over, ran the kernel and copied the data back. uh Then later we decided well the same style of directives that we use to generate kernels. We can use to generate data motion or data movement, so we created things called. You know copy in copy out all these and then about a year later we decided well when we port code a lot of times.

A

We put the data directives just around the kernels and then we move one subroutine higher, and then we put data directives there and then we move another subroutine higher and put data directives there and that's kind of the the method of tuning for performance is to move the data movement higher and higher in your program. So it's you know resident over all the time steps.

A

For instance, we talked about yesterday, so it occurred to us that really we don't want to have to remove the data directives, but we've always had this notion that the data directives are present or copy present or copy in present or copy out. So what that means is uh the open, acc, runtime and openmp.

A

I think in all compilers have this thing called the present table and when you reach a data directive, you look up in the present table and say: is this data already present on the gpu and if it is, the data directive basically becomes a no op.

A

So you don't suffer the performance penalty of copying in at every level of your call tree. You only do it at the highest level and then you just have a tiny little performance hit just to check the present table, which is just a you know, reading a little bit of from the cpu memory. So it's not a very expensive operation at all.

A

So this trips up a lot of people and in fact, just last week I was working with a person on a code and they had a c plus code where they had the data directives in a class, and they had the you know.

A

The class constructor started the data region and then the class destructor ended the data region, and then they had other data directives sprinkled through their code, and it was a present or problem that the data was not getting updated correctly because they thought the data was getting updated when it was actually already present.

A

So just a little history there and something to watch out for that. These data clauses are present for variants. In fact, our compiler probably still accepts the syntax present or copy, uh which was shortened to p copy, and then it just became copy is the same as so, and I digress there a little bit but copy of list copies both in and out copy in, of course, only copies in copy out only copies out.

A

Sometimes you don't even need a copy of the gpu. It's just kind of a scratch space, while you're in the kernel, so that's create and open. Htc has something called present where it just checks, uh and so it's assumed that there's a data directive outside of of this data directive. So I'm just checking that the data is already present.

A

I can assume it is if it's not we're going to get an error from the open, acc run.

A

Unstructured data regions, uh so, as I mentioned before, structured data regions, the beginning end, has to be in the same program. Unit uh worked great for a while, but then you know there's complicated codes where all the data is created in the emit function and all the data is torn down in the file function and stuff like that. So you wanted the data directives where the data was initialized, so these are unstructured.

A

You can have enter data uh at any point in the in your program. You can have exit data at any point. They don't need to be in the same program unit.

A

It only makes sense when you have an enter data to use copy in or create, and it only makes sense when you leave a data region, an unstructured data region to have copy out or delete.

A

So this is not a real complicated example, but it just at least gives you the syntax.

A

So uh the unstructured data regions have this notion of dynamic data a lot of times, I'm not going to go through this too much.

A

The one thing I wanted to point out here is that there is another form of data directives that I didn't touch on anywhere else and that's the acc declare create data directive and what this is useful for is two cases. One is for global data, so in fortran, if you have data in a module, that's global data see it's like data at the global level and also it can be used for, like local arrays, within a function.

A

So in fortran, like automatic arrays that have the lifetime of the function, so they have a dynamic lifetime, which is how long that unit is in scope.

A

So it's easiest to think about in a subroutine, you have an automatic array in there that lives. As long as the subroutine is in scope, a compiler will generate a gpu copy of that for you upon subroutine entry and tear it down upon exit. If you put that array and declare create.

A

So there's the present or- and so you might say- well- that's kind of annoying. I really need to update the gpu or cpu copy of the data, so there's the update.

A

So this is actually a always a actionable uh directive, so it always updates it's not a present for operation. So a lot of times you have a you know. Your data is really high out, but you need to get that data onto the cpu or update the values on the gpu and typically or one common example. Of this is say you need to do mpi communication, so uh you know I need to update the host version of x in the code down below right before I do. An mpi send.

A

So, that's why you would use the update clause so the the spec itself they changed, host to be update self. uh I find that kind of annoying, so I still use update device and update host, but host and self are, for the most part, the same thing there's a notion in the spec that you could do an update on the gpu, so self would be device, but I don't think in practice. I've ever seen that.

A

So again, here's the uh uh just the syntax of this no big uh surprises here. One thing that you may not be aware of is that the data internet clause does not need to be contiguous.

A

uh This is nice and fortunate. You can give it like a slice of an array. You can update the internal interior of a two-dimensional array, uh something like that.

A

It implies that the data must be present, so you will get a runtime error. If you try to update something that is not present, it does not do a create or any type of create before copy.

A

Array shaping uh so sometimes the compiler knows the shape of your arrays. Sometimes it doesn't. You know in c c, plus plus. Sometimes you just pass a pointer and the shape is sort of just up to the programmer to uh take care of in fortran. You always just pass a reference, but the dimensions sometimes can be different in uh for transparent than they are at the high level right they're still, uh you can just say star as the shape of the array.

A

So if you suspect something is wrong here, examine the m info output and the m info tells you what it assumes uh the uh sizes of the arrays are. uh If you want to be explicit, you can put the shape on your copy in and copy out directives, but note there is a little difference between c and c, plus, plus and fortran.

A

So we took the lead from it was an intel product intel had some definition of c array syntax and their definition that they had was starting element colon number of elements. So it seemed to be. You know, kind of the way that c and c plus plus views uh uh aggregates.

A

I guess is the right word, uh but that's not the fortran syntax, the fortran syntax for slices, which is part of the language and has been since fortran 90 is beginning element or starting ending starting element, colon ending element so 1 to n it's a little easier because 410 begins numbering at 1..

A

So be aware of that, when you need to be explicit about your copy in and copy out these, uh you know, data clauses can be on data parallel, kernels directives, uh the copy in and data copy out and associated things can be occur on a lot of directives.

A

Oh, and one thing I was going to mention- is we made a recent change probably two releases ago that we used to support copy in a of n and we would treat n as one colon in and we got into some issues with that, where people wanted only to copy in a single element, a of n.

A

So about two releases ago we made a change to require. You know the colon notation. Otherwise we will just copy in one element and we have a way to with a flag. I have to look and see what it is, but it's like implicit sections or something like that to go back to the old behavior, but that is really non-standard behavior.

A

So I recommend highly that you use the colon in your array shaping.

A

uh So I think I'll have this slide in both my opening htc and openmp uh box. It's a useful slide. uh I refer to it a lot, so I'm glad that I wrote it. It's the open, acc data directives on the left, openmp data directives on the right. They are almost identical other than you know, slight syntax changes. In fact our compiler uses the same exact runtime.

A

As for data management between openhcc and openmp, there is so much similarity that we hardly need any special code to handle the two.

A

Let's see six minutes, five more slides, um a few other important things to talk about in open acc. uh Almost all applications use google libraries. So how do I use openacc to do my data managements but still pass the base pointers into the google libraries? So uh I have a fortran example on the left.

A

A c example on the right: it's the same basic uh syntax. You use bang dollar acc, host data use device and give it the array that you want to pass the device array, the device pointer within that block.

A

So when I call kurand generate here, g is the kuran generator, so that still lives on the host, but I'm filling a device array from kurand, and so because a is in a data region here that data copy up above the compiler can know for the host version of a what is the associated device pointer and I'll pass that into korean generator.

A

So uh we use this all the time for calling kurand, kublaz, ku, fft and.

A

If you use any of those in your hpc application, you'll get to know these directives pretty well.

A

A

Asynchronous behavior hues and streams.

A

So again I have a fortran on the left uh c on the right and it's done pretty much the same way. This is probably almost the entire talk in itself. uh In my introduction, I just want to mention uh that open acc uses cues and the queues are numbered. So you just kind of pick an arbitrary number. You know on the left. I picked 10 and that's my hue number that I'm using in open acc. So uh I can do update device async 10, I can do kernels async 10.

A

I can do update host async 10 and that's just an open, acc queue, but in open acc there is a one to one correspondence between the q number and a cuda stream, and I assume max later today when he talks about cuda, we'll talk a little bit more about cuda streams, but it is sort of a delimiter or a queue that have dependencies within the stream, but multiple streams can run somewhat independently.

A

So the call you want to really focus on here is when we set the qft set stream. There's a api function called acc get cuda stream, which you give it the cuda. I mean the open, acc async number and it returns the cuda stream, and this cuda stream is a 64-bit handle. Basically, we treat it in fortran as a 64-bit.

A

uh On the right again is uh c same general idea: I used 12 there, so you you can just kind of arbitrarily pick integers what is mostly done in openhcc for your async numbers.

A

uh Openacc does allow you to uh tune your code a little bit like cuda, and it has something called the cache directive which I'm showing on the right here. I haven't really talked about private, but uh it's kind of a common notion among openmp and openacc that you can have private data at the gang level or at the vector level.

A

At the gang level, there will be one copy of that for uh block thread block in cuda and that's very convenient, because cuda has shared memory that can be used across. You know all the threads in a thread block. So in this case on the right, the tile array is actually put into cuda's shared memory and it is accessible very quickly and with high performance.

A

So uh the way on the fortran cuda fortran on the left, where you declare something as shared, uh and then this is just cuda fortran on the left on the right, you can do almost exactly the same operation uh using the tile array in vector loops that are collapsed twice once to fill the tile and then the second loop to read from the tile.

A

This is the final slide: hpc gpu compilers commonly used options. I've talked and we've discussed some of these already. You can explicitly use managed memory or no managed memory. uh We talked yesterday that.

A

Lots of people find this pretty useful, um but there are some. You know, drawbacks of it, and sometimes it's performance issues max reg count is, is kind of a code generator uh flag. You know there are limited number of registers. You can kind of force the compiler to use less registers and spill. Those more gpu equals line info is is useful if you're needing to use some debugging or diagnostics.

A

I think max yesterday talked about compute sanitizer, so you get a little bit more information out of the compute sanitizer. If you have some, you know it's called dwarf information or line information in your executable gpu equals compute capability. Xy uh only generate code for a particular compute capability. If you're just running on perlmutter, you really only need to generate. You know, code for ampere, which is tc80.

A

And finally, we added a feature: maybe a year or two ago called auto, compare well. Actually it's called pcast and pcast stands for parallel. Compiler assisted software testing, so you can compile with gpu equals auto compare and it will automatically run your kernels both on the cpu and the gpu, and then compare the results at the end, and you can have some control over the comparison, whether it's exact, compare or provided tolerance, and you can find bugs that way.

A

Maybe things that were not, you know updated correctly between the host or gpu, or you know, real compiler bugs or something so auto. Compare is a big hammer. Does it everywhere?

A

It turns on gpu equals redundant, and so we have a section in our uh user notes about there's an actually an api. So you can turn on this redundant uh execution on cpu and gpu on a section by section basis and then just specify which arrays do I want to compare between cpu and gpu.

A

So that's it for openh.