From YouTube: 9. Introduction to OpenMP Offload
Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
In the first place, this will be, I hope, the equivalent of the previous talk, and since you watched the OpenACC session, we can compare and contrast OpenMP with OpenACC here. Feel free to ask questions as we go along, or keep them to the end and we'll have a question-and-answer session.
Here's the website, openmp.org. It looks quite a bit like the OpenACC one, for a reason: OpenMP has been around for over 20 years, and it is designed to be the parallel programming API. But back to our discussion from day one.
Does that mean that if I have OpenMP code that I've been using on a CPU for the last 10 years, it's going to perform well on the GPU? The answer is, unfortunately, no.
The architectures are just so different that things you've been taking advantage of on the CPU just will not perform very well on the GPU. One reason is that we have more levels of parallelism.
Most of the OpenMP code that you have today has a single level: it's omp parallel for, or omp parallel do. The other reason is that, as we talked about yesterday, CPUs have so much more capability.
People have gotten used to having a pretty extensive amount of work under the compute function here, so there could be things in there like dynamic memory allocation and lots of stack usage.
What am I trying to say: barriers; OpenMP has thread-local storage, or threadprivate data (I always get the two confused; one is an OpenMP thing and one is more of a CPU thing). And it's got a fork-join model, which means that in CPU OpenMP the threads are pretty much spawned at the beginning of the program and they kind of just hang around; you hit an OpenMP region and the threads fire up and do some work.
Then you hit the barrier, or the end of the region, and they kind of just go to sleep and wait for the next amount of work to come up. CUDA and GPU programming is a lot different from that. The only thing that really stays around is your data in the GPU memory; other than that, the kernel grabs resources, computes, and then the results are stored.
So the top part on the left is the OpenMP pragma syntax, not really any different from OpenACC, by design; again, OpenACC borrowed lots of things from OpenMP. The bottom left, though, shows the things that you should be concerned with, or studying, because they are the keys to GPU offload: they've added some new constructs, target, teams, and distribute. These are the major additions for GPU acceleration.
What target does, and this is the one that you will always include in your data and compute sections, is start the offload: it maps the variables to the device and begins execution.
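For illustration, here is a minimal sketch of that in C; the names (n, x, y, a) are hypothetical, not from the slides:

    #include <stdio.h>

    int main(void) {
        const int n = 1000;
        float x[1000], y[1000], a = 2.0f;
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* target starts the offload: x and y are mapped to the device
           and execution begins there. Note there is no parallelism yet;
           the loop runs sequentially on the device until teams and
           parallel are added. */
        #pragma omp target map(to: x) map(tofrom: y)
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];

        printf("y[0] = %f\n", y[0]);
        return 0;
    }

With the NVIDIA compilers this would be built with something like nvc -mp=gpu.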
It does not do any parallelism by itself, so they've added another construct called teams, and teams creates the teams for execution.
This is all pretty hard and fast: teams maps to OpenACC gangs, or to blocks in the grid, in our implementation and, I think, in all OpenMP offload implementations. Then, after teams, distribute says to work-share the work that follows among the teams that I've created. And then we're back to the original OpenMP directives: parallel, which has been around since the beginning, in our implementation creates the CUDA threads within the team, so that's the threads in a thread block; and then parallel do, or for, says to work-share the work below it among the threads in the team.
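Putting those pieces together, a sketch of the fully combined construct in C (hypothetical names):

    /* target offloads, teams creates the league (thread blocks in the
       NVIDIA implementation), distribute splits the iterations across
       teams, and parallel for splits them across the threads within
       each team */
    void scale(int n, double *v, double s) {
        #pragma omp target teams distribute parallel for map(tofrom: v[0:n])
        for (int i = 0; i < n; i++)
            v[i] *= s;
    }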
Lots of applications have been written using OpenMP on the CPU and OpenACC on the GPU, and I think those same applications could still use OpenMP without target on the CPU and OpenMP with target on the GPU. I haven't seen it yet, but there's no reason that can't happen. In that case, this diagram is maybe a little misleading, in that with OpenMP on the CPU there's probably more than five percent of the code involved.
So what's a programmer to do? The first attempt might just be to insert target teams distribute where you already have OpenMP directives. This is like the Laplace code, and we'll go through this example quite a bit. Before, where I had pragma omp parallel for on the outer i loop, I can just insert target teams distribute, and you know, it's not terrible.
What we will do is create a kernel for that and divide the outer loop among both teams and threads. Using -Minfo you'll get a message like: parallelized this across teams and threads, using 128 threads and a static schedule.
So if your code has a whole bunch of one-dimensional loops and that's how it is structured, you may be okay. The thing to worry about, then, is, like we talked about before: do you have long sections of code under the omp parallel, and are you just going to overwhelm the GPU with all the resources that you expected to use?
In this case, to get more parallelism and get coalesced accesses, you could break up the directive and use parallel for on the inner loop. I think this is a thing that a lot of people have tried, and it probably works pretty well on multiple compilers.
On the outer loop, I distribute the work across the gangs using target teams distribute, and on the inner loop I distribute it across the threads in a gang using omp parallel for. This gives pretty good performance, actually; almost as good as it gets. What we've done now is, since in C the rightmost index is the leading (contiguous) dimension, put the leading dimension on the thread index.
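As a sketch in the spirit of the Laplace example (SIZE, A, and Anew stand in for the slide's variables; in the real code the arrays would live in an enclosing data region):

    #define SIZE 4096
    double A[SIZE][SIZE], Anew[SIZE][SIZE];

    void sweep(void) {
        /* outer loop across teams (gangs) ... */
        #pragma omp target teams distribute map(to: A) map(tofrom: Anew)
        for (int i = 1; i < SIZE - 1; i++) {
            /* ... inner loop across the threads of each team, so
               consecutive threads touch consecutive j, the contiguous
               dimension in C (coalesced accesses) */
            #pragma omp parallel for
            for (int j = 1; j < SIZE - 1; j++)
                Anew[i][j] = 0.25 * (A[i][j-1] + A[i][j+1] +
                                     A[i-1][j] + A[i+1][j]);
        }
    }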
The loop at line 69 is a parallel loop, and that's parallelized across threads, so you're running on the GPU and it's great. Unfortunately, this is not what you want to do on the CPU. What we've done with this example is kind of cripple the CPU performance, because we've moved the CPU parallelism to the innermost loop, so we're doing this fork-join operation for every innermost loop, j = 1 to size - 1.
When you compile this for the CPU using our compiler, and I think most compilers, you'll see that the outermost loop is across teams. I know for sure our compiler only creates one CPU team, and I believe most compilers only generate one CPU team. The reason we can't do anything else there is that the OpenMP spec, as far as I understand, does not allow barriers, or things like the single construct, across the target teams dimension.
All the existing CPU code out there uses those types of things on the parallel for dimension, and that's where barriers can occur. So we're really sort of stuck here as far as how to create a portable schedule for this nested loop that works well on both CPU and GPU.
There have been lots of papers and proposed solutions to this over the years. In fact, there was a pretty popular paper a couple of years ago that just said: use preprocessor directives. One thing that they have added to the OpenMP spec is this notion of a metadirective, and I am struggling with this a little bit myself.
We don't really have this working in a released compiler yet; we have something that I think works correctly that we're going to have in our next release, 22.2. With a metadirective, you will be able to have a directive based on different targets. Here the syntax is: when the target device is of kind GPU, use target teams distribute with a reduction; when it is not of kind GPU, or by default, use parallel for with a reduction. And in the other one you can just say: if it's GPU, add parallel for; otherwise don't do anything.
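A hedged sketch of that syntax in C (OpenMP 5.x metadirective; exact selector spellings were still settling across compilers at the time, and the names here are hypothetical):

    double dot(int n, const double *x, const double *y) {
        double sum = 0.0;
        /* on a GPU device, offload with a reduction; otherwise fall
           back to a plain host parallel for */
        #pragma omp metadirective \
            when(device={kind(gpu)}: target teams distribute parallel for \
                 map(to: x[0:n], y[0:n]) reduction(+: sum)) \
            default(parallel for reduction(+: sum))
        for (int i = 0; i < n; i++)
            sum += x[i] * y[i];
        return sum;
    }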
So this is an okay solution; I think people will use it. I believe there are ways that you can put the metadirective inside another high-level macro, maybe in an include file, and then kind of clean up the syntax and not have to duplicate it with every single kernel. I'm struggling a little bit because this is not really out in the wild yet, so I don't have a lot of experience with it; in our compiler, when I used the dev build, this worked for me.
This is the descriptive versus prescriptive argument. Descriptive means saying these loops are parallel, so do what you want; prescriptive is more the OpenMP model, which says break up the work this way on these loops. I don't really want to get into it any more than that, but our compiler has been around quite a while, and we like, and hope, that people use the more descriptive model, which gives a little bit more flexibility to the compiler, as part of NVIDIA.
Alternatively, like below (this is almost like an OpenACC slide that I had), you can specify omp target teams, you know, begin and end, and then you can have omp loop just inside of that. We'll generate one parallel kernel and map it, hopefully very efficiently, onto the GPU, giving the compiler a little bit of flexibility in the teams and threads that it uses.
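Sketched in C (hypothetical names): one descriptive construct around the nest, and the compiler chooses the schedule:

    void sweep_loop(int n, double *restrict a, double *restrict anew) {
        /* one kernel; the compiler decides how to spread the two loops
           over teams and threads */
        #pragma omp target teams loop map(to: a[0:n*n]) map(tofrom: anew[0:n*n])
        for (int i = 1; i < n - 1; i++) {
            #pragma omp loop
            for (int j = 1; j < n - 1; j++)
                anew[i*n + j] = 0.25 * (a[i*n + j-1] + a[i*n + j+1] +
                                        a[(i-1)*n + j] + a[(i+1)*n + j]);
        }
    }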
So here there's no need for metadirectives or #if macros; it provides target-specific parallelism in the same executable, so the claim goes, and in the labs you can actually try this for yourself.
Using our -Minfo messages, you can see that we're generating NVIDIA GPU code using teams and threads, and when we generate multicore code, the outer loop is parallelized across the CPU threads, because we've given the compiler the flexibility to schedule it the right way for the different hardware.
There are bind(teams), bind(parallel), and bind(thread) clauses on omp loop, and this is kind of the equivalent of being able to say, in OpenACC: this is a gang loop, this is a worker loop, and this is a vector loop.
There are only two levels of parallelism in OpenMP, teams and parallel: teams, again, is across the thread blocks; parallel is within the block, across the threads; and bind(thread) is equivalent to OpenACC's loop seq.
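A sketch of those bind clauses on a matrix-multiply style nest (hypothetical names; the exact binding rules are per the OpenMP 5.x loop construct):

    void matmul(int n, const double *a, const double *b, double *c) {
        #pragma omp target teams map(to: a[0:n*n], b[0:n*n]) map(tofrom: c[0:n*n])
        {
            #pragma omp loop bind(teams)        /* across thread blocks (gangs) */
            for (int i = 0; i < n; i++) {
                #pragma omp loop bind(parallel) /* across threads in the block */
                for (int j = 0; j < n; j++) {
                    double sum = 0.0;
                    /* bind(thread): sequential per thread, the analogue
                       of OpenACC's loop seq */
                    #pragma omp loop bind(thread)
                    for (int k = 0; k < n; k++)
                        sum += a[i*n + k] * b[k*n + j];
                    c[i*n + j] = sum;
                }
            }
        }
    }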
So you can have multiple omp loop bind(parallel) loops, one after the other, and on the GPU we will do the appropriate sync operation so that the second loop can read the results generated by the loop preceding it, assuming it's within the same team.
If you use the target teams loop construct, you can also adjust the kernel launch parameters, similar to OpenACC: where we had num_gangs and vector_length, in OpenMP it's num_teams and thread_limit.
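For example (the values here are illustrative only):

    void saxpy(int n, float a, const float *x, float *y) {
        /* cap the league at 1024 teams of at most 128 threads each,
           the OpenMP spelling of num_gangs / vector_length */
        #pragma omp target teams loop num_teams(1024) thread_limit(128) \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }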
It's not just for target teams loop; you can also specify thread_limit and num_teams for target teams distribute as well. I think we will accept thread_limit and num_teams in almost all cases; there are some cases where we will not. I'm trying to remember what it is; I think it's if you just say omp target without teams.
In our experience, the compiler does a pretty good job. With our OpenMP reduction implementation, we have found times where we generate too many teams; our reduction implementation in OpenMP currently uses atomic operations, and you may find that limiting the number of teams gives better performance. This is something you can also try in the lab exercise today; I think one version of the lab has atomic operations.
The collapse clause is the same as in OpenACC, probably by design; I think OpenACC just followed what was in OpenMP.
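A quick sketch (hypothetical names): collapse(2) fuses the i and j loops into one iteration space before the work is divided:

    void add2d(int n, int m, double *a, const double *b) {
        #pragma omp target teams distribute parallel for collapse(2) \
                map(tofrom: a[0:n*m]) map(to: b[0:n*m])
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                a[i*m + j] += b[i*m + j];
    }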
I think we talked about this before in the OpenACC talk; it's exactly the same. Calling user routines in device code, though, is more complicated in OpenMP than in OpenACC, and we are working on this.
You can see the example I showed with OpenACC, where we had routine seq, routine vector, and routine gang. In OpenMP you just use omp declare target, and it's up to the compiler and the runtime to kind of do the right thing, and, quite frankly, we're struggling a little bit with that. In this case we have subroutine fv on the right here, with declare target, and it has an omp parallel do.
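In C, the shape of that example is roughly this (hypothetical names):

    /* routine compiled for the device; the parallel for inside it is
       "orphaned" in the sense described next: it is not lexically
       inside the target construct that calls it */
    #pragma omp declare target
    void fv(int n, double *v, double s) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            v[i] *= s;
    }
    #pragma omp end declare target

    void driver(int n, double *v) {
        #pragma omp target map(tofrom: v[0:n])
        fv(n, v, 2.0);
    }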
This is called an orphaned parallel operation; it's orphaned in the OpenMP sense because it's not within the original loop, or the original begin-kernel/end-kernel block. Here's the error that we give in 21.11. I believe that in 22.2, our next compiler, we will handle this case; we've been working hard on this kind of support for orphaned parallel operations, so there's work to be done there. It's unfortunate, I think, that OpenMP decided not to help out the compiler vendors a little bit more here.
Reduction clauses: I think this is almost exactly the same; OpenACC borrowed heavily from OpenMP here, and OpenMP didn't change. So again, the reduction can occur over all the teams in the entire kernel, or within a team, and the compiler will generate the proper code in each case.
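A minimal sketch of a kernel-wide reduction in C (hypothetical names):

    double sum_array(int n, const double *x) {
        double sum = 0.0;
        /* the compiler combines the per-team partial sums; a variable
           in a reduction clause on a target construct is implicitly
           mapped tofrom */
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) reduction(+: sum)
        for (int i = 0; i < n; i++)
            sum += x[i];
        return sum;
    }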
Atomic operations: I haven't talked about atomic operations, but they are supported in both OpenACC and OpenMP, and they are actually used quite a bit. They're kind of considered maybe a more advanced topic for an introductory talk, but atomic ensures that a specific storage location is accessed atomically, hence the name. This prevents race conditions, and you can actually implement reductions with it.
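A sketch of the kind of case described next, where the updated location is only known at run time (histogram-style; names are hypothetical):

    void histogram(int n, const int *key, int nbins, int *bins) {
        #pragma omp target teams distribute parallel for \
                map(to: key[0:n]) map(tofrom: bins[0:nbins])
        for (int i = 0; i < n; i++) {
            int j = key[i] % nbins;   /* index computed at run time */
            #pragma omp atomic update /* two threads may hit the same bin */
            bins[j]++;
        }
    }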
That's the case here, where j is kind of a runtime function, or read from another array, so you don't know for sure whether I'm updating locations that could be updated from anywhere across the GPU. On the other side, hackathons have shown the need for double complex atomic updates, it seems.
We've hit that in, well, BerkeleyGW almost every time I mentor that team, and in some other chemistry codes. For most purposes it's okay, and our workaround is usually to do the real and imaginary parts separately, because they're just doing atomic sums and nobody actually reads the result until the kernel is finished. The hardware itself does not have a two-double atomic update, so you have to break it up into two.
So why do we encourage users toward the descriptive rather than the prescriptive model? Our omp loop implementation more directly leverages years of our OpenACC scheduling and kernel generation, and OpenMP inside parallel for allows some things that we view as parallelism-limiting, like the master directive, single, barriers, etc., or OpenMP API calls; lots of code that people are porting may call things like omp_get_thread_num or something like that.
What we've found, though, is, you know, the converse: the CUDA toolchain does a pretty good job of removing, or at least minimizing, the overhead that we have to insert for prescriptive OpenMP. But we haven't gained a lot of experience yet with complicated kernels, so whether that still holds up or not, we'll get to see. I think the jury is still out.
I think you can do a lot of good work with the prescriptive model; whether it's portable to all platforms or not is another question, and we'll also have to wait and see on that.
I should say, part of the acceptance of our OpenMP compiler at NERSC is that a set of performance benchmarks be within 90 percent or more of OpenACC, and we've reached that. So it's not like it's terrible all the time.
There are data clauses, again similar to OpenACC, with some additions: instead of copyin and copyout there's map(to:), map(from:), and map(tofrom:), and instead of OpenACC's create there's map(alloc:).
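The correspondence in directive form, as a sketch (hypothetical names; parallelism clauses omitted to keep the focus on the data clauses):

    void step(int n, const double *in, double *out, double *tmp) {
        /* OpenACC copyin  -> map(to:)
           OpenACC copyout -> map(from:)
           OpenACC copy    -> map(tofrom:)
           OpenACC create  -> map(alloc:)  (device-only scratch) */
        #pragma omp target map(to: in[0:n]) map(from: out[0:n]) \
                           map(alloc: tmp[0:n])
        {
            for (int i = 0; i < n; i++) tmp[i] = 2.0 * in[i];
            for (int i = 0; i < n; i++) out[i] = tmp[i] + in[i];
        }
    }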
Some of the additions are the always modifiers: there's map(always, to:) and map(always, from:). I laughed when I first saw this, and when we started playing around with it I called them "always slow". So be careful if you find that you are really relying on map(always, to:) or map(always, from:).
There are unstructured data directives. Here's the basic example; again, it's like a one-to-one mapping between OpenACC and OpenMP, just with different syntax. When you end the unstructured data region you can delete the data, just like OpenACC, and there's an addition in OpenMP called release.
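Sketched in C (hypothetical names), the unstructured data lifetime looks like:

    #include <stdlib.h>

    double *create_field(int n) {
        double *a = malloc(n * sizeof *a);
        /* allocate device storage now; no copy */
        #pragma omp target enter data map(alloc: a[0:n])
        return a;
    }

    void destroy_field(int n, double *a) {
        /* copy results back and remove the device copy; use
           map(release: a[0:n]) to just drop the reference instead */
        #pragma omp target exit data map(from: a[0:n])
        free(a);
    }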
OpenMP talks a little bit more about the reference counts, but it's actually implemented the same way in OpenACC and OpenMP. The present table that I talked about earlier has reference counts for how many times the data has been referenced, or kind of pushed onto the stack, and when you leave a data region it just decrements the reference count.
Target update is again exactly like OpenACC, just different syntax: omp target update from, and so on. You might need this before you use MPI_Send or something like that. Array shaping is again the same between OpenACC and OpenMP, and here's the same slide, just to show the corresponding data directives between OpenACC and OpenMP.
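For instance, around a halo exchange (hypothetical names; the MPI calls are not shown):

    void exchange_halo(int m, double *halo) {
        /* refresh the host copy before the MPI calls ... */
        #pragma omp target update from(halo[0:m])
        /* ... MPI_Send / MPI_Recv on the host buffer here ... */
        /* ... then push the received values back to the device */
        #pragma omp target update to(halo[0:m])
    }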
Asynchronous execution is a little different from OpenACC. OpenACC had queue numbers that map almost one to one onto streams; OpenMP uses depend clauses, and what you put in a depend clause is really just a marker. It's convenient to use a variable, and then you use that same variable in depend clauses, marking whether the dependency is in or out, and you also add a nowait clause for asynchronous behavior.
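A minimal sketch of that pattern in standard OpenMP (hypothetical names; the variable a serves only as the dependency marker):

    void pipeline(int n, double *a) {
        #pragma omp target enter data map(to: a[0:n])

        /* both regions are deferred (nowait); depend(inout: a) orders
           the second after the first */
        #pragma omp target teams loop nowait depend(inout: a)
        for (int i = 0; i < n; i++) a[i] = 2.0 * a[i];

        #pragma omp target teams loop nowait depend(inout: a)
        for (int i = 0; i < n; i++) a[i] = a[i] + 1.0;

        /* the host blocks here until both target tasks have finished */
        #pragma omp taskwait
        #pragma omp target exit data map(from: a[0:n])
    }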
It's worth going through this example on the right, because, again, it's important: most applications call CUDA libraries, most CUDA libraries take streams, and you want the work that you do in an OpenMP kernel to play nicely with the stream operations in your library functions.
So it's kind of forward-looking: if I call omp get cuda stream, it returns the stream that I can subsequently use; it exposes, actually, the CUDA stream, so I can pass it to cufftSetStream, instructing the cuFFT library to use that stream. Then I can use the OpenMP depend clause on the stream variable, and nowait, to put all of this work into the same asynchronous stream.
So this target update to the device will use that stream under the hood; the cuFFT calls, we've specified those to use the stream; the scaling of c will use the stream; and then the target update reading the result back will use that CUDA stream too. Then, finally, I say omp taskwait, and that's where the synchronization will occur, and then the result is back on the CPU.
Passing device pointers to CUDA libraries is very similar between OpenACC and OpenMP. In OpenACC, remember, it was host_data use_device; in OpenMP it's target data use_device_ptr, but otherwise it works exactly the same way.
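A sketch with cuBLAS (assuming the cublas_v2 API; error checking omitted and names hypothetical):

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    void daxpy_dev(cublasHandle_t h, int n, double alpha,
                   const double *x, double *y) {
        #pragma omp target data map(to: x[0:n]) map(tofrom: y[0:n]) \
                                use_device_ptr(x, y)
        {
            /* inside this block x and y hold device addresses, so the
               library operates directly on the GPU-resident copies */
            cublasDaxpy(h, n, &alpha, x, 1, y, 1);
            /* cuBLAS is asynchronous; make sure the result is ready
               before the region end copies y back to the host */
            cudaDeviceSynchronize();
        }
    }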
Fortran array syntax in device code is basically not really supported in our OpenMP compiler, unfortunately. There are a couple of ugly workarounds: you can say target teams loop and create a loop that goes from one to one, and we will accept the array syntax there and kind of do the right thing, or you can explicitly rewrite the array syntax, here on the left, h(:) = 0.0, as an explicit loop. So we don't have a good solution for Fortran array syntax at the high level in OpenMP yet. This is my last slide.
I've mentioned a few times that our OpenMP compiler is still sort of a work in progress, and in our next release, 22.2, which will be out in February, these are the things we're working on. OpenMP and OpenACC both define array reductions in the spec: a reduction not just on a scalar but on an array element, an entire array, or a section of an array, and we've been working on that for a while.
The target task nowait, like the example I showed, and how that maps to CUDA streams, should be pretty solid in 22.2, and again, that's important for interoperability with CUDA libraries. Support for orphaned parallel, like the example I showed, a parallel loop in a user function called from a kernel, should be working in most cases in 22.2. We're still working through some of the metadirective support issues, and we're always working on performance.
Trying to get the commonly used constructs to work as well on our NVIDIA hardware as we possibly can is our goal. So I think that's a wrap on OpenACC and OpenMP.