From YouTube: Programming Models for GPU

Description: Part of the Using Perlmutter Training, Jan 5-7, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/using-perlmutter-training-jan2022/
So you have a small number of very fast threads that can run at high frequency. You have large caches and a lot of silicon dedicated to supporting things like branching and switching back and forth between different types of code.
On the other hand, a GPU is a throughput-oriented device, so it's really suited for parallel work, particularly when you're doing the same operation on many elements: grid points, particles, anything where you're applying the same thing to many, many different elements of data is usually a good fit. It has many more threads, but individually each of those threads is less powerful than a CPU thread. The memory capacity is smaller, but much, much faster, as Jack pointed out.
This is a heterogeneous model: you're still executing serial CPU code, but interspersed with that you're launching, or offloading, work to a device. What this means is that, in general, you should keep latency-sensitive and serial work on the CPU, and, as Jack mentioned, the cost of moving data between the device and the host can be quite high, so it's always best to keep the data wherever it's used, either device or host.
So what does the landscape of these different models look like? I somewhat arbitrarily chose two axes to slice this across. On the horizontal axis we have ease of use, or level of control, which is also a bit of a proxy for the number of features, or what your possibilities are. All the way on the right we have CUDA, which is the native programming model for NVIDIA GPUs, but it's also the lowest on the portability axis, because there's really one main compiler (LLVM can do it too, so one or two compilers that can support CUDA), and it really only runs on NVIDIA hardware. Still offering quite a lot of control and features, but a little less verbose, I'd say, than writing raw CUDA, are the C++ frameworks.
These are things like Kokkos or SYCL, and we'll touch on those as well. Next, a bit easier to use than those in several situations, in my opinion, are the directive-based models. If you've been writing OpenMP code on CPUs, these are similar: you have a set of directives that allow you to offload work as well, and OpenACC is a similar story to OpenMP.
It's a directive-based model. And then finally, I've put some Fortran and C++ logos up here, because there's increasing support for offload and parallelism directly in those standards. In C++ there's the parallel standard template library, many of whose features showed up in C++17, and this can be a really powerful approach that allows you to express parallelism that can run on many different platforms; I think even Microsoft's compilers support it. It's the same story for Fortran: there's do concurrent.
So let's start with the native model, CUDA. I think this is a good place to start, because it really serves as a reference point for all the other models, and knowing what's going on in CUDA, at least at a high level, can help you understand what's happening behind the scenes in the higher-level models.
In terms of benefits and pros and cons, the obvious pro is that CUDA is co-designed with NVIDIA's hardware, so you typically get full control and direct access to essentially every feature of an NVIDIA GPU by using CUDA.
So, starting with CUDA from the very basics: CUDA C, or C++, is an extension of the base language, and the real key thing it provides is this extension called a kernel. A kernel is like a regular function, except that it's executed some number of times in parallel by different CUDA threads, and you indicate that you have a kernel with the __global__ specifier.
One kernel consists of a grid of blocks, and this grid of blocks can be one-, two-, or three-dimensional. Then, within each block, you have a set of threads that can also be indexed in one, two, or three dimensions. When you launch a kernel, you say how many blocks you want and how many threads per block, and those parameters can either be integers or dim3 types for the higher dimensions, and that lets you really map the thread hierarchy onto your problem.
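The kernel code on the slides isn't captured in the transcript, but here is a minimal sketch of what a 1D kernel definition and launch look like; names such as vec_add and the sizes are illustrative, not from the talk:

```cuda
// Minimal CUDA sketch: a __global__ kernel launched over a 1D grid of blocks.
#include <cstdio>
#include <cuda_runtime.h>

// The __global__ specifier marks this as a kernel: each CUDA thread runs one
// iteration, identified by its block and thread indices.
__global__ void vec_add(const float* x, const float* y, float* z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = x[i] + y[i];   // guard: the grid may overshoot n
}

int main() {
    const int n = 1 << 20;
    float *x, *y, *z;
    // Managed (unified) memory keeps the sketch short; cudaMalloc plus
    // cudaMemcpy is the fully explicit alternative.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&z, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;                         // threads per block
    int blocks  = (n + threads - 1) / threads; // blocks in the grid
    vec_add<<<blocks, threads>>>(x, y, z, n);  // <<<grid, block>>> launch syntax
    cudaDeviceSynchronize();                   // kernel launches are asynchronous

    printf("z[0] = %f\n", z[0]);               // expect 3.0
    cudaFree(x); cudaFree(y); cudaFree(z);
    return 0;
}
```

The <<<blocks, threads>>> launch parameters are where the grid and block dimensions described above are supplied; dim3 values would be used instead of plain integers for 2D or 3D launches.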
That thread hierarchy comes with a corresponding memory hierarchy that Jack touched upon. The global memory (you might hear HBM) is the memory on the GPU, and it's accessible to different kernels, so multiple kernels, or different grids, can each access that global memory.
This talk isn't a deep dive into the details of CUDA programming; there's a huge amount of content available online about this.
I do recommend that anybody who's going to do NVIDIA GPU programming should read at least the first few sections of the CUDA programming guide. That will really help baseline your knowledge of what's going on with the device, and I can definitely recommend it; it's possibly more accessible than you think. Then there's also the NVIDIA blog and the GTC talks and slides series that I definitely recommend. And then I'll end with a tip, or maybe a warning.
There's a huge amount of content available if you go online and look for CUDA. I'll just caution you to check the dates of whatever content you're looking at, because CUDA has definitely changed, in some cases significantly, over the years, with new features, relaxed restrictions, et cetera. So definitely check the date and make sure you're reading a modern source that doesn't give you outdated guidance.
So, moving along through the landscape here, on to C++ frameworks. These are usually built as cross-platform abstraction layers: they give you a set of modern C++ abstractions and primitives that you can compose to express your application, and they tend to target accelerators and CPUs from multiple vendors.
You have to write C++; that's not really a downside, but it could be if your application is in Fortran, say. They require some amount of buy-in for a good interface setup, which can be tricky, and they can often come with a learning curve. In some cases these are really new or up-and-coming projects that may or may not have direct vendor support, so if ecosystem maturity, and the ability to contractually pay somebody to work on it for you, is important, that could be a concern.
Kokkos is a project largely run out of Sandia, but it has broad support within the Department of Energy, and it's built as an ecosystem that includes a programming model and abstractions, and then a number of libraries and tools that come with it. SYCL is a cross-platform abstraction layer for which, at this point in time, a lot of the support is coming from Intel.
SYCL is the native programming model for the upcoming Aurora system, but it's not proprietary: SYCL is a standard run by the Khronos Group, which is not Intel, so it's an independent, open standard. It'll be familiar to anyone who's done work with OpenCL; it's kind of like the C++ version of that.
The NNSA labs in particular have really embraced Kokkos, and it's all open and available. You can go to github.com/kokkos, where they have a really great set of tutorials and examples available, and an extremely helpful Slack channel. I've also included here the reference to their most recent paper, which really outlines all the different capabilities that are available, and I can also recommend googling for this GTC talk.
So, just to dive in and give a little bit of a flavor of what Kokkos is about: some of the main abstractions are views, memory spaces, and execution spaces. A view is like a shared pointer to multidimensional data that lives in a particular memory space, and it comes with a layout; what a layout means is basically which index is the fast one.
So here's the vector addition example, which you'll see everywhere, implemented in Kokkos. One thing I want to point out is that this entire code is, you know, "Kokkos-ified", if you will.
And finally, I even use the Kokkos reduction pattern to compute a final sum. At this point I want to mention that in this toy 1D example I don't really take advantage of the layout abstraction, but with more dimensions, that hiding of the organization of data in memory allows for good cache utilization on CPUs and for coalesced memory transactions on GPUs.
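The slide's code isn't reproduced in the transcript; a minimal sketch in the same spirit (a 1D, view-based vector add followed by a parallel_reduce sum, with illustrative names) might look like this:

```cpp
// Minimal Kokkos sketch: Views for the data, parallel_for for the add,
// parallel_reduce for the final sum. Not the slide's exact code.
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        // A View is a shared-pointer-like handle to (possibly multidimensional)
        // data in a memory space, with a layout appropriate to the device.
        Kokkos::View<double*> x("x", n), y("y", n), z("z", n);

        Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(int i) {
            x(i) = 1.0; y(i) = 2.0;
        });
        Kokkos::parallel_for("add", n, KOKKOS_LAMBDA(int i) {
            z(i) = x(i) + y(i);
        });

        double sum = 0.0;
        // The reduction pattern mentioned above: each thread accumulates
        // into its own lsum, and Kokkos combines the partial results.
        Kokkos::parallel_reduce("sum", n, KOKKOS_LAMBDA(int i, double& lsum) {
            lsum += z(i);
        }, sum);
        printf("sum = %f\n", sum);  // expect 3 * n
    }
    Kokkos::finalize();
    return 0;
}
```

The same source compiles for CUDA, HIP, OpenMP, or serial back ends; the execution and memory spaces are chosen at build time.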
This is actually joint work between NERSC, ALCF, and a company called Codeplay, where we're actively targeting support for the A100. That project is underway now and progressing really nicely, but it does mean that the support is brand new.
The feature set is really good. Again, DPC++, which is an implementation of SYCL, is the native model for Aurora, and that particular compiler, which is based on LLVM, is directly supported by Intel.
If you want to use SYCL at NERSC, we can also obtain some support for that through our contracts with Codeplay. Again, it's a slightly different flavor, but it's also based on modern C++, and it could be a good option for anyone who's really familiar with OpenCL: many of the concepts, like a queue, will feel really natural.
You have DPC++, which is based on open-source LLVM; like I mentioned, there are also proprietary compilers from Codeplay.
There's support for things like vector engines through that. This particular corner of the slide is where NERSC and ALCF are working together: right now this is a public fork of LLVM, but the eventual aim is inclusion in the main project, and we're working on targeting a back end that generates PTX code directly, so you can still get very high performance.
And since this is open source, anybody with a recent NVIDIA GPU will also get the benefits of this, although we are targeting the A100, and we're also developing a lot of extensions to enable access to some of the key A100 features, like the tensor cores and some of the asynchronous operations that may be familiar to those who have done a lot of CUDA programming recently.
Okay, so without going too much further: what does this code look like? If you squint a little bit, it's very similar to the Kokkos code shown earlier.
Here you just declare, sort of, shared pointers that can be used anywhere, and all of that buffer and accessor code is gone, which shortens the code significantly, though it could result in slightly less performance. So if you tried SYCL before and thought it was too verbose, the latest version definitely improves that situation.
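Again, the slide's code isn't in the transcript; a minimal SYCL 2020 sketch in the shared-pointer (unified shared memory) style described here, with illustrative names, could look like this:

```cpp
// Minimal SYCL 2020 sketch using unified shared memory (malloc_shared),
// i.e. the "shared pointer" style with no buffers or accessors.
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    sycl::queue q;  // a queue targets a device, much as in OpenCL

    // malloc_shared pointers are usable on both host and device.
    float* x = sycl::malloc_shared<float>(n, q);
    float* y = sycl::malloc_shared<float>(n, q);
    float* z = sycl::malloc_shared<float>(n, q);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Launch n work-items; each adds one element.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        z[i] = x[i] + y[i];
    }).wait();  // kernels run asynchronously, so wait before reading z

    printf("z[0] = %f\n", z[0]);  // expect 3.0
    sycl::free(x, q); sycl::free(y, q); sycl::free(z, q);
    return 0;
}
```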
As for SYCL at NERSC: it can compile and run today. We don't have a full module file available at this exact moment, but I'm happy to make one available to you, and we definitely want to hear if you're interested in SYCL; I definitely want to hear from you.
I want to touch briefly on the parallelism built into the languages themselves. In Fortran, if you write a do concurrent loop like this, or you use array intrinsics like matrix multiply, reshape, transpose, et cetera (and I think they're adding more all the time), then you can just compile with the -stdpar option with NVIDIA Fortran, and that will give you parallel code offloaded to the GPU, with no changes at all to your code; that's just 100% ISO Fortran. I think NVIDIA's might be the only compiler that does GPU offload, but there are a number of other compilers, like Intel's, that will generate parallel CPU code in some cases with these do concurrent constructs.
Switching back to C++: I think this is a little bit more full-featured than the support in Fortran at the moment. You have a bunch of really powerful parallel algorithms, like transform, transform_reduce, scans, for_each, et cetera; check out the numeric, algorithm, and execution headers on your favorite C++ reference site.
So here's an example of that, with, again, that same simple "let's add two vectors together". I make some vectors, fill them up with some data on the host, and then, on the device, I say I want to do the transform algorithm with the parallel unsequenced execution policy on the iterators of those vectors, and adding them together is the operation I want to do, expressed in this lambda here. And that's it: this will get turned into parallel code by nvc++ and execute on the GPU. You can compile the same code with Intel's compiler and get parallel CPU code, and so on.
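A minimal sketch of that pattern (illustrative names; compiled with, for example, nvc++ -stdpar for GPU offload):

```cpp
// Minimal C++17 parallel-algorithms sketch: std::transform with the
// par_unseq execution policy.
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f), z(n);

    // The lambda is the element-wise operation; the policy asks the
    // implementation to run it in parallel (and vectorized) if it can.
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), z.begin(),
                   [](float a, float b) { return a + b; });

    printf("z[0] = %f\n", z[0]);  // expect 3.0
    return 0;
}
```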
I won't be doing a deep dive on any of these, but here are just a few things to keep an eye on. atomic_ref is a great and powerful thing that allows some accesses to data to be atomic without requiring all accesses to be atomic.
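As a small illustration of that idea (a toy histogram; the names are mine, not the speaker's):

```cpp
// Minimal C++20 sketch of std::atomic_ref: the histogram bins are plain
// ints, but each increment is made atomic only at the point of update.
#include <atomic>
#include <vector>
#include <algorithm>
#include <execution>
#include <cstdio>

int main() {
    std::vector<int> data(1 << 20);
    for (std::size_t i = 0; i < data.size(); ++i) data[i] = i % 16;

    std::vector<int> hist(16, 0);  // ordinary, non-atomic storage
    std::for_each(std::execution::par, data.begin(), data.end(),
                  [&](int v) {
                      // Wrap just this access in atomic semantics.
                      std::atomic_ref<int>(hist[v]).fetch_add(1);
                  });

    // Elsewhere the same bins can be read or written non-atomically.
    printf("hist[0] = %d\n", hist[0]);
    return 0;
}
```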
Often you'll have some array that you don't want to access atomically in all situations, because that has a huge performance impact. There's also a new standard split-phase barrier, which is really useful for coordinating asynchronous work.
Speaking of iterators: there's often a need to iterate over multiple collections together, or to do something based on an index, and for that you need either a zip or a counting iterator. Instead of writing them yourself, you can usually get them out of a library like Thrust or Boost.
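For instance, a counting iterator can drive an index-based parallel loop; a small illustrative sketch using Thrust's counting_iterator (this pattern assumes an implementation, such as nvc++ stdpar or a host parallel STL, that accepts it):

```cpp
// Minimal sketch of an indexed parallel loop via thrust::counting_iterator,
// avoiding a hand-written iterator. Boost offers a similar iterator.
#include <thrust/iterator/counting_iterator.h>
#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f), z(n);
    float* xp = x.data(); float* yp = y.data(); float* zp = z.data();

    // The "collection" we iterate over is just the index range [0, n).
    std::for_each(std::execution::par_unseq,
                  thrust::counting_iterator<int>(0),
                  thrust::counting_iterator<int>(n),
                  [=](int i) { zp[i] = xp[i] + yp[i]; });

    printf("z[0] = %f\n", z[0]);  // expect 3.0
    return 0;
}
```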
Another thing to watch is mdspan, a multidimensional span object which is supposed to make working with multidimensional data, which happens quite often in science, better.
I also want to say this isn't an exhaustive list; there are tons of new proposals and features, but these are just the ones that happen to be on my radar. So it's a great approach, but there are definitely some caveats and limitations.
You're still waiting on really brand-new modern features to be implemented; you have to use a very modern version of C++, which could be an issue for some legacy codes; and then there are other cases, if you have hierarchical parallelism, or you're chasing that last bit of performance, where things can definitely be difficult with these standard language approaches.
Okay, and at this point I'd like to turn the presentation over to Chris, who's going to talk to us about OpenMP and directive-based models.
...thing at the top, like the window decoration for Mac, the close button, things like that. It's just a minor thing, don't worry.

It's probably your individual setup; it looks good to us.

Okay, sorry, I'm not sure how to fix that. I hope you can tolerate it.
I'll just continue. Right, so, yeah: I work for the Advanced Technology Group at NERSC, and this is about OpenMP for GPUs.
The first thing is to look at the OpenMP thread hierarchy for GPUs. OpenMP introduces some new directives in order to be able to use OpenMP on the GPU, and the first one I want you to become familiar with is the target directive. This is the directive that enables you to create a GPU kernel; it's what enables you to execute code on a device. An important consideration for GPUs is that we need to make use of the massive parallelism available, so we have to create two levels of parallelism.
What OpenMP has introduced is a form of coarse-grained parallelism that's suitable for GPUs, referred to as teams parallelism; then, later on, you use the familiar parallel directive to create the fine-grained parallelism. Using both together enables you to exploit the massive parallelism on the GPU.
An important thing you need to do with GPU programming is, obviously, moving data between the CPU and the GPU, which have distinct memory spaces: on the CPU you have your DRAM, and on the GPU you have your high-bandwidth memory. OpenMP manages the data in the GPU memory, which it refers to as the device data environment, using a combination of both implicit and explicit data management.
One thing to be aware of is that we have this single variable name x, but it's actually pointing to two separate variables: on the host we have the original variable, and on the device we have a corresponding variable.
Here we use the target teams distribute parallel for directive. This work-shares the work in the subsequent loop over all of the teams and all of the threads, and then, as before, we have this map clause to handle our data management, moving the data to the GPU and back from the GPU.
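A minimal sketch of that combined directive with a map clause (illustrative, not the slide's exact code):

```cpp
// Minimal OpenMP target-offload sketch. target creates a GPU kernel;
// teams distribute gives coarse-grained work sharing over teams;
// parallel for gives fine-grained sharing over the threads in each team.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    double* x = new double[n];
    for (int i = 0; i < n; ++i) x[i] = 1.0;

    // map(tofrom: ...) copies x to the device before the kernel and back after.
    #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] = 2.0 * x[i];

    printf("x[0] = %f\n", x[0]);  // expect 2.0
    delete[] x;
    return 0;
}
```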
We have this variable x, which is our mapped variable. This corresponds to Brandon's explanation of CUDA data stored in global memory: it's accessible by all of the threads in all of the teams.
One thing that Jack mentioned at the start of today is how important it is to minimize data movement between the CPU and the GPU in order to obtain high performance. What we can do with the OpenMP programming model is use a family of target data directives to keep the data on the GPU across multiple GPU kernels.
The data will now remain present on the GPU until we reach a corresponding exit data map, at which point we would actually free the memory on the GPU. Where this is now really useful is that we can have multiple GPU kernels that access this variable x on the device without any additional data movement. There's now no need for you to specify any map clauses, because what the OpenMP runtime will see is the usage of the original variable x; it will see that it has already been mapped, and then, when you're executing your GPU kernel, it will ensure that all references to x are to the corresponding device variable x. So it's a really powerful mechanism that OpenMP provides to minimize data movement.
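A minimal sketch of keeping data resident across kernels with enter/exit data (illustrative names):

```cpp
// Minimal sketch of the target data directives: x stays resident on the
// GPU across two kernels, with no per-kernel map clauses needed.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    double* x = new double[n];
    for (int i = 0; i < n; ++i) x[i] = 1.0;

    // Allocate and copy x to the device; it stays there until exit data.
    #pragma omp target enter data map(to: x[0:n])

    #pragma omp target teams distribute parallel for  // uses device x, no map
    for (int i = 0; i < n; ++i) x[i] += 1.0;

    #pragma omp target teams distribute parallel for  // second kernel, still no map
    for (int i = 0; i < n; ++i) x[i] *= 2.0;

    // Copy the result back and free the device copy.
    #pragma omp target exit data map(from: x[0:n])
    printf("x[0] = %f\n", x[0]);  // expect 4.0
    delete[] x;
    return 0;
}
```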
The OpenMP programming model was designed to make sure that there's no unnecessary data movement, so what it does is reference-count the mapped data in order to avoid expensive data transfers. You can see where this can cause problems in user codes.
We may then decide in the user code to update this variable on the CPU, and naively you'd expect that just specifying a map clause on your target region would propagate that value onto the GPU. But that's not actually the case, because all that has done is increment the reference count. The next two slides show two ways for us to fix this code.
Similarly, another method you can use is the always modifier on the map clause. What this does is force a data transfer irrespective of the reference count. So now the only way we've modified this map clause is to add this always keyword, and once again, in the GPU kernel, we would see the updated value.
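A minimal sketch of the pitfall and the always fix (the target update directive, shown at the end here, is the other standard way to force a transfer):

```cpp
// Minimal sketch of the reference-count pitfall: x is already mapped, so a
// plain map(to: ...) would NOT re-copy the updated host value.
#include <cstdio>

int main() {
    const int n = 1024;
    double x[1024];
    for (int i = 0; i < n; ++i) x[i] = 1.0;

    #pragma omp target enter data map(to: x[0:n])  // x mapped, ref count 1

    for (int i = 0; i < n; ++i) x[i] = 5.0;        // updated on the host only

    // A plain map(to: x[0:n]) here would just bump the count 1 -> 2 -> 1;
    // the 'always' modifier forces the host-to-device transfer regardless.
    #pragma omp target teams distribute parallel for map(always, to: x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] += 1.0;                               // device sees 5.0, not 1.0

    // Bring the result back to the host (the other fix: target update).
    #pragma omp target update from(x[0:n])
    printf("x[0] = %f\n", x[0]);                   // expect 6.0

    #pragma omp target exit data map(delete: x[0:n])
    return 0;
}
```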
That's something to be aware of. This was a really brief intro that just shows some of the compute and data-management considerations for OpenMP. For using OpenMP on Perlmutter, we recommend the NVIDIA compiler; this is for all C, C++, and Fortran applications. As mentioned earlier, we will soon have the Clang compiler, but this would only support C and C++ apps. For reference, in terms of what compiler options you would need to use:
C
There
was
a
presentation
on
day
one
building
and
running
gpu
applications
on
palomata,
and
also
we
have
this
web
page
at
nurse
which
not
only
goes
through
the
compiler
options.
It
also
includes
some
best
practices
for
how
you
can
get
high
performance
with
openmp.
Now, the loop directive: this has similar behavior to the distribute and for directives that we showed earlier, but it has one additional characteristic. Not only is it work-sharing; it's also making an assertion that the loop iterations are independent. This really enables the compiler to apply additional optimizations and deliver improved performance.
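A minimal sketch using the combined target teams loop form (illustrative):

```cpp
// Minimal sketch of the loop directive: it work-shares the loop, asserts
// the iterations are independent, and lets the compiler pick the mapping
// onto teams and threads.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    double* x = new double[n];
    for (int i = 0; i < n; ++i) x[i] = 1.0;

    #pragma omp target teams loop map(tofrom: x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] = 2.0 * x[i];

    printf("x[0] = %f\n", x[0]);  // expect 2.0
    delete[] x;
    return 0;
}
```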
I just want to move on quickly now to an OpenMP case study that we did. This looks at a QCD mini-app called SU3, and the highlight of our case study is that we managed to achieve 97% of CUDA performance on an A100 GPU using OpenMP and the NVIDIA compiler. We actually presented this work at the GTC conference last year; it was a joint presentation between me and Guray Ozen, who is a compiler engineer at NVIDIA.
Our key performance plot was this throughput plot, which is the performance metric for this SU3 benchmark, and we show several bars. The first just shows the performance we obtained on the CPU, which was 139 GFLOP/s.
We went through successive code optimizations, as well as simplifications, and we managed to achieve this 97% of CUDA performance, which is a really nice achievement. I decided to choose this particular case study because it was one case study where the loop directive enabled us to obtain the highest performance with the NVIDIA compiler.
If you have data structures with lots of pointers and double pointers, you can move this data to the GPU with just a few directives. If you tried to do this with CUDA and a runtime API, you would need dozens and dozens of lines of code; it's very, very burdensome trying to do this with APIs.
There are other kinds of productivity wins we've seen. OpenMP provides directives to work-share loops between both teams and threads; if you're using CUDA, you kind of have to do this manually based on the thread ID and the block ID, which is just an additional burden, just a pain to have to deal with.
Another kind of productivity win is that you can very easily fuse loops using the collapse clause. This is really nice for exposing the parallelism required to make use of the massive parallelism on the GPU; if you're using CUDA, you have to manually fuse the loops and then do some crazy integer arithmetic to figure out the multidimensional index space.
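A minimal sketch of collapse over a doubly nested loop (illustrative):

```cpp
// Minimal sketch of the collapse clause: the two nested loops are fused
// into a single n*m iteration space, with no hand-rolled index arithmetic.
#include <cstdio>

int main() {
    const int n = 1024, m = 1024;
    static double a[1024][1024];  // static keeps the 8 MB array off the stack

    #pragma omp target teams distribute parallel for collapse(2) map(tofrom: a)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            a[i][j] = i + j;

    printf("a[1][2] = %f\n", a[1][2]);  // expect 3.0
    return 0;
}
```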
Another benefit of OpenMP is that you have a reduction abstraction. This makes it easy to perform data reductions; if you're using CUDA, you typically need a library in order to obtain a high-performance data reduction. So, in summary, there are multiple productivity wins from using something like OpenMP over CUDA.
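A minimal sketch of an offloaded reduction (illustrative; in CUDA one would typically reach for a library such as CUB for the same thing):

```cpp
// Minimal sketch of the reduction clause: OpenMP generates the
// high-performance parallel sum for you.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    double* x = new double[n];
    for (int i = 0; i < n; ++i) x[i] = 1.0;

    double sum = 0.0;
    #pragma omp target teams distribute parallel for reduction(+: sum) \
        map(to: x[0:n]) map(tofrom: sum)
    for (int i = 0; i < n; ++i)
        sum += x[i];

    printf("sum = %f\n", sum);  // expect 1048576
    delete[] x;
    return 0;
}
```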
I want to just make a quick note about OpenACC. This is obviously an alternative directive-based approach; the concepts are very, very similar, with similar directives that occasionally have different names.
This is for kind of two reasons. The first is that, because it's a much more restrictive programming approach, you, the programmer, are forced to write much more GPU-friendly code; and the second is that, because it's a more restrictive programming approach, it's easier for the compiler to support the capabilities.
We showed in a paper at Supercomputing last year that a suite of NERSC OpenMP applications can achieve more than 90 percent of OpenACC performance, so there's really no reason to be concerned that OpenMP applications will perform poorly. At the same time, if you have an OpenACC application, there's no need for you to go out and quickly convert it to OpenMP.
But we are, of course, here to support you, and we're happy to engage and talk through any of the options that were presented today. We can also help you out if there's something you didn't see, or if you want to discuss its availability at NERSC, or what our recommendation is; we're happy to do that. I just wanted to end as well with a few call-outs: for those of you on the NERSC users Slack, there are a number of relevant Slack channels for the community of users out there who may also be writing code in the same model, so you can get in touch with people that way.
We're happy to have those discussions. And then, obviously, this was a very high-level overview, so keep an eye out for upcoming events that are targeting specific models and toolchains.