From YouTube: Current State of CUDA Compilers and SDK
Jeff Larkin (NVIDIA)
So hi everyone, my name is Jeff Larkin, and I run a team at NVIDIA that manages HPC programming models and standards and HPC software, so I'm going to talk to you about our HPC software stack. I apologize that my voice comes across a little bit froggy. If the audio drops very subtly during the talk, it's probably not the connection, it's probably me muting quickly, but I'll struggle through it with you. I'll stop my video now.
First, I want you to understand our vision for how you should program to an NVIDIA platform, and I want to talk as well about how the NVIDIA platform is more than just GPUs and what all it includes. I'm going to talk about the HPC SDK, which is how all of the NVIDIA software is installed on Perlmutter and on other systems as well, and make sure you understand what that provides for you, how you get it, and how you access it. And then I'm going to talk a bit about programming for machines like Perlmutter with standards-based approaches, which means not having to rely on domain-specific APIs or things like OpenMP, OpenACC, or CUDA, but actually being able to write directly in C++ or Fortran.
So the HPC software team has four major initiatives. You'll see a couple of these again when I get to the math libraries, but first we want to make using our platform completely seamless, and that means making it simple to take advantage of all of the available hardware. I include in this being able to write a native C++ or native Fortran application that can run on the CPU or the GPU, or even take advantage of things like the GPU tensor cores. All of these great hardware features should be exposed to you in a way that is easy and seamless to use. Next, it's important that we scale up. You saw a picture of the Perlmutter node; it has four GPUs on it.
So it's pretty rare to find a GPU machine with just one GPU on the node now; machines are getting larger and larger. So first the libraries need to enable you to scale up to all of the GPUs on your node, but also to full system scale, and some of our math libraries can now scale things like multi-dimensional FFTs or linear solvers up to full system scale for you. Next is working in domain libraries, and one that we've been working in a lot recently is quantum. We identify specific domains that have very specific needs and develop libraries to support those needs. For quantum, we developed libraries to support quantum circuit simulation; for things like signal processing, we developed software packages to support that. So we identify these important domains that have individualized needs and address those needs. And, lastly, Arm is an important part of our ecosystem. Although Perlmutter doesn't have any Arm nodes, I think you've probably seen that NVIDIA does have some Arm CPUs coming, and so we need to make sure that those are as well supported as x86 is today.
But looking at the broad NVIDIA platform, how do we expect you to program to this? I want to emphasize platform, because often people think of NVIDIA as the GPU company, and we certainly have made our name with our GPUs, but we also have, coming next year, our Grace CPUs. We also have, of course, some embedded CPUs as well, and our network interconnect through Mellanox InfiniBand. So we need to provide a programming strategy that addresses all of these.
It's really a disservice to the users if each of these technologies has a distinct way to program; it makes things much more difficult to use. The foundation of all of that is our accelerated libraries, and when I say accelerated libraries, people frequently think immediately of our math libraries, which are a really important feature. We provide things like linear algebra solvers, FFTs, random number generation, tensor solvers, things like that, and we do provide very highly tuned math libraries that improve over time. But those aren't our only libraries. Core libraries are things like libcu++ and Thrust, which provide high-level data structures and algorithms.
Communication libraries provide a well-tuned MPI, NCCL, and NVSHMEM, and there are even higher-level frameworks for data analytics, AI, and quantum circuit simulation as well. This is the foundation, and for many developers you could start by writing code to use these math libraries. So if you're doing large-scale FFTs, we've got a library for you; if you're doing large linear solvers, we have a library for you. But it also enables us to layer on top of that; it gives us a firm foundation for these other programming approaches.
I've broken them down into three high-level ideas. Now, many people assume that our goal is absolutely everybody writing in some form of CUDA, whether that's CUDA C++ or CUDA Fortran. These CUDA languages are where our language innovation happens: we can co-design them with our hardware, so whenever a hardware release has new hardware features, we can expose them directly there. But it's not necessarily the right programming model for everyone; it's a language for platform specialization. If you want to take advantage of all the hardware features we have, or want to optimize your application and specialize it to the GPU that you have in front of you, CUDA C++ and CUDA Fortran give you all the tools and knobs and bells and whistles to do that. But it's certainly not the only approach to our platform.
Writing in the ISO languages means your code is parallel first: whether you're coming to a GPU platform, a CPU platform, an FPGA, or a DSP, it doesn't matter; if your platform supports ISO C++, it's going to run out of the box. And that's a really exciting thing, because it means you're not porting your application to the platform; you're ready to run on day zero. Now, sitting between these approaches are compiler directives, which provide incremental opportunities, and there are really two main ways we see directives used; at the top here is OpenACC and at the bottom is OpenMP.
One way, you could use them like I'm showing here, where you write something in, say, Fortran do concurrent or the C++ standard library, and use compiler directives to incrementally and portably improve the performance. So here I'm taking control of my data movement. That enables you to write fewer directives than you would have otherwise, because the parallelism is handled by the language, but you can still optimize things like data movement in a portable way.
The other way compiler directives are popular is for legacy applications. If you have hundreds of thousands or millions of lines of code, it's unrealistic to expect you to rewrite it all overnight.
A
compiler
directives
provided
ice
beads
of
leveraging
your
existing
code
getting
as
much
as
possible
running
in
parallel
running
on
the
GPU
and
then
once
you're
running
that
on
the
GPU,
you
could
go
and
evaluate.
Well,
maybe
certain
parts
of
my
code
should
be
refactored
and
Rewritten
in
one
of
these
standard
approaches,
or
maybe
I
have
one
solver.
A
A
Mixed
together,
nicely
you're,
not
picking
one
swim,
Lane
and
sticking
to
it,
but
in
fact,
there's
existence
of
applications
that
do
the
bulk
of
their
work
and
do
concurrent
and
and
math
libraries,
and
then
it
sprinkle
in
some
directives
here
or
there
and
maybe
even
have
a
function
and
and
Fortran
and
Cuda,
and
all
of
that
will
mix
together
nicely.
So
so
our
goal
is
to
provide
you
with
the
accelerated
standard
language
as
a
way
to
write
your
code
once
and
and
expect
it
to
run
everywhere
and
anywhere.
Now, this is supported through a software product called the HPC SDK, and we've worked very closely with the folks at NERSC, and also with HPE, the system integrator, to make sure that this works really well on your platform. And if you looked at the slide earlier, you saw that the HPC SDK was the one software package that had green bars all the way across: it supports all of the programming models available.
The HPC SDK is a completely free product. Regardless of whether you have a GPU, you can download it and install it on your personal machine. It's available on Perlmutter and on all of the major supercomputers, and it's available on all of the major clouds, so you can right away begin to use it for free everywhere. It supports x86, Arm, and OpenPOWER, so it's a portable software stack: if today you're running on Perlmutter and next week you want to run on an Arm computer, by taking the HPC SDK,
all of the software libraries that you need are built in and ready to go right on that new platform. So we support all the programming models I discussed. It comes with four compilers: nvcc is the CUDA compiler, and nvc, nvc++, and nvfortran are the C, C++, and Fortran compilers, which we call our HPC compilers. There's a huge range of libraries; I can't even list them all here.
That includes communication libraries, and what's really nice with this package is that all of these pieces are tested to make sure they work well together. This entire software stack is well tested, so no matter where you go, you can expect that it will work seamlessly. And, lastly, we do provide profilers and debuggers, because it's really important for you as a developer to be able to understand:
okay, if I'm hitting an error or I'm hitting a performance issue, why am I hitting that? So, just to say it one more time: completely free, and downloadable from our website, as a container, via Spack, and on all the clouds. And we release every odd month, and the occasional even month as well, so our next release is expected in November.
An important part of that package is the HPC compilers. Now, I know many of you that have been around NERSC and the other DOE sites for a while are familiar with the PGI compilers. Several years ago, actually almost 10 years ago now, NVIDIA purchased PGI, and we've put a ton of work into those compilers, so much so that we can't even call them PGI anymore; we call them the NVIDIA HPC compilers. So they do have that lineage, but they've come a long way.
Since then we've added features like the standard parallelism, additional programming models, and additional platforms. It supports all of our GPU platforms, and so you can automatically, in some cases, offload to GPUs with all of the programming models I discussed. It is a great CPU compiler as well; you don't need a GPU to take advantage of it, and it supports compiler directives and vectorization. And, lastly, it's multi-platform.
The other important piece of the HPC SDK is our math libraries. The first two items here I've kind of covered already, but the additional high-level initiatives are that we are building libraries to be more composable. As our GPUs have gotten larger and larger, in some cases having a single matrix that's large enough to saturate the entire GPU may not be realistic for scientific applications. So we've built composable functions where you can call the libraries from within your existing kernels, which saves you data movement costs, launch overheads, and things like that. And, lastly, we're making sure that when Grace does come, we provide the best possible performance libraries for the Arm CPUs as well.
I'll highlight two recent enhancements; in both cases these are our multi-GPU libraries. So here is cuSOLVERMp. cuSOLVER does linear solvers: things like LU, Cholesky, QR, eigensolvers, things like that. What I've highlighted here are the multi-GPU, multi-node aspects, so lower is better here. The gray bar is a community library; you can see it scales up here to about 1024 GPUs.
With the green bar you can see we've taken our cuSOLVERMp and not only improved performance but improved the scalability; out here I'm showing 4,096 GPUs on Summit. So this means, if you have large solvers that you might have used in the past, like ScaLAPACK, now we can support scaling up to all of the GPUs you have available, automatically. And we can do the same thing with FFTs. Here I'm showing where we're scaling up the problem size as we increase the number of GPUs; you can see that reaching 4,000 GPUs is a very large 3D FFT, and performance is great. We support 2D and 3D with both slab and pencil decompositions. We greatly prefer slab because, as you can see in the picture here, it gives a lot more work to the GPU, but we do support pencils as well, as well as having functions to convert between these.
So that's it for the libraries; there are several talks available on NVIDIA's website that go into greater detail about more of our libraries. To talk about the standard languages, I'm going to highlight both C++ and Fortran here.
So C++ has been, since C++17, a parallel language: C++17 introduced the parallel algorithms library. We already had a list of high-level algorithms in the C++ standard library, and C++17 added the idea of providing an execution policy, so I can say "go run this sequentially" or "hey, this is something that can actually safely be run in parallel." So it enables you to exploit both your threaded parallelism and vector concurrency. C++17 also made some guarantees on forward progress to avoid deadlock, and clarifications to the memory model to avoid race conditions. The reason I highlight these is that, at the same time, we were building those features into our hardware.
There are new features coming in C++23 that we're excited about and that we're beginning to preview in our compiler, and we're already working on C++26; later this year we will release a prototype of a feature that's not even expected in C++ until C++26. So we're very excited about making sure that C++ is a mature language for parallelism and concurrency, because that's like the tide raising all of the boats, making these features available everywhere.
One application I'll highlight is a mini-app from Lawrence Livermore called LULESH. It's a hydrodynamics mini-app, and the baseline code is C++ with OpenMP. This is one example function, and you can see this is fairly typical OpenMP: I'm supporting my CPU threads, I have work sharing of my work here, and I have an ifdef at the top to handle sequential consistency.
If we look at this code on the right, this is the exact same function, doing the exact same thing, but written using a C++ standard algorithm. You can see the code is much more compact, which should make it easier to maintain long term; it's completely ISO compliant, which means it's portable to every ISO C++ compiler; and, on top of all of that, it actually turns out to be faster too. So here I'm showing three compilers: the Intel compiler, the GNU compiler, and nvc++.
This is running on an AMD EPYC CPU; I don't recall if these are the same CPUs as on Perlmutter, but they're a similar family. You can see the performance of the OpenMP code across these compilers is comparable; if I take the time to tune all of my environment variables, I can get them even closer together, but this is the default setting.
If I look at the code that's just ISO C++, you can see across the board the code gets faster on the same CPU. And why is that? Well, I think there are some advantages the compiler has: by staying in a single, simple programming model, it has a more complete understanding of the code and can optimize better. But I think, as well, there are some performance inefficiencies in the OpenMP code that could bring it down.
Lastly, by changing one compiler flag, I can take this C++ code, with no extensions and no directives, and run it on the GPU as well. And so you can see here I'm taking the exact same pure C++ code across three different compilers and two different hardware platforms.
So LULESH is a mini-app; MAIA is a full application written at RWTH Aachen University. It's about half a million lines of code, written over the course of a long period of time, so this is not something that could be rewritten overnight.
They did, however, go to some of their individual solvers and replace the OpenMP with just straight C++ parallelism. Down here are the results for the lattice Boltzmann solver, and you can see the OpenMP and the ISO C++ are comparable in performance.
In fact, this is after they fixed a significant OpenMP performance bug that was initially in the code. And taking this code unmodified, they can run it on all of the GPUs on their node, or, using MPI, scale it up to their full system scale, so they're very excited about these results and have now begun to work through their other numerical methods to make MPI plus ISO C++ their standard going forward.
We're doing very similar things in Fortran. Fortran is a very widely used language, within the DOE in particular but also around the world, and it has a history of parallelism as well. There are three ways to do parallel programming in Fortran, and we support two of them. First are the array math intrinsics, so these are things like calling matmul or reshape on arrays; there's lots of parallelism deeply embedded in that that we can take advantage of. Second is the do concurrent loop, which is something that was added in Fortran 2008.
It was extended in 2018, and it's being extended again in 2023; we actually already support the feature that adds reductions to do concurrent loops. So you can write all of your data-parallel loops now using do concurrent, without the need for any compiler directives at all.
Lastly, there are coarrays. Coarrays can be thought of as an alternative to MPI. We don't currently support them in our compilers; we would like to eventually, but we don't have that today. One application I'll highlight for Fortran is miniWeather. This was developed at Oak Ridge National Lab. It is a teaching code, but it's used as a part of the SPEChpc benchmark suite; there's an OpenMP version, there's an OpenACC version, and there are even versions written in various C++ frameworks as well.
In terms of the code, if you are a Fortran programmer, this code will look very easy to understand. As a matter of fact, the bulk of this code is exactly the same as the OpenACC and OpenMP versions; the difference is I replaced a triply nested do loop with a do concurrent. The compiler can take this code and build it for CPU threads.
As you see here, the performance is on par with OpenMP; OpenMP actually handled the thread affinity slightly better, so the do concurrent is ever so slightly slower in this case. Or it can run on the GPU, and it's comparable with the OpenACC. So we get very good performance and portability using do concurrent.
Another application is POT3D. This is also in the SPEChpc benchmark suite, and the reason I like to highlight this one: they had an existing OpenACC code (here, lower is better), and they wanted to see how far they could get without any directives at all. So they rewrote the OpenACC code using do concurrent, and what they found was that they were about 10 percent slower than their OpenACC code, which was actually fairly acceptable to them, but they wanted to understand why, and so they dug in and determined:
well, part of the reason we can run Fortran do concurrent on the GPU is because we can use something called CUDA managed memory, which, when you allocate your data, makes it visible to both the CPU and the GPU, and under the hood it will migrate according to usage. And what they eventually found is that if they used OpenACC with managed memory, they got the same performance as the do concurrent.
So clearly this 10 percent performance loss was due to the automatic migration of data, and so they put back in some minimal OpenACC to handle and optimize the data movement, and their performance was then the same. So here they stripped away some 400 lines of directives, or something like that, and wrote all of the parallelism in Fortran and the data movement in directives.
And one more thing I'll highlight here as a recent enhancement is GAMESS. GAMESS, which you've probably heard of, is a computational chemistry application that's very widely used; it's been developed for some 40 years. Their baseline code is MPI plus OpenMP, and they have that on the CPU, and they also have that for the GPU using the offloading directives. A student at Iowa State rewrote this portion, took out the directives, and put in do concurrent, and you can see the results here.
The result was actually a pretty dramatic performance improvement over the OpenMP, and she did go through and try to further optimize the OpenMP; this was after the optimizations. So why is this? Well, OpenMP is pretty strict in what the compiler can and should do to your code when encountering these directives. Do concurrent is more descriptive: it gives the compiler a whole lot of freedom in what it can do, and so it can make smarter optimization decisions. There's a paper coming on this.
It's not available yet, but look for it next year to show these results. So, with about four minutes left, let me come to some conclusions. First, the HPC SDK is a complete and portable toolkit for HPC developers. It's available on Perlmutter via a module load, or on your own machine via a download. I encourage you to check it out, because it is portable across all of the major architectures. Second, NVIDIA supports a wide range of programming models.
I think I would actually say that we have the greatest choice of composable and mature programming models of any HPC vendor. And, lastly, I can only scratch the surface in 30 minutes here, so I picked out my four favorite talks from this past GTC and linked to them here. Full disclosure: this last one was mine, and my voice is a lot less painful to listen to in that one. So I encourage you to go back and watch these, and with that I have about four minutes left if there are any questions you'd like me to answer.
B: Thank you. Thank you, Jeff. We have a couple of questions in the chat. The first question is from Josh, and the question is: do any NVIDIA SDKs have licensing concerns or considerations that need to be noted when deploying on non-NVIDIA platforms? I did notice that on your previous slide you had mentioned NVIDIA GPU and AMD CPU.
A: Yeah, so the HPC SDK is freely available. There is a EULA; the EULA does have a carve-out so that you can distribute the necessary runtime libraries to make it possible to build and distribute your software appropriately. You could certainly, if all you have is an Intel CPU, go off and download this, and you can still build with it; if all you have is an Arm CPU, go off and build with it. So this is freely available, but yes, of course, go read the EULA. We tried to be as generous as possible: there's no charge for it, there's no renewing a license, you download it off our website, and it's free to use.
B: Thank you. Another question is: what would be the best place to suggest or request new features in CUDA math libraries?
A: For the math libraries, I would say your best bet is to float them up through the NERSC help desk; they have a direct line to our engineering and our product managers to make that happen. But you could also post on our forums as well.
B: Another question is: what is "Mp" in cuSOLVERMp and cuFFTMp? Are these available through the previously existing cuSOLVER and cuFFT APIs?
A: So "Mp", I believe, stands for multi-processor. There's cuSOLVER, which is the kind of single-GPU solver; Mp is the multi-processor version. The API does translate; I think you have to add the two letters to it, but otherwise the API is very familiar. And the reason for naming them differently is that it does make it a little bit easier when all you need is the single node, or all you need is the multi-node; it helps with the linking and distribution.
B: …
A: I believe it does. So, actually, in that POT3D code, one thing I did gloss over is that they actually did not go down to zero compiler directives; they had to leave in three: one for picking which GPU on the node they wanted, and two atomics, which were for array reductions. We have since implemented that in our compiler, so to the best of my knowledge it's now both supported in the standard and in our compiler.
B: …
A: Great question. So there's a paper coming at Supercomputing by the folks at the University of Bristol, Simon McIntosh-Smith's group, that goes into a study of this, and I believe they were able to build for Intel GPUs using Intel's compiler stack and a very small shim. The difference being that we chose to use the standard execution policy, and Intel chose to use one unique to themselves, a DPC++ execution policy. So they had a six-line-of-code shim to translate between the two, but otherwise the rest of the code was portable. I'm not aware of one for AMD GPUs at the moment, and we are, you know, in discussions with the community to support this better.