From YouTube: 2. Nvidia HPC Software -- Jeff Larkin
Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
A: Hi everybody. For those who don't know me, my name is Jeff Larkin, and I am an architect on NVIDIA's HPC software team. I want to give you an overview today of the topics that are going to be covered in greater detail through the rest of today and tomorrow: a high-level introduction to our platform, our vision for how you should program that platform, and the software that we provide to enable you.
A: To start (and I'm going to turn my video off, so I don't eat my bandwidth), I want to discuss what the NVIDIA platform is. People frequently think of us as the GPU company, because we've certainly been innovators in the GPU space from the beginning, and that has been our bread and butter over the lifetime of the company. But in the last several years we have expanded that into a broader platform.
A: First, through the acquisition of Mellanox a little over a year and a half ago, which gave us an InfiniBand networking portion of the company. And then last year Jensen, our CEO, announced the Grace CPU, so at an as-yet-unannounced time in the future we will have an Arm-based CPU as well.
A: So we need to provide you with a coherent vision for how to program all of this: not just the GPU, but also the CPU, and also the network. First I want to point out the foundation down at the bottom, our accelerated libraries. I think of these libraries as the foundation on which we build the rest of this.
A: We also provide you with communication libraries: MPI, of course, but also our collective communications library, NCCL, and our SHMEM implementation, NVSHMEM. We also provide support for various data analytics and AI packages, and then most recently we've added support for accelerating quantum simulation.
A: So this is the basis on which we build the rest of that strategy. For many people, the assumption is that we at NVIDIA believe all developers should be writing the code on the right, which is CUDA C++ (and we also have support for CUDA Fortran as well).
A: CUDA came out at a time when programming GPUs was very difficult, and we needed to provide the proper abstractions to be able to do general-purpose compute on them. CUDA remains the place where we innovate: if we introduce new features in our hardware, you can expect them to come to CUDA C++ and CUDA Fortran first. That is where we'll expose those features, and various innovations in software.
A: However, it is not the only approach to programming our platform. In fact, our goal is that most developers will come to our platform with the approaches on the far left, which we call accelerated standard languages. You'll hear some other terms here: sometimes you'll hear us say "stdpar," short for standard parallelism, but I like the term "standard language parallelism," because it encapsulates the fact that these are standard programming languages: ISO C++ here, ISO Fortran here, and then also Python, which doesn't have an ISO standard behind it.
A: So here I have a std::transform; I'm asking for it to execute in parallel, and then I'm providing the implementation. And the same thing with Fortran: do concurrent has been in the language since 2008, and we have supported it across CPUs and GPUs since 2020; it is actually supported in other compilers as well. It expresses that the iterations of the loop can be run in any order, so we can parallelize it for multicore CPUs and for GPUs. The more recent addition is cuNumeric, which provides a similar interface for Python, and I'll go into more detail on that in a few slides.
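To make the C++ side of that concrete, here is a minimal sketch of the pattern he is describing (illustrative only; the array sizes and the lambda are invented, not taken from the slides). With nvc++ the same source can target a multicore CPU (-stdpar=multicore) or a GPU (-stdpar=gpu):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<double> x(1 << 20, 1.0), y(1 << 20);

    // Ask the implementation to run the transform in parallel.
    // Compiled with nvc++ -stdpar=gpu, this can be offloaded to an
    // NVIDIA GPU; with -stdpar=multicore it targets CPU threads.
    std::transform(std::execution::par, x.begin(), x.end(), y.begin(),
                   [](double v) { return 2.0 * v + 1.0; });
    return 0;
}
```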
A: By focusing your efforts on these standard languages, you can come to our platform, or any other platform, with code that is already parallel. Will it be the highest-performing code that you can achieve on our GPUs? Most likely not. If you want the absolute best possible performance, exploiting everything in the hardware, you'll still want to write it in CUDA C++ or CUDA Fortran. But your baseline will already run out of the box on day one using these standard languages.
A: Now, there is a functionality gap between these approaches, and we use directives to span that gap. As a for instance: in these approaches on the left there's no way to represent your data transfers, so we've built them on top of CUDA Unified Memory. But if you wanted to take control and further optimize the code, you can incrementally optimize it using something like OpenACC or OpenMP. We still believe the best place to start is with the standard languages.
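As a sketch of what that incremental step can look like (illustrative only; the function and array names are invented, this is not code from the talk), an OpenACC data region wrapped around an ordinary loop gives you the explicit control of transfers that the standard-language version leaves to CUDA Unified Memory:

```cpp
// Explicit data management plus loop parallelization with OpenACC.
void scale(const double* x, double* y, int n) {
    // Copy x to the device on entry, copy y back on exit.
    #pragma acc data copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = 2.0 * x[i];
    }
}
```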
A: So what do we provide for you to program to this vision? The product we have is the HPC SDK; you will sometimes see the acronym NVHPC as well.
A: We provide compilers for C, C++, and Fortran, and then we also ship the traditional CUDA compiler, nvcc. And then we package up a ton of libraries to make you successful. Here I've listed libcu++, which is a subset of the C++ standard library for GPUs; Thrust, which is a higher-level template library for writing code that's portable across CPUs and GPUs; and CUB, our cooperative-primitives library, which provides building blocks for other algorithms.
A: We also ship our math libraries. Some you may be familiar with are cuBLAS, cuSOLVER, and cuFFT; this is not an exhaustive list, but these are some of the most important ones. And lastly, we provide communication libraries (MPI, NVSHMEM, and NCCL), and then, because you need to be able to debug and understand the performance of all this, we also provide you with debuggers and profilers.
A: You don't even need to have a GPU in the machine. It's available for free from our website; it's available in containers via NVIDIA NGC; also through Spack; and even on the various cloud providers, where we provide AMIs. We typically release somewhere around seven or eight times per year; our last release was actually just last week, 22.1.
A: Now, some people are surprised to learn that our HPC compilers are not just GPU compilers. In fact, we want them to be best-in-class CPU compilers as well, and so we provide a high degree of CPU optimization, including parallelization and vectorization, across all of the supported architectures, which are x86, Power, and Arm server-class CPUs.
A: And we recognize that many of you don't necessarily consider yourselves computer scientists or software engineers; most of you are in fact scientists first, and you've learned to program in order to enable that science through simulation. So many of you will have applications you've developed on other platforms; they're running along in that lane and you're making good progress, but you recognize the performance benefit of being able to run on GPUs. Standard language parallelism provides you with an on-ramp to get to the GPUs: code that can continue to run natively on CPUs, but also on our GPUs.
A: Then, once you're on the GPUs, you can begin to look at other optimizations, such as directives or CUDA C++ and CUDA Fortran, in order to better optimize your code and bring it farther and farther into the fast lane. So standard language parallelism provides you with a means to get onto all parallel platforms very quickly. And this is not a decision that we made overnight, to strive toward standard language parallelism; it's actually an investment we've made over the past decade.
A: We've been participating in the various ISO committees for more than the past decade, and we participate not just with the national labs but also in collaboration with our competitors in those committees.
A: So we began to support standard language parallelism in our software stack in 2020, but that seed was actually planted more than a decade earlier, and it has only been through continued engagement with the standards committees that we've developed this. We believe that by focusing on bringing concurrency and parallelism to all of these languages, standard language parallelism becomes the tide that raises all of the boats. It is available to you everywhere.
A: I'll also point out these multi-dimensional array abstractions, solely because they are a collaboration that took place between NVIDIA and the national labs, as well as the rest of the ISO committee. And we continue to drive forward with new features.
A: The first thing I'll point out is that we use the nvc++ compiler to accomplish the acceleration of these parallel algorithms, so that's the compiler you will use, whether directly or through the Cray compiler wrappers, and it provides support not only for coarse-grained parallelism but also for vector concurrency.
A: Some of this comes about through enhancements to the programming model and the compilers, but also through our hardware. Our last two hardware generations have enabled forward-progress guarantees that allow us to support the C++ execution model on our accelerators, and also the C++ memory model. C++20 enhanced the synchronization libraries, and you can see that we have actually provided support for many of these.
A: These primitives are in the libcu++ library, and we are continuing to drive forward with new features such as the senders/receivers proposal, mdspan and mdarray, range-based parallel algorithms, and so on. So this is an ongoing process. To highlight some of the successes we've had, first we'll start with a mini-app called LULESH.
A: If I show you just a snippet of the code (this is one representative function), you can see that in their baseline OpenMP version, #ifdefs are used to support serial or parallel execution.
A: You can see the introduction of a parallel pragma here to spawn the CPU threads, and an omp for here to work-share this loop. This is fairly typical OpenMP code, with the addition of these pragmas and #ifdefs, and it's the code that they run in production.
A: As a matter of fact, we worked with them to restructure the code to use solely standard C++, and so this function on the left gets transformed into this function on the right. You may have to stare at it a little bit to believe me.
A: But these are actually accomplishing the exact same thing, and you can see the code is a lot more compact and easier to read and maintain. Here we're using a std::transform algorithm, we're telling the compiler that it can execute this code in parallel, and inside of it you can see first the transformation, which is this top loop, and then the reduction, which is this bottom loop.
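The actual LULESH source isn't reproduced here, but the transform-plus-reduction shape he describes maps naturally onto std::transform_reduce. This is a hedged sketch of that pattern with invented names (elem_length, elem_speed), not the real function:

```cpp
#include <algorithm>
#include <execution>
#include <limits>
#include <numeric>
#include <vector>

// Compute a per-element quantity and reduce it to a single minimum,
// the kind of combined transform+reduction the rewritten function uses.
double min_time_constraint(const std::vector<double>& elem_length,
                           const std::vector<double>& elem_speed) {
    return std::transform_reduce(
        std::execution::par,                      // run in parallel
        elem_length.begin(), elem_length.end(),   // first input range
        elem_speed.begin(),                       // second input range
        std::numeric_limits<double>::max(),       // identity for min
        [](double a, double b) { return std::min(a, b); },   // reduce
        [](double len, double speed) { return len / speed; } // transform
    );
}
```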
A: So you can see this makes the code a lot easier to read, but because it is fully ISO standard, it's also portable to any compiler that supports ISO C++; here's a list of several that we've tried it in. And in addition to all of those benefits, I'll also point out that it's faster, too.
A: So here is the baseline code that I showed, and again, that was just one function out of the entire code. You can see that building with GCC on a 64-core AMD EPYC server, that is our baseline performance.
A: OpenMP has various inefficiencies and overheads in the programming model that we were able to eliminate using standard C++. If you switch to nvc++, you can see that improves to a 2x performance improvement, again on the same CPU and the same version of the code. But what's really powerful with this is that you can change one compiler flag, take that same standard ISO C++ code, and now you're running on an NVIDIA GPU as well.
A: Another code I'll highlight is called MAIA; it comes from, I believe, RWTH Aachen University. We worked with them on a portion of their code that uses the lattice Boltzmann method (this is the fluid-flow portion), and we accelerated it using just standard C++. So the code on the left again becomes something like the code on the right, and you can see once again that the code is more compact and completely standard-compliant.
A: In terms of performance, we actually saw a pretty sizable improvement there as well. This one I would call not typical, in that we achieved a fairly large performance improvement in the standard C++ code because of various things that were eliminated during the rewrite from OpenMP to standard C++; so that is a fairly large and atypical improvement. But what is typical here is that we can then take that same code and run it on the GPU as well.
A: They had a goal of being able to run on a variety of platforms using no external APIs, so they wrote their code using standard C++ (the parallel algorithms became available in C++17), and these were the results: they were able to run across their entire 40-core Xeon server.
A: But by rebuilding that code, changing the compiler options to target the GPUs, they could run on the GPUs as well. And I would point you to two talks: this first one is from GTC Spring, so last March, and this last one is from GTC Fall. I'm sorry, I should have updated this slide with the link, because the speaker goes into a great amount of detail about what was involved in this transformation, and he actually shows results on more than just the A100.
A: He shows several of their servers and GPUs, so I encourage you to go look up those talks; he really goes into some great detail. But to quote him, they viewed this as a paradigm shift for cross-platform CPU/GPU programming: that they could do this with solely ISO C++. Now, Fortran is still a very important language in high performance computing, and within your labs I know there are a lot of Fortran codes as well. So, beginning with the 2020 versions of our compilers...
A: ...we began accelerating various parts of the Fortran language automatically as well. To begin with, we were able to accelerate the array math intrinsics inside the runtime library: looking at things like a MATMUL and recognizing that it can be mapped to our accelerated math libraries automatically, again using CUDA Unified Memory.
A: We expanded that support six months later, in the November release, to cover do concurrent. So now you can write your loops using this Fortran 2008 feature, and the compiler can thread-parallelize them on your CPU or automatically offload them to your GPU as well.
A: We do intend to eventually support coarrays. I don't have a comment on when that will be coming, but it is something that we are actively looking at as well. And then there's a new feature that has actually been approved for Fortran (though the version of Fortran it will appear in has not yet been released), which is reductions on do concurrent loops.
A: If you're not familiar with reductions: if you do something like a summation, or find the min or the max within a loop, you have many values that you need to reduce down to just one, whether it's the sum or the min or the max or whatever. The Fortran specification did not have a way to express this.
A: You actually had to write a do concurrent loop and then use one of the math intrinsics to accomplish the reduction. But this next version of Fortran has support for a REDUCE clause, and we actually already have preview support for it in the compiler. So you can write a do concurrent with a reduction since nvfortran 21.11; recognize that for this event we're using 21.9, which doesn't have it yet, but the very next version that becomes available to you will have that support.
A: So this is what it looks like. You can see a fairly simple routine for a Laplace operator: here we are doing a stencil operation that would normally be written as an i loop and a j loop, and you can see we write a single do concurrent loop and say, iterate across all of i and all of j. The reason you combine it all together is so you can run it on your GPUs, and the performance here is actually comparable to an OpenACC version of the code.
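The slide's example is Fortran, but the same idea (collapsing the i and j loops into a single parallel iteration space) can be sketched with C++ stdpar as well. This is illustrative only; the function and array names are invented:

```cpp
#include <algorithm>
#include <execution>
#include <ranges>
#include <vector>

// 5-point Laplace stencil over the interior of an n x n grid, flattened
// into one parallel loop: the C++ analogue of the single DO CONCURRENT
// loop over both i and j shown on the slide.
void laplace(std::vector<double>& out, const std::vector<double>& in, int n) {
    const double* a = in.data();  // raw pointers capture cleanly for offload
    double* o = out.data();
    auto idx = std::views::iota(0, (n - 2) * (n - 2));
    std::for_each(std::execution::par, idx.begin(), idx.end(), [=](int k) {
        int i = k / (n - 2) + 1;  // recover the 2-D indices
        int j = k % (n - 2) + 1;
        o[i * n + j] = a[(i - 1) * n + j] + a[(i + 1) * n + j]
                     + a[i * n + j - 1] + a[i * n + j + 1]
                     - 4.0 * a[i * n + j];
    });
}
```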
A: I mentioned the math intrinsics, so let me just demonstrate here. I hope none of you have a naive matrix multiplication like this in your code; but if you do, one thing you can do is replace it with a MATMUL operation like this, and you can see, not surprisingly, that we get a pretty substantial performance benefit, because we're able to map it to our accelerated math libraries. And that's not limited to just simple things like matrix multiplication; you can see that we support a pretty broad range of these intrinsics.
A: So our goal with accelerating the standard languages is that you can take the same code, whether it's in C++ or in Fortran or even in Python, as I'll discuss in just a moment, and be able to run it across your CPUs or your GPUs. You can see here that for the compiled languages I simply had to change one compiler flag in order to retarget the code; and then there's the Python code.
A: Python has increasingly become an important language in high performance computing. (And, I'm sorry, somebody needs to go on mute; I'm hearing some typing noise.) The package in the PyData ecosystem that is used most often is NumPy. It is the common approach to writing numerical algorithms in Python, and it dates back to right around the year 2000. You can see this code down here that performs an A plus A-transpose operation.
A: That was easy during the time period when you had single-core CPUs, but it needed to be expanded outward, of course, as CPUs went multi-core. So you can see we went from creating a single small matrix to then needing to distribute that matrix across the multiple cores; here we're using something called Dask to do that.
A: Dask was eventually extended to scale not just across CPU threads but even across clusters of several nodes, as you can see here. But eventually we needed to begin to provide GPU acceleration, and we really wanted to return to the simplicity of the first approach. Our answer to that is a package called cuNumeric. This was announced at GTC Fall, and it aims to become a drop-in replacement for NumPy.
A: So you can see here that I'm not thinking about all of the GPUs on my system, and I'm not thinking about all of the nodes on my system; I'm thinking about the size of the work I need to do, here 160,000 elements, and the cuNumeric package is able to distribute this data structure across your GPUs, or even across many nodes, automatically. And so now I'm able to simply write the same A plus A-transpose, and it will distribute both the data and the work.
A: This code is productive because it has simple sequential semantics: there's no visible synchronization, no explicit distribution of work, no partitioning of data, and you can actually compose it well with the rest of your libraries and your program. But it would also be nice to get high performance, and to transparently run this anywhere you need, leveraging all the available hardware, whether it's a single GPU, many GPUs, or even a whole system. So we've been building a systems architecture called Legate, which is this layer here. Legate provides a way to handle the distribution of data structures and work transparently, and we'd like to extend that to the broader Python ecosystem, but to start we're focusing on NumPy through the cuNumeric package. And so with cuNumeric we were able to take an existing NumPy application, plug cuNumeric in place of NumPy for our data structures, and actually run this code across an entire machine full of GPUs.
A: Now, if all I wanted to do was run on one GPU, there's already a package called CuPy that could have handled that. What's exciting about cuNumeric is not just that you can run on a GPU: here I'm scaling it out to (I'm showing here) 1,024 GPUs, and that's pretty exciting, because we're able to do it with essentially no code changes.
A: Here's another code that we've demonstrated; this came out of the scikit-image library. What's notable here is that this function didn't change at all: what we did was replace the NumPy arrays with cuNumeric arrays, and we've run it. And once again (this is actually weak scaling) you can see that the throughput increases as we increase the number of GPUs.
A: So that's Python, and we would really like to see more people developing with this. I will say that cuNumeric is not NumPy-complete at this point; it is still considered alpha software. But it is really critical to our standard language parallelism strategy, and we would love to see you trying it. So now let me shift focus toward NVIDIA's performance libraries. The first thing I want to give you is the goals of our libraries: first, we want them to be seamless to use.
A: So, as we add new features to our hardware, you don't have to worry about how to make use of those features; we can build them into the libraries, and you reap the benefits.
A: Most of the names are fairly self-explanatory about what they provide. And I will point out that there's a lot of excitement about the Tensor Cores that we provide in our GPUs, beginning with the V100 and with expanded support in the A100s as well. The point of this slide is to demonstrate that for many of the operations in the libraries, you don't need to enable the Tensor Core support: they will be used for you under the hood, and you can reap the benefits of that automatically.
A: So, for instance, the cuBLAS library: cuBLAS provides a full implementation of the BLAS, plus a variety of extensions such as mixed- or lower-precision and batched APIs, and these are actually used in a lot of applications.
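As a minimal illustration of what calling into cuBLAS looks like (a sketch, not code from the talk; error handling is omitted and the matrix contents are invented):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Multiply two n x n single-precision matrices with cuBLAS.
// The talk's point: library calls like this pick up hardware features
// (e.g., Tensor Cores for the reduced-precision GEMM variants) without
// the caller having to opt in.
void gemm(int n) {
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C (column-major layout)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```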
A: And what I would like to show here is that as you utilize these libraries, we're able to take advantage of the hardware features for you. So here you can see that on GV100 we're taking advantage of the 16-bit floating-point performance using our Tensor Cores, and you can see that, automatically, out of the box, you're able to take advantage of that feature on A100 as well. Or, if you have need for different floating-point types, such as TF32, you can utilize that as well and get very, very good performance.
A: Another library of interest would be cuSOLVER. cuSOLVER provides a variety of linear solvers (LU, Cholesky, and QR), as well as symmetric and generalized eigensolvers, and we do provide support for iterative-refinement solvers, which allow you to utilize reduced precision to get full-precision results.
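For a feel of the cuSOLVER dense API, here is a hedged sketch of an LU factorization (getrf) on a matrix already resident on the device; error handling is omitted and the names are invented:

```cpp
#include <cusolverDn.h>
#include <cuda_runtime.h>

// LU-factorize an n x n double matrix dA (device pointer, column-major).
void lu_factor(double* dA, int n) {
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    // Query the workspace size the factorization needs.
    int lwork = 0;
    cusolverDnDgetrf_bufferSize(handle, n, n, dA, n, &lwork);

    double* dWork;  int* dPiv;  int* dInfo;
    cudaMalloc(&dWork, lwork * sizeof(double));
    cudaMalloc(&dPiv, n * sizeof(int));
    cudaMalloc(&dInfo, sizeof(int));

    // P * A = L * U; pivots and the status flag come back on the device.
    cusolverDnDgetrf(handle, n, n, dA, n, dWork, dPiv, dInfo);

    cudaFree(dWork); cudaFree(dPiv); cudaFree(dInfo);
    cusolverDnDestroy(handle);
}
```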
A: So you can actually take advantage, under the hood, of things like the 16-bit Tensor Core units to get very high-speed solvers, but still expect the full 64-bit precision in your results; and we provide support for multiple GPUs as well. I'll point out the automatic use of the MMA instructions for driving the Tensor Cores under the hood, so you don't need to opt into that.
A: And you can see you actually get a pretty significant performance speedup going from V100 to A100; here we're showing 2.3x, and in more recent versions it's actually higher than that. (You can ignore this part at the bottom; it just did not get stripped out of the slide.)
A: For cuSPARSE, if you're doing sparse linear algebra, you can see we have a range of capabilities available to you, and again we get very high performance. This is the speedup using our new generalized version of the solvers versus our previous solvers.
A: You can see more details about that in the release notes. For cuFFT, we do provide support for 1D, 2D, and 3D FFTs, including support for multiple GPUs; across a variety of problem sizes you can see results for one, two, four, and eight GPUs here. And then, more recently, there's support for generalized tensor contractions and reductions using the cuTENSOR library.
A: Now I want to highlight that we've begun supporting multi-node operation within several of our math libraries. One I'll point out here is cuSOLVERMp, which is able to scale not just across the GPUs in a single node but across multiple nodes, up to full system size. That support began in 21.11 and has been available since then; once again, you're running with 21.9 here.
A: For the sake of time, I'll keep skipping ahead. I want to point out that we have a rich set of what I would call core compute libraries for C++. libcu++ (libcudacxx) is a standard-library implementation that you can utilize on the GPU; Thrust is a parallel algorithms library, and the Thrust project very heavily influenced what eventually went into the C++ standard as well; and then there's the cooperative-primitives library, CUB.
A: So Thrust has very high-level classes like the host and device vectors, high-level algorithms like transform, fill, and copy, and then various iterators that you can use; and actually, in some of our early C++ examples, we were using those iterators as well. And then, if you need to do various collective communication patterns within your kernels, you can use CUB, which exposes warp-wide, block-wide, and device-wide primitives.
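A minimal Thrust sketch (illustrative; the functor and sizes are invented) showing the container-and-algorithm style just described:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>

// Functor usable on both host and device.
struct doubler {
    __host__ __device__ float operator()(float v) const { return 2.0f * v; }
};

int main() {
    // Device-resident containers; Thrust manages allocation and copies.
    thrust::device_vector<float> x(1 << 20), y(1 << 20);
    thrust::sequence(x.begin(), x.end());  // x = 0, 1, 2, ...

    // y[i] = 2 * x[i], executed on the GPU.
    thrust::transform(x.begin(), x.end(), y.begin(), doubler{});
    return 0;
}
```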
A: And here I'll point out libcu++. It is in addition to your standard library, which comes with your host compiler; you would still include any of the normal headers, say #include <vector> or #include <atomic>, to get the normal host-side standard template library. libcu++ then provides two interfaces: one that is strictly standards-compliant, a subset of the standard library under the namespace cuda::std; and then we do provide some extensions as well, which live under the cuda namespace. One such extension is the atomic, and I can point you to a presentation that covers the details of that as well.
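A small sketch of the libcu++ atomic just mentioned (illustrative only; the kernel and names are invented, and it assumes a Volta-class or newer GPU, where the C++ memory-model support described earlier applies):

```cpp
#include <cuda/std/atomic>

// A device-wide counter using the heterogeneous atomic from libcu++.
// cuda::std::atomic is the strictly standard-conforming interface;
// the cuda::atomic extension additionally takes a thread-scope
// template argument (e.g., cuda::thread_scope_block).
__device__ cuda::std::atomic<int> counter{0};

__global__ void count_matches(const int* data, int n, int target) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] == target)
        counter.fetch_add(1, cuda::std::memory_order_relaxed);
}
```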
A: Jumping ahead to the communication libraries: just like with the math libraries, we hope to provide you with the right set of communication libraries, optimized for the entire system. That also includes low-latency PGAS (partitioned global address space) programming, with things like NVSHMEM, and then optimized collectives on your system.
A: We have several libraries that provide this within the HPC SDK. HPC-X provides you with a version of Open MPI; we also have support for OpenSHMEM in that as well, plus UCX and SHARP, which are technologies that came from Mellanox. And I'll point out that NVSHMEM is a technology we support for partitioned-global-address-space messaging initiated from the CPU or the GPU. In typical MPI you'd have something like an MPI_Isend and then wait for the request to complete, and you can see that the data has to move...
A: ...through the CPU, out to the network, and back to the GPU. With NVSHMEM, the GPU can actually initiate all of this if you want: you can put messages through the network onto other GPUs, or even get data off of other GPUs, and it also interoperates well with CUDA streams. So this is a programming model that I encourage you to take a look at, and there are a variety of trainings available online for that.
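For flavor, here is a heavily hedged, minimal host-initiated NVSHMEM sketch (illustrative only; it assumes an NVSHMEM installation and its launcher, and the buffer name is invented):

```cpp
#include <nvshmem.h>

int main() {
    nvshmem_init();
    int me = nvshmem_my_pe();
    int npes = nvshmem_n_pes();

    // Symmetric allocation: every PE (typically one GPU each) gets a
    // buffer at the same symmetric address.
    int* buf = (int*)nvshmem_malloc(sizeof(int));

    // Each PE writes its rank into its right neighbor's buffer:
    // one-sided communication, no matching receive required.
    nvshmem_int_p(buf, me, (me + 1) % npes);
    nvshmem_barrier_all();

    nvshmem_free(buf);
    nvshmem_finalize();
    return 0;
}
```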
A: Approaching the end here, I'll point out our developer tools; I know Max has some talks about these coming up. We provide inside our download support for cuda-gdb, which works just like gdb but understands our compilers as well. More recently, we've begun shipping an extension to Visual Studio (excuse me, Visual Studio Code), and then there's also the Nsight Visual Studio Edition as well.
A: We also have profilers. Nsight Systems is a profiler for getting high-level information about how your code is running: are your compute and your data movement overlapped, and things like that. When you want to drill down into individual kernels, we provide Nsight Compute, and you can see here a screenshot showing the roofline analysis of a particular code. And then there's support for NVTX, which is something you will learn about later today.
A: I believe that's the NVIDIA Tools Extension, which allows you to annotate your code, so that you can look at a profile and say: okay, this is time spent in my solver, or this is time spent at other points of the code.
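A tiny sketch of that annotation style (illustrative; the range names and the surrounding function are invented):

```cpp
#include <nvToolsExt.h>

// Hypothetical solver step, annotated so each phase shows up as a
// named range on the Nsight Systems timeline.
void timestep() {
    nvtxRangePushA("solver");
    // ... solver work ...
    nvtxRangePop();

    nvtxRangePushA("io");
    // ... output work ...
    nvtxRangePop();
}
```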
A: Compute Sanitizer is a way to check, if something crashes, why: things like accessing an array out of bounds on the GPU, or using shared memory in an unsafe way. If you're familiar with cuda-memcheck, Compute Sanitizer is the next-generation cuda-memcheck. And lastly, we provide integrations for various IDEs.
A: We do ship Nsight Eclipse Edition and Nsight Visual Studio Edition, but what you may not be aware of is that we also provide support for Visual Studio Code, and that's fairly recent.
A: So, Nsight Systems is our system-level profiler that gives you, again, high-level information about how your program is doing. Nsight Compute is our lower-level tool for really understanding what is limiting your code: is it limited by compute, or is it limited by memory? And you can dig down as far as you want, even to the underlying assembly, if you really want to dig in and understand the performance of your code. Nsight Visual Studio Code Edition here, as I mentioned, is very new.
A: You can find out more information on how to get it here, or search the Visual Studio Code Marketplace, and you can see it provides ways to inspect your variables, look at your registers, and really understand and debug your code directly within your IDE, including support via SSH. Now, I will point out from my experience that the Visual Studio Code remote SSH does not work on Summit, but it will work on x86 platforms.
A: Here's a little bit of an example of cuda-gdb and Compute Sanitizer as well, scanning your code. What I'll point out here is that memcheck is an operation that allows you to look for unsafe memory accesses, memory leaks, and out-of-bounds errors, things like that. racecheck is a tool for understanding, if you're using shared memory, whether you had unsafe memory access patterns to that shared memory.
A: initcheck is one for checking for reads from uninitialized memory, and then synccheck watches for thread synchronization issues. So, as you can see, Compute Sanitizer is quite a bit more advanced than what was available in cuda-memcheck.
A: So I did want to leave just a tiny bit of time for questions; I was not very successful, but I'll put this slide back up and point out the HPC SDK as your means of getting access to everything that I've shown. And I guess there are, you know, two or three minutes left where I can answer some questions.
B: Are there any questions? You could speak up, or type your question in the Slack channel, please.
C: I've noticed that NVIDIA (or maybe not NVIDIA specifically) uses some specific directives for GPU acceleration, like target teams, for instance, and I wonder whether, if you had used those, it would perhaps have made a difference.
A: So, I don't know whether there is a version of LULESH that supports the target offload directives. There may be; given that it came from Lawrence Livermore, there very likely is. But this is the baseline code that was provided to us, which is the CPU-threaded one. It would certainly be possible to write a target teams distribute here and here and offload these loops.
A: In that way, I would not expect the performance to be as good, because of various overheads associated with doing that; but functionally it would certainly be possible to write such a code. We believe that a better approach is to use the standard C++ rather than relying on OpenMP as an additional API.
D: Hi Jeff, this is Zanzi. I have a question about your cuSOLVERMp implementation, and also you have a cuFFTMp. For the cross-node communication, what kind of library did you use?
A: I believe it's based on libfabric; I would have to confirm that. I do know that it has been tested on both Perlmutter and Summit, so we do know that it works on those. I think it uses libfabric, where each of the vendors is able to implement their own layer underneath.
D: Also, I have another question. You have shown many performance results where your tools do much better than, say, OpenMP or some offload results. When you do this standard language parallelism implementation, did you still use CUDA or something else? I'm just curious how your performance is that much better than the other approaches.
A: So, in both of these cases, no CUDA C++ is used. If you were to rewrite it in CUDA C++ or CUDA Fortran, there are some optimizations you can accomplish there that can't be done in standard language parallelism. So I would actually expect that it would be possible to tune this even further. But of course the trade-off there is in portability.
A: If your goal is to write something that you can bring to new platforms right away and expect to run in parallel right out of the box, standard language parallelism provides that. And actually, a better slide to show you would be this one.
A: Each of these codes can run out of the box on multicore CPUs or on GPUs, but if my goal is the absolute best possible performance, I would write my code with this on the right. Now, the thing I should emphasize here (I should have emphasized it in the beginning) is that you don't have to choose one of these, and you're not wed to one approach: all of these approaches compose with each other. So I can start with, you know, Fortran do concurrent...
A: ...I can selectively optimize with directives if necessary, or, if I have a portion of my code that is so performance-critical that it absolutely needs to be written in CUDA, I can do that, and all of those work together. So this is not a pick-one-and-stick-with-it situation: this is your baseline starting point, and this on the right is your absolute best performance.
B: A few more minutes; if the time runs out, I will just ask you to answer the rest in Slack. So, one question from Philip Thomas is: what is the state of the CUDA Graphs feature, particularly with respect to standard programming for C++ and Fortran? Is this feature under active development?
A: So, CUDA Graphs absolutely continues to be in active development. Right now we don't have any defined interactions between standard language parallelism and CUDA Graphs.
A: That's something that we could explore in the future, but right now they are two separate approaches. For those of you who aren't familiar with CUDA Graphs, the basic introduction would be: if you have a series of data transfers and GPU kernels that you call repeatedly within your application, rather than going through your time-step loop issuing your memory requests and then kernel after kernel after kernel, each of which has various launch overheads...
A: ...you can capture all of those into a graph that you basically pre-compile and just relaunch over and over again on the GPU, and that takes away various overheads. So we don't currently have an interface to CUDA Graphs within the standard languages; it is something that does require you to use the specialized CUDA C++.
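For those unfamiliar, the stream-capture flow he describes looks roughly like this in CUDA C++ (a hedged sketch; the kernel and names are invented, and error checking is omitted):

```cpp
#include <cuda_runtime.h>

__global__ void step_kernel(float* data, int n);  // assumed kernel

// Capture a repeated sequence of kernel launches into a CUDA graph,
// then replay it each timestep with a single launch.
void run_with_graph(float* d_data, int n, int nsteps) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    // Work issued to the stream here is recorded, not executed.
    step_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    step_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    for (int t = 0; t < nsteps; ++t)
        cudaGraphLaunch(exec, stream);  // one launch replays the sequence
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}
```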
B: Thanks, Jeff. Here's another question in the Zoom chat; I'll read it to you. It's from Ko Hearn: how does performance compare between CUDA and standard C++ and Fortran codes? That is, in the past, writing CUDA code was often the suggestion to achieve the best performance possible. Is this still the case?
A: So, if your goal is the best performance on the GPU that you have, CUDA C++ or CUDA Fortran is the way to accomplish that. There are low-level hardware features that we can't expose in standard C++, or that may take many years to get exposed in the standard. So if you really want to tune for the best performance on the GPU you have, you would write it in CUDA C++ or CUDA Fortran. Of course, the trade-off...