From YouTube: 11. Concluding Remarks, Interoperability -- Brent Leback
Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
I think, in fact, that at the last hackathon BerkeleyGW had OpenACC and OpenMP in the same program, so you could decide between the two at runtime. A lot of people use CUDA when they really need it, and maybe that's like one percent of the code, but it could be a large percent of the overall execution time.
So what do we mean by that? First, different programming models can appear in the same source file; sometimes that's easier than others. In Fortran we have control over all of the compilers: the CPU compiler, the OpenMP compiler, the OpenACC compiler, and the CUDA Fortran compiler. So interoperability within a file is actually pretty easy for us. With nvcc it's a little more difficult.
Really only nvcc can compile GPU code. As Max showed, global functions are the entry points into a kernel. There are also device functions, which are commonly used; Max didn't go into that, but global kernels can call device functions.
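To illustrate, here is a minimal CUDA Fortran sketch of that relationship; the names scale_val and scale_all are made up for this example:

    module kernels
      use cudafor
    contains
      ! device function: callable only from GPU code, not from the host
      attributes(device) real function scale_val(x)
        real, value :: x
        scale_val = 2.0 * x
      end function

      ! global kernel: the host-callable entry point, which may call
      ! device functions such as scale_val
      attributes(global) subroutine scale_all(a, n)
        real :: a(*)
        integer, value :: n
        integer :: i
        i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
        if (i <= n) a(i) = scale_val(a(i))
      end subroutine
    end module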
Another definition: objects from different programming models can be linked together into the same program, and one programming model can use data declared, defined, or initialized in a different programming model. We've shown some examples of that.
You know, using OpenACC or OpenMP to manage the data, and then getting the device pointers to call into CUDA libraries with that. A little harder part of interoperability is: can I generate a kernel using OpenACC or OpenMP and call device functions which are written in CUDA? Our support for that is pretty good, though there are a few places where it's lacking.
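The data-management pattern looks like this; it's a sketch, and my_cuda_saxpy is a hypothetical CUDA C routine standing in for any CUDA library entry point:

    ! OpenACC owns x and y; host_data exposes their device addresses
    integer, parameter :: n = 4096
    real :: x(n), y(n)
    !$acc data copyin(x) copy(y)
    !$acc host_data use_device(x, y)
    call my_cuda_saxpy(n, 2.0, x, y)   ! x and y arrive as device pointers
    !$acc end host_data
    !$acc end data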
When I talked about standard parallelism yesterday, I noted that currently inside DO CONCURRENT we have problems calling functions that aren't PURE, and all of our CUDA functions, for the low-level CUDA things you might want to do, are not marked as PURE yet. That's a problem we're trying to solve for interoperability.
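In standard Fortran terms, the rule at issue is that any procedure referenced inside DO CONCURRENT must be PURE, which is exactly what today's CUDA interfaces don't declare. A minimal sketch:

    program pure_demo
      implicit none
      integer :: i
      real :: a(8) = 1.0
      do concurrent (i = 1:8)
         a(i) = twice(a(i))   ! fine: twice is PURE
         ! a call to an impure routine (e.g. a CUDA device API)
         ! here would be rejected
      end do
      print *, a
    contains
      pure real function twice(x)
        real, intent(in) :: x
        twice = 2.0 * x
      end function
    end program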
So that's important. And Max didn't show this yesterday, excuse me, in the previous talk, but in the chevron configuration when you launch kernels, an optional argument is the stream number. That is useful, in fact, for people who are trying to get speed-of-light performance out of their kernels.
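In CUDA Fortran that looks roughly like this, reusing the hypothetical scale_all kernel from the earlier sketch; the fourth chevron argument is the stream:

    use cudafor
    use kernels
    real, device :: a_d(8192)
    integer(kind=cuda_stream_kind) :: stream
    integer :: istat
    istat = cudaStreamCreate(stream)
    ! <<<grid, block, dynamic shared memory bytes, stream>>>
    call scale_all<<<64, 128, 0, stream>>>(a_d, size(a_d))
    istat = cudaStreamSynchronize(stream)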
Just to prove that you can use all these models in the same program, as kind of a toy or a joke, I wrote a five-line Fortran program and compiled it different ways, turning on the flags -mp, -acc, and -cuda. You can see that we can create a program that accepts OpenMP, OpenACC, and CUDA Fortran in any combination, with the sentinels shown in the upper left.
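Something in the spirit of that toy, though not the exact slide code: each sentinel is honored only when its flag is on and is an ordinary comment otherwise, and with -cuda the same file could also declare device data or kernels.

    program toy
      implicit none
      integer :: i
      real :: a(4), b(4)
      !$omp parallel do       ! active with -mp
      do i = 1, 4
         a(i) = i
      end do
      !$acc parallel loop     ! active with -acc
      do i = 1, 4
         b(i) = 2 * i
      end do
      print *, a, b
    end program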
You can also use ifdefs in the same program: OpenMP and OpenACC are defined to set these macros (_OPENMP and _OPENACC), and I believe nvcc and CUDA Fortran define the _CUDA macro.
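Guarding code with those macros looks like this, assuming the file is preprocessed (a .F90 suffix or the -cpp flag):

    subroutine which_model()
    #if defined(_CUDA)
      print *, 'built with -cuda (CUDA Fortran)'
    #elif defined(_OPENACC)
      print *, 'built with -acc (OpenACC)'
    #elif defined(_OPENMP)
      print *, 'built with -mp (OpenMP)'
    #else
      print *, 'plain CPU build'
    #endif
    end subroutine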
So, as I mentioned before, if you have CUDA C, you should probably use nvcc to compile that; use our nvc, which is our C compiler, or nvc++ to compile OpenMP or OpenACC. Calling CUDA libraries with host-side interfaces does not require nvcc. So for the libraries like cuFFT and cuRAND that we showed, you're running those on the device, but you do not need nvcc.
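For example, a host-side cuFFT call from Fortran might look like this; a sketch assuming the cufft module's interfaces mirror the C API, compiled with something like nvfortran -cuda -cudalib=cufft:

    program fft_demo
      use cudafor
      use cufft
      implicit none
      integer, parameter :: n = 1024
      complex, device :: d_sig(n)   ! data lives on the device
      integer :: plan, istat
      d_sig = (1.0, 0.0)
      istat = cufftPlan1d(plan, n, CUFFT_C2C, 1)              ! host-side API
      istat = cufftExecC2C(plan, d_sig, d_sig, CUFFT_FORWARD) ! runs on the device
      istat = cufftDestroy(plan)
    end program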
We provide the -cuda and -cudalib options just to make compiling and linking easier. One thing to be aware of when you mix CUDA compiled with nvcc and our compilers, nvc or nvc++: there's this notion of relocatable device code, called RDC.
Our compilers nvc and nvc++ turn that on by default, because we figure that for HPC applications people are usually calling lots of functions from different files. nvcc, the CUDA compiler, doesn't feel like making that assumption, and for performance, no-RDC can actually perform slightly better than RDC.
A few more optimizations can occur in the code generation during the various phases of assembly, so be aware of that. There are options on all compilers to turn relocatable device code on and off.
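For reference, the toggles look something like this; check each compiler's documentation for the exact spellings:

    nvfortran -cuda -gpu=nordc main.cuf    (RDC on by default; nordc turns it off)
    nvcc -rdc=true kernels.cu              (RDC off by default; -rdc=true turns it on)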
The people at NERSC who are attending some of our calls know that some of our slides show C++ standard parallelism interoperability with pragma-based data directives.
It's hard, because a lot of the C++ parallel algorithms are just handled in header files with metaprogramming, and the compiler doesn't really get a chance to interject itself to help out with, say, looking up into the present table, which I've mentioned a few times, to know how to get the device address for a corresponding host array. So again, there's more work for us to be doing here as well.
In Fortran, similar to C++, I mentioned that our Fortran compiler, nvfortran, compiles all the models. Across languages, Fortran calling C is pretty well defined, as is CUDA Fortran calling CUDA C. We've got lots of examples of that in our packages, and we've had blog posts over the years on how to do that.
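A typical sketch of that cross-language glue, where saxpy_wrapper is a hypothetical C function, perhaps compiled with nvcc, that launches a kernel:

    interface
       subroutine saxpy_wrapper(n, a, x, y) bind(c, name="saxpy_wrapper")
         use iso_c_binding, only: c_int, c_float, c_ptr
         integer(c_int), value :: n
         real(c_float),  value :: a
         type(c_ptr),    value :: x, y   ! pointers passed by value, C-style
       end subroutine
    end interface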
Maybe I've left a few out there, but if you're a Fortran programmer and you're making use of some of those CUDA libraries, I really recommend you use the -cudalib option, because some of those interfaces require an extra wrapper library.
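That option takes care of the link line for you; for example, something like:

    nvfortran -acc -cudalib=cufft app.f90    (links cuFFT plus any needed wrapper library)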
Because OpenMP defines a host fallback mode, some cases which work with OpenACC plus CUDA are not quite right yet with OpenMP plus CUDA. But we're working on it, and I think we can solve that problem; we just have to get to all of the cases. And, as I mentioned yesterday, for interoperability we still need to figure out if and how we allow non-standard Fortran features in DO CONCURRENT, for things like calling CUDA functions.
If we want to, or for expressing a launch configuration. I will say, you know, we didn't add the capability to change the launch configuration to OpenACC or OpenMP just because it's fun; we did it because people have required it. Real applications need control over that. So to think that you can port a real application to DO CONCURRENT without it is just to say, well, I'm willing to give up some performance.
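In OpenACC and OpenMP, that control is spelled with clauses like these; the values here are only illustrative:

    !$acc parallel loop num_gangs(1024) vector_length(128)
    do i = 1, n
       y(i) = a * x(i) + y(i)
    end do

    !$omp target teams distribute parallel do num_teams(1024) thread_limit(128)
    do i = 1, n
       y(i) = a * x(i) + y(i)
    end do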
I have host code versions of them that you can compile just on the host, and if you want to start with that, you can insert the directives yourself if you want to try different things. Then we've provided a handful of different solutions to that. I did them all kind of quickly yesterday; some of them might be a little buggy or off, but they're basically correct. You should be able to get the gist of what we're trying to show.