Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
So I've added this slide again: I will present standard languages from a Fortran point of view, and Matt Stack will present standard languages from C++, and we're going to try to divide it up into 15 minutes each. I think, you know, Helen, we can take questions in between or at the end; it doesn't really matter.
So some of this will be review; Jeff went over some of these things. For this training we decided to start from the high level, which is the leftmost column here, and work to the right. Sometimes I think of things in the other order: I start from a CUDA point of view and work my way left. Part of that is my job; part of my job is to find new features in CUDA, things that we can expose at a higher and higher level. But either way, over the next two days you'll get exposed to all three of these columns, and the libraries underneath as well.
So for this section we're just going to talk about standard languages in Fortran.
Everyone is aware of what we're doing; we have people on the Fortran standards committee, and it's mainly just a language thing in the language spec. The intent of do concurrent is to say: this is a parallel loop. The iterations of this loop can run in any order, and they can run on any type of parallel hardware.
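To make that concrete, here is a minimal sketch of a do concurrent loop; the saxpy-style routine and all the names in it are illustrative, not taken from the slides:

    ! Minimal DO CONCURRENT sketch (illustrative, not from the slides).
    ! The iterations are independent, so the compiler is free to run them
    ! in any order, on any parallel hardware.
    ! Compile for GPU with something like: nvfortran -stdpar=gpu saxpy.f90
    pure subroutine saxpy(n, a, x, y)
       integer, intent(in)    :: n
       real,    intent(in)    :: a, x(n)
       real,    intent(inout) :: y(n)
       integer :: i
       do concurrent (i = 1:n)
          y(i) = y(i) + a * x(i)
       end do
    end subroutine saxpy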
So it's a little bit of a different format than the old Fortran do loop, and it has locality specifiers, so you can declare whether a variable used within the body is local, local_init, shared, or default.
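As a quick sketch of those specifiers (the variable names are mine, not from the slides):

    real    :: a(1000), b(1000), tmp
    integer :: i
    ! Each iteration gets its own uninitialized copy of tmp;
    ! a and b are explicitly shared across iterations.
    do concurrent (i = 1:1000) local(tmp) shared(a, b)
       tmp  = 2.0 * a(i)
       b(i) = b(i) + tmp
    end do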
A
This
is
one
of
the
features,
as
jeff
mentioned,
that
dash
standard
par
tries
to
make
as
much
of
the
data
in
your
program
as
possible
to
be
managed.
Data,
cuda
managed
data
and
what
cuda
managed
data
is
is
data
that
the
driver
and
os
is
responsible
for
paging
back
and
forth
between
the
gpu
and
cpu,
similar
to
how
virtual
memory
works
on
a
cpu.
I guess I need to speed up here a little bit, so just some examples: do concurrent in MiniWeather. MiniWeather is an application out of, I think it's actually out of Oak Ridge, from Matt Norman. We have ported that code to do concurrent, and the upper left here shows do concurrent with some local arrays (d3_vals, stencil) and also some local variables. So within a do concurrent you can have other do loops.
You can do just normal Fortran operations, and if you compile with -Minfo you'll see that we parallelize the do concurrent loops across threads and blocks, and then some of the other loops are run sequentially within that, which, it turns out, is a pretty good schedule for this.
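A sketch of that pattern (the array and variable names are illustrative, not the actual MiniWeather source):

    ! The outer DO CONCURRENT is mapped across CUDA threads and blocks;
    ! the inner ordinary DO loop runs sequentially within each thread.
    do concurrent (k = 1:nz, i = 1:nx) local(s, val)
       val = 0.0
       do s = 1, 4                          ! small stencil loop, kept sequential
          val = val + coef(s) * state(i+s-1, k)
       end do
       flux(i, k) = val
    end do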
Starting in 21.11, and I believe Jeff mentioned this as well, we support the reduce clause in do concurrent. On the top line on the left you can see do concurrent with reduce, and the reduce is a reduction on the variables mass and te. So this gives you a lot of capability. Our compiler would sometimes find reductions automatically, but it's good to have a specification that actually supports that, and it doesn't hurt to add them even if our compiler can find them automatically, because other compilers may not be able to do so.
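A sketch of the reduce clause (the variables mass and te follow the slide; the arrays are illustrative):

    mass = 0.0
    te   = 0.0
    ! Sum-reductions over mass and te, computed safely in parallel.
    do concurrent (k = 1:nz, i = 1:nx) reduce(+: mass, te)
       mass = mass + rho(i, k)
       te   = te   + total_energy(i, k)
    end do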
So we have some current limitations. This is, you know, fairly new work, newer than our OpenMP compiler. Some of the GPU programming models we've been working on for over 10 years; this is fairly new, it's been around about a year.
The Fortran spec requires functions and subroutine calls in do concurrent to be pure.
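As an illustrative sketch (the names are mine, not from the talk), a routine has to be declared pure before the loop body can call it:

    ! PURE promises no side effects, so this is callable from DO CONCURRENT.
    ! (It would live in a module or a CONTAINS section.)
    pure real function damp(x)
       real, intent(in) :: x
       damp = x / (1.0 + abs(x))
    end function damp

    ! ... later, inside the parallel loop:
    do concurrent (i = 1:n)
       y(i) = damp(x(i))
    end do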
So if you call certain things, you may get messages that, oh, you're calling a subroutine that is not pure, and we may at some point change that in our compiler. But today, that's also another place where we are a little fuzzy according to the spec: we follow the OpenACC and OpenMP defaults for scalars and arrays within the body of the do concurrent, so scalars are firstprivate, or local, by default, and arrays are shared by default.
Do concurrent lacks control over GPU scheduling, which we have found useful over the years: things like forcing a loop to run sequentially inside of a region, or offloading a serial kernel. There's no control equivalent to OpenACC's gang/worker/vector, which we'll talk about tomorrow. And then there's interoperability with CUDA, okay.
So we like all of our models to interoperate. We'd like do concurrent to work with OpenACC and OpenMP, and we would like it to interoperate with CUDA as well, CUDA Fortran in this case. So we still need to mark some of our standard device functions as pure; there are certain things that are just kind of part of CUDA that you can't call in a do concurrent. We do support atomics, because that was important, so we made that change.
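As a hedged sketch of what that atomics support looks like in practice, assuming the CUDA Fortran atomicadd intrinsic is the extension being used inside do concurrent (my example; the exact spelling and flags, something like -stdpar=gpu -cuda, may vary):

    ! Hedged sketch: concurrent iterations increment shared histogram bins
    ! safely via atomicadd (a CUDA Fortran intrinsic, used here as an
    ! NVIDIA extension; it returns the previous value).
    use cudafor
    integer :: hist(nbins), old, i
    do concurrent (i = 1:n)
       old = atomicadd(hist(bin(i)), 1)
    end do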
But there are some other functions, you know, low-level CUDA functions, that would be nice, and we're not there yet. We don't have any control over the stream which the offload region runs on, and we are not yet interoperable with CUDA Fortran device-attributed data. We would like to be able to declare CUDA Fortran device data and use that in a do concurrent. Now, these are all extensions, so it would be non-portable, but it would make the programming model much more powerful if you needed it.
This is a duplicate slide; I'm not going to get into this too much. I just added the title of a paper that came out at the end of last year from a person named Ron Caplan, who's been using our compiler for many years, and he wrote a really nice paper called "Can Fortran's do concurrent Replace Directives for Accelerated Computing?"
I'm not going to get into this slide too much; this was presented at the last GTC. There are people working with do concurrent: some kernels out of NWChem have been moved to do concurrent, and they found the performance was, you know, basically on par with OpenMP or OpenACC on the GPU. And a group from GAMESS used do concurrent to port a portion of the GAMESS code; it was a pretty simple port.
A little bit about other libraries and ways that you can use standard Fortran; Jeff mentioned this matmul. So one thing that I do as part of my job is to create Fortran interfaces to CUDA libraries, and while I was writing the interfaces for cuTENSOR it occurred to me that cuTENSOR solves a lot of the Fortran array intrinsic problems, for matmul, reshape, and spread. And so we just added some capability:
if you use the cutensorEx module, we recognize cases that cuTENSOR can run in a single kernel, like this matmul, and just offload that. And you will never be able to, you know, write handwritten code that performs as well as the cuTENSOR matrix multiply in the library. So, if you can take advantage of that, and this is standard Fortran, you'll get really good speedups.
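A sketch of that pattern, assuming the cutensorEx module from the HPC SDK; the array names and the compile line are illustrative:

    ! Standard Fortran matmul offloaded through cuTENSOR.
    ! Compile with something like: nvfortran -gpu=managed -cudalib=cutensor ex.f90
    program matmul_cutensor
       use cutensorex            ! overloads matmul and friends
       integer, parameter :: n = 2048
       real(8), allocatable :: a(:,:), b(:,:), c(:,:)
       allocate (a(n,n), b(n,n), c(n,n))
       call random_number(a)
       call random_number(b)
       c = matmul(a, b)          ! recognized and run as a single cuTENSOR kernel
    end program matmul_cutensor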
You've seen this slide before; the blog on the bottom is an article I wrote, "Bringing Tensor Cores to Standard Fortran". These are just representative of the types of operations that we can recognize and turn into cuTENSOR calls under the hood.
One quickly here: one project that we've done, sort of in collaboration with NERSC, is a library called nvLAmath, using some of the same techniques that we've used in other areas. What we wanted to do was to write our own wrappers around some of the cuSOLVER functionality, so NERSC identified 30 or 40 important LAPACK calls for them. Of course, DGETRF is usually the most important; it does LU factorization.
If you call cuSOLVER directly, you know, you have to go through a little bit of a set of steps: you get the handle, you figure out what workspace sizes are needed by cuSOLVER, you allocate that workspace, you call the cuSOLVER version of DGETRF, and you deallocate the work.
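A hedged sketch of that sequence in CUDA Fortran (the cusolverDn module and routine names follow the HPC SDK interfaces, but the exact argument lists may differ):

    use cudafor
    use cusolverdn
    type(cusolverDnHandle) :: h
    real(8), device, allocatable :: a_d(:,:), work_d(:)
    integer, device :: ipiv_d(n), info_d
    integer :: istat, lwork

    istat = cusolverDnCreate(h)                                  ! get the handle
    istat = cusolverDnDgetrf_bufferSize(h, n, n, a_d, n, lwork)  ! query workspace size
    allocate (work_d(lwork))                                     ! allocate the workspace
    istat = cusolverDnDgetrf(h, n, n, a_d, n, work_d, ipiv_d, info_d)  ! LU factorize
    deallocate (work_d)                                          ! free the workspace
    istat = cusolverDnDestroy(h)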
If you compile that, you get about 3.3 teraflops on a V100. And then, GPU with nvLAmath: if you compile with the option -cudalib=nvlamath, we will, secretly kind of, pull in a module that redefines the interfaces to DGETRF and does the wrapper work for you.
So you don't need to make any of these changes in the source of your legacy Fortran applications, and, you know, the time is basically the same; we're not saying that this is faster.
It does basically the same work, but it just hides that from you.
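In other words, the legacy call site stays as-is; a sketch (the compile line is illustrative):

    ! Unchanged legacy LAPACK call; compiling with something like
    !   nvfortran -cudalib=nvlamath app.f90
    ! pulls in the module that reroutes it through cuSOLVER on the GPU.
    call dgetrf(n, n, a, n, ipiv, info)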
So, some possible future work: we'll probably look at adding some non-standard or NVIDIA-specific capabilities to do concurrent, some of the things I mentioned. We'd like to do some more F90 intrinsic function support, similar to what we have for matmul, reshape, and spread; pack and merge would be very nice. And, you know, once Perlmutter comes up and we get some feedback from NERSC users, we may add some more supported routines to nvLAmath.