Description
Part of the Nvidia HPC SDK Training, Jan 12-13, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/nvidia-hpcsdk-training-jan2022/
A: Good. My name's Brent Leback; I was a member of PGI before we were bought by NVIDIA. I've held a few different positions, and these days I just consider myself part of the NVIDIA HPC SDK team.
A: So I just showed this slide. This is the HPC SDK from a high-level view. I'm not going to go over it too much, but it has the various compilers and programming models. For this section we're going to talk a little bit about porting considerations.
A: I've given these talks for 12 years or so. Starting about five years ago, when the CORAL systems were first introduced, we used to show this slide. On the left is the CPU, whether it's x86, POWER, or Arm, and it has a handful of cores; that's probably more cores now than it was five years ago, maybe they've doubled to 40 or 64, or something like that.
B: Sorry to interrupt, Brent. People are reporting the sound has become choppy, I think.
A: The CPU is connected to large, high-capacity memory; these days that's half a terabyte or a terabyte on a server-class system, relative to the GPU. The memory bandwidth between the CPU cores and their memory is fairly low. I've been around this industry for quite a while; in fact I wrote a paper in 2007 or 2008 with some people at Sandia, when multicore was first coming out, about how the memory bandwidth was not keeping up with the number of CPU cores.
A: On the right we have a typical GPU accelerator, and it has many more cores. Every generation seems to roughly double the number of GPU cores on the accelerator.
A: With more cores, the cores are simpler, and as we'll discuss in the considerations, the key is not just to fill up the cores on the accelerator but to way oversubscribe them, so you can in fact hide some of the memory latency and take advantage of the high memory bandwidth. Between the two there is an interconnect, either PCIe or NVLink on the CORAL-based machines, and that is yet another consideration we'll talk about over the course of the next two days: how to minimize the amount of traffic across that connection between the CPU and GPU.
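To make the oversubscription point concrete, here is a minimal CUDA sketch (not from the slides) of a grid-stride loop: it launches far more threads than there are physical cores, which gives the GPU enough independent work to hide memory latency. The kernel name, sizes, and launch parameters are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: y[i] = a*x[i] + y[i], written as a grid-stride loop so a
// fixed launch configuration covers any problem size and the GPU is heavily
// oversubscribed with many more threads than physical cores.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 24;                      // ~16M elements, far more than the core count
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;  // tens of thousands of blocks in flight
    saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);                // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```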
A: So, processor counts through the years. Notice the little blue circle in the upper left-hand corner. When some of us started in HPC, most of the parallelism was via MPI. You wrote pretty much a sequential program, and at certain points you inserted MPI sends and MPI receives into your code.
A: The core may have been a 486, or maybe an SGI processor, or maybe even a Sun UltraSPARC processor. There were a few registers, and it had some instruction-level parallelism.
A: We still used high-level parallelism via MPI; it seemed to be a great model and it has survived over many, many years. We still got most of our performance improvements from manufacturing process improvements, and instruction-level parallelism got even higher.
A: There were more registers, the clocks kept getting faster, and at this point SIMD vectorization of loops became important. Compilers, including the old PGI compiler where I worked, really invested a lot of time and effort into vectorizing loops across SIMD lanes.
B
Sorry
brent
sure,
never
again
and
you're
supposed
to
show
some
diagrams
we're
only
seeing
texts
on
the
screen
right
now.
A
I
will
so
I
am
showing
how
processor
counts
have
grown
over
the
generations.
A
So
we
still
code
to
the
main
core
and
the
sequencer
right,
so
you
have
a
sequencer
that
has
branches
and
loops
and
things
now,
even
though
you
have
say
four
lanes
in
your
simdi
registers,
you
use
simdi
loads
and
stores
on
those.
You
don't
write
code
for
each
individual
lane
and
that's
kind
of
the
cpu
model.
A: Then we moved to multi-core architectures. This was maybe 15 years ago, when AMD launched the Hammer; I was trying to think of the code name, the Hammer, the AMD64 architecture. Then people started to use MPI plus OpenMP: maybe one MPI process per core, or maybe one MPI process plus OpenMP threads. This was about the time that clock rates really began to slow; we had hit maybe three gigahertz, and that was kind of the top end on x86.
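As an aside, here is a hedged sketch of that MPI-plus-OpenMP style (my own minimal example, not from the talk): one MPI rank per node or socket, with OpenMP threads sharing the work inside the rank.

```c++
// Hypothetical hybrid skeleton: launch with, e.g., one MPI rank per socket and
// let OpenMP threads fill the cores of that socket.
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        // Each OpenMP thread handles a share of this rank's local work.
        printf("rank %d/%d, thread %d/%d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```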
A
They
actually
started
to
slow
down,
but
they
added
more
so
then
pneuma
issues
began
to
crop
up.
Pneuma
p
threads
made
their
way
into
the
linux
kernel.
I
remember
how
disruptive
that
was.
A
The
memory
bandwidth
again
does
not
keep
up
with
the
compute
speed.
The
main
memory
got
bigger,
but
perhaps
you
only
had
about
two
gigabytes
per
core
on
your
on
your
server,
but
they
added
more
and
more
caches
to
each
core.
So
if
you
to
deal
with
the
latency
to
main
memory
they
added
more
cache,
then
we
moved
to
avx
512,
and
so
this
is
maybe
like
quarry
gpu.
A
Again,
a
lot
of
the
same
things
as
before,
but
just
more
and
better
avx
512
hardware
was
actually
a
little
slow
to
take
hold
and
the
initial
implementations
were
not
optimal.
A
lot
of
times
we
found
in
our
compiler.
It
was
better
to
generate
code
for
avx
256,
assuming
that
every
core
was
going
to
be
active,
it
had
heat
or
or
power
issues
still
pneuma
issues.
A
Vectorizing
compilers
are
still
very
important.
Cmd
instructions
still
in
use,
and
just
more
and
more
of
the
same
types
of
things,
software
grows
in
complexity,
relies
on
features
like
dynamic
memory,
large
heaps
large
data,
big
long
call
stacks.
Of
course,
you
still
coded
to
the
main
core
and
sequencer.
You
did
512.
A
So
this
is
this
is
what
you're
faced
with
if
you've
been
running
on
an
avx-512,
cpu
and
now
you're
moving
your
code
to
amp.
So
you
need
to
find
much
more
parallelism
in
your
code
when
you
move
this
to
the
gpu.
Otherwise,
you'll
be
under
under
utilizing
the
gpu,
so
you
know
40
different
threads
are
just
not
enough
for
64
different
threads.
A
You
know
times.
Eight
even
is
just
not
enough.
An
ampere
a100
likes
to
work
on
thousands,
tens
of
thousands
of
different
elements
all
at
the
same
time.
A
So
we
can
still
use
high-level
parallelism
via
mpaa
and
perhaps
openmp.
Now,
there's
been
a
lot
of
applications
ported
over
the
years
to
gpus
that
still
use
openmp,
maybe
one
thread
per
gpu
and
in
fact
some
well-known
applications
are
only
openmp
one
thread
per
gpu
and
then,
if
you
don't
have
enough
gpus
all
the
threads
gonna
help
out
do
some
work
sharing
and
they
have
schedulers
built
into
them.
A
So
because
the
the
gpu
is
so
large,
there
are
some
tricks
or
ideas
you
can
use
to
like
make
multiple
uses
of
the
gpu.
So
we'll
talk
about
some
of
these
things,
multiple
cuda
contexts,
multiple
streams
running
in
parallel,
there's
mig
and
mps,
which
are
some
products
provided
by
nvidia,
which
allow
sharing
the
gpu
resources
either
within
your
program
or
with
other
people.
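As a hedged illustration of the multiple-streams idea (mine, not the presenter's code), two independent kernels submitted to different CUDA streams may execute concurrently if the GPU has resources to spare; the kernel and sizes are made up.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    // Independent work submitted to two different streams can overlap on the GPU.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    scale<<<(n + 255) / 256, 256, 0, s0>>>(a, n, 2.0f);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(b, n, 3.0f);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```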
A
You
know
other
entities
on
a
gpu
synchronization
between
the
columns
in
this
diagram
is
hard.
Synchronization
down
the
column
is
a
feature
of
cuda,
that's
been
there
forever
and
you
sync
all
the
threads
in
a
single
block.
A
I
say
it's
hard,
but
it
is
solvable
and
a
lot
of
clever
people
have
solved
it,
but
it's
usually
buried
into
a
library.
It's
not
really
as
much
a
part
of
the
programming
model.
So
if
you
need
to
do
synchronization
between
the
thread
blocks
or
the
columns
in
this
diagram,
you'll
struggle
to
find
a
good
solution.
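A minimal sketch of the within-block synchronization just described (my own illustrative kernel): threads of one block cooperate through shared memory and __syncthreads(), but there is no comparably simple barrier across blocks.

```cuda
// Hypothetical block-local sum: each block reduces its slice of the input.
// __syncthreads() is a barrier only for the threads of this block; combining
// the per-block results still has to happen in a second kernel or on the host.
__global__ void blockSum(const float *in, float *blockResults, int n) {
    __shared__ float partial[256];            // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                          // all block threads have written

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();                      // wait before the next halving step
    }
    if (tid == 0) blockResults[blockIdx.x] = partial[0];
}
```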
A: The memory latency is very high and the caches on a GPU are relatively small. Now, the caches on a GPU have gotten better and better with every generation, but they are still much smaller than on a CPU, and every once in a while we run into applications that run faster on a multi-core CPU than on a GPU. It's usually because on the multi-core CPU the whole data set fits in, say, L1 cache. You will not see a lot of speedup if your data sets are that small.
A
There
is
a
shared
memory
which
jeff
I
think
touched
on,
so
it
is
a
programmer,
managed
cache
and
it's
useful
for
performance
and
to
communicate
between
cores
and
an
sm
or
in
the
column,
within
a
thread
block
in
a
column
on
the
diagram
and
massively
over
subscribing.
The
cores
is
a
key
to
performance.
So
I
don't
know
they
may
say
an
a100
has
5000
cuda
cores
in
it.
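Here is a hedged sketch of using shared memory as that programmer-managed cache (my example, not the presenter's): each block stages a tile of the input, plus halo elements, so neighboring reads come from on-chip memory instead of global memory.

```cuda
// Hypothetical 1D three-point stencil. For simplicity it assumes n is a
// multiple of blockDim.x (256); each global element is loaded once per block
// instead of up to three times.
__global__ void stencil3(const float *in, float *out, int n) {
    __shared__ float tile[256 + 2];                       // tile plus two halo slots
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                            // leave room for the left halo

    tile[lid] = in[gid];                                  // each thread stages one element
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : in[gid];      // left halo
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : in[gid];  // right halo
    __syncthreads();                                      // tile fully staged before use

    out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```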
A: CUDA is a lot different. I've been around a long time, and it was kind of the first model where you coded to the core, or the lane. The CUDA runtime on the GPU handles the divergence, even though all the cores in a column in the diagram might run in lockstep.
A
You
can
say
something
like
if
I'm
thread
number
four
go
off
and
do
something
else,
and
you
don't
do
that
on
a
on
a
cpu
with
cindy
lanes,
but
you
can
do
that
in
cuda
and
that
ability
kind
of
trickles
up
into
the
other
programming
models.
So
there's
a
cost.
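A small illustrative kernel (mine, hedged) of that per-thread branching: it is legal to have one thread take a different path, but within a warp the divergent paths run one after the other, which is the cost being referred to.

```cuda
// Hypothetical divergence example: thread 4 of each block does something
// different. The hardware serializes the two paths within a warp.
__global__ void divergentWork(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (threadIdx.x == 4)
        data[i] = -1.0f;            // "if I'm thread number four, do something else"
    else
        data[i] = 2.0f * data[i];   // everyone else does the regular work
}
```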
A
There's
a
you
know:
cost
in
performance,
but
the
cuda
runtime
that
runs
on
the
gpu
just
handles
that
each
core
lane
loads
and
stores
its
own
data
and
the
os
or
the
low
level
runtime,
ideally
coalesces,
those
into
contiguous
blocks.
I've
found
over
the
years.
The
biggest
factor
in
gpu
performance
is
getting
all
the
cores
in
a
thread
block
like
a
column
of
the
cuts
on
the
diagram
to
read
consecutive
memory
locations
in
memory.
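To make the coalescing point concrete, here is a hedged sketch (names and stride are illustrative) contrasting a coalesced access pattern with a strided one.

```cuda
// Consecutive threads read consecutive addresses: a warp's loads combine into
// a few wide memory transactions (coalesced).
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Consecutive threads read addresses that are `stride` elements apart: each
// warp now touches many separate memory segments and wastes most of the wide
// memory bus.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```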
A: And finally, this is one of the things that I always tell people: overheads can really adversely affect performance. GPU programming is a kernel-based programming model. You launch kernels, they're short-lived, the kernel ends, then you launch another kernel and it ends, and every CUDA thread typically does only a small amount of work.
A
So,
unfortunately,
if
you
have
overhead,
like
indirect
memory,
access
or
other
you
know,
function
calls
where
you
have
to
set
up
argument
lists
and
then
tear
down
argument
lists
things
like
that.
That
can
actually
take
longer
for
a
thread
than
doing
the
actual
work.
So
you
need
to
think
about.
A
These
all
follow,
you
know,
follow
the
same
guidelines,
so
look
for
the
cuda
c,
plus
plus
programming
guide,
cuda
c,
plus
plus
best
practices
guide,
the
stuff
that
they
talk
about
applies
to
all
models,
so
you
know
find
ways
if
you
can
to
parallelize
sequential
code,
minimize
the
data
transfers
between
the
host
and
the
device
kind
of
back
to
that
original
diagram.
Either.
You
know
if
it's
andy,
link
or
pci
a
lot
of
times.
You
will
run
a
multi
a
large
number
of
time,
steps
on
the
gpu.
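A hedged sketch of that pattern (my own, with made-up names and sizes): copy the data to the device once, run many time steps worth of kernels on it, and copy the result back once, instead of crossing PCIe or NVLink every step.

```cuda
#include <vector>
#include <cuda_runtime.h>

__global__ void timeStep(float *field, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] += 0.5f * field[i];       // stand-in for the real physics
}

int main() {
    const int n = 1 << 20, steps = 1000;
    std::vector<float> host(n, 1.0f);

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);  // once

    for (int s = 0; s < steps; ++s)               // all time steps stay on the GPU
        timeStep<<<(n + 255) / 256, 256>>>(dev, n);

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // once
    cudaFree(dev);
    return 0;
}
```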
A
Adjust
the
kernel
launch
configuration,
you
may
know
more
about
the
you
know:
loop,
extents,
the
size
of
each
dimension
of
your
raise
than
the
compiler
ever
does
you
can
play
around
with
the
launch
configurations
in
cuda
the
number
of
blocks,
the
number
of
threads
per
block
and
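As a hedged sketch of playing with the launch configuration (illustrative numbers only): with a grid-stride kernel the block size and block count become tuning knobs that can be changed without touching the kernel.

```cuda
#include <algorithm>
#include <cuda_runtime.h>

__global__ void scaleGridStride(float *x, int n, float a) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        x[i] *= a;
}

int main() {
    const int n = 1 << 22;
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    // Knobs to experiment with: threads per block and total block count.
    // A grid-stride kernel stays correct for any choice.
    int threadsPerBlock = 128;   // try 64, 128, 256, 512, ...
    int blocks = std::min((n + threadsPerBlock - 1) / threadsPerBlock, 2048);
    scaleGridStride<<<blocks, threadsPerBlock>>>(x, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(x);
    return 0;
}
```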
A: And make sure global memory accesses are coalesced; I talked about that. Within a warp or a thread block, you want all of the threads to be accessing consecutive elements of memory.
A
You
know
internally,
the
gpu
has
a
very
wide
memory
bus
and
if
you
access
consecutive
areas,
you
take
advantage
of
that
wide
memory,
bus
minimize,
redundant
accesses
to
global
memory.
A
You
know
gpus
do
have
some
caches
on
them:
small
l1
and
a
small
l2,
but
you
don't
want
to
be
doing
a
lot
of
extra
reads
and
writes
and
avoid
long
sequences
of
diverged
execution
by
threads
within
the
same
warp.
So
I
don't
really
see
this
as
as
much
of
a
problem,
usually
because
of
the
problems
I'm
working
on.
A
So
finally,
you've
seen
this
slide.
I
I
just
want
to
reiterate
one
more
time
everything
that
I've
talked
about
in
this
presentation
applies
to
all
three
of
these
columns,
whether
you're
using
a
new
concurrent,
standard,
r
and
c
plus
plus
directive
based
models.
The
way
you
form
your
loops
and
the
way
the
accesses
are
done
in
the
innermost
loop
are
very
important
and
I
think
that's
all.
I.