National Energy Research Scientific Computing Center (NERSC) Using Perlmutter Training, January 2022, 11 Jan 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: GPUs 101

Description

Part of the Using Perlmutter Training, Jan 5-7, 2022. Slides and more details are available at https://www.nersc.gov/users/training/events/using-perlmutter-training-jan2022/

A

Really excited to see all the the people who are attending here- and um you know, I think the point of my talk- is to really tee up the rest of the day. And so I just want to give you kind of an introduction to pearl mudder. Why we're kind of using gpus and just some of the really high level gpu.

A

Kind of technology features and and top level considerations that you need to kind of think about as you're moving your application over to to gpus and also kind of end, with some of the progress that we've been making um with applications that we've been working with on reporting or optimizing their their codes for for gpu systems.

A

So this is kind of a picture of nurse roadmap in terms of systems that we've been deploying. So if you go back to nurse 7, which was the edison system, you see kind of a traditional multi-core um uh intel, xeon powered uh hpc architecture.

A

uh As we deployed quarry, I think we started this transition towards exascale energy, efficient type, architectures and so corey was powered by the many core um intel knights, landing xeon phi processors and with nurse nine we're deploying we have deployed. I guess our first cpu gpu accelerated hpc architecture, and um you can see that nurse 10 is expected to be our first sort of like exascale class system arriving in the 2024, maybe 2025 time frame um in terms of the greater kind of doe hpc ecosystem within the office of science.

A

You can see that gpus are really beginning to play an important role um so with summit at the oakridge leadership computing facility, there's already a cpu plus gpu system powered by the nvidia volta gpus.

A

With perlmutter, of course, we're using the nvidia ampere, the next generation of gpus and the next generation of systems coming to the argonne leadership computing facility in oak ridge. Leadership facility will also support gpus from intel, as well as uh am amd gpus at the oakridge facility.

A

um So why is this happening? And I think you know I think the short answer is.

A

um It allows us to deliver more capability in um terms of overall flops and memory bandwidth, and um you know, operations per per second for for kind of less power, so here's an example of an application running on edison versus versus summit and on the y-axis. You have time on the x-axis.

A

You have power and so time times power is is energy, and so you can see as you're kind of moving down the diagonal, then you're improving on energy, energy efficiency and what you're seeing is the same application running on edison sort of that traditional cpu uh multi-core um architecture versus summit at oak ridge, the gpu architecture, and you can see that it's essentially sort of like an order of magnitude improvement that you're getting by using the the accelerators.

A

um And so, as I mentioned, this change has kind of arrived and it's really kind of driven by the power consumption towards these. These lightweight cores, and so you know, I think we found that corey using the many core architecture has been kind of a boon because of this. The kind of new capabilities- and this continues with the the gpu architectures on on promoter.

A

So one of the things I kind of want to highlight uh in this slide is what are the kind of the main concepts that you need to think about when programming for gpus, in particular the a100 gpus, um and I'm going to start with kind of two main concepts that I think, if you gather those, I think you kind of have have got what most of what you you need to understand about gpus and we'll talk about a couple. A couple, other concepts as well.

A

So one- and I think this is probably really the most important- is that you need lots and lots of parallelism.

A

um So if you compare the going from kind of like the cpu architecture that um we have sort of on, like the corey haswell system um to the gpu architecture, on perlmutter you're, going from 64 cores.

A

uh Well, I guess uh this this might have been for kml. I guess very for hospitalization 32 cores um to what you might consider 108 sms on a on a gpu socket so 108 what we call streaming multiprocessors.

A

Each of a has each of the hassle cores can handle kind of two hyper threads. This is sort of equivalent on the gpu to having 64 warps that are available per sm.

A

um Two can really be active at a time, but you can actually kind of over subscribe them to get additional levels of performance um to get really the most out of a haswell cpu. You have to think about using vector operations instead of scalar operations, so there are two 256 bit wide vectors, so that would be like four double precision um uh in instructions times times two, whereas a gpu you're, really thinking about 32, simply threads per warp, and you can kind of think of these as a little bit similar.

A

There are some important differences between what, like uh you, can do in a vector instruction on the cpu versus what these uh kind of simply threads can do on a gpu. But you really should be kind of thinking about these as 32 operations that are kind of working on different data, but doing the same instruction every every cycle, um and so, if you think, if you're talking about double precision, then you're really getting around.

A

Like 2 000 weight, parallelism 64 times 4 times, uh eight uh versus something like 200 000 way, parallelism on a a gpu, so significantly more parallelism by orders and magnitude that are necessary to really keep a gpu busy, um and so the you know, one of the things that you can think about on the cpu. Are these hyper threads that help you kind of hide, latency or hide different kind of waiting time or stalls on the on the processor and that's similar to the to the gpus?

A

Where you really want to kind of over subscribe, the gpus with either um you know more warps per sm or more streams to really help hide any latency that your that your application might might be seeing.

A

So that that's number one, you can see that the the amount of parallelism has gone up by like one, um not just one order of magnitude, but really like two or three orders of magnitude um when moving from the kind of cpu architecture to the gpu architecture um and the the next main concept I want to talk about is that the the gpu memory is very fast and your application can really take advantage of that.

A

But, on the other hand, moving data to the gpu is is not fast and that can often be a bottleneck. So, let's, let's kind of discuss this.

A

So if you look at the same kind of comparison between the haswell cpu that exists on kind of like the haswell um nodes of corey, you have a total of 128 gigabytes of ddr, um whereas the on the gpu itself on perlmutter on a single gpu, you have 40 gigabytes of uh of h, of hbm or high bandwidth memory, uh so on haswell, the um bandwidth that you can get from that memory is 128 gigabytes per.

A

Second, that's pretty pretty good for a cpu of that that generation it's a little bit faster now on some of the newer, newer generation cpus.

A

uh But you can see it's a big big boost when you go to the a100 gpus, if you're utilizing that on device memory you can get up to, you know 1500 around 1500 gigabytes, a second of memory, memory bandwidth.

A

On the other hand, what is really slow, much slower than both of those memories, is the the pci express bandwidth of transferring data back and forth. um You can see that's sort of uh an order of magnitude less than the than the memory bandwidths that we've been been talking about. So um that's kind of the slowest, the slowest pipe or the slowest um data transfer speed on the on the node, and we want to um kind of avoid moving data back and forth frequently.

A

um So uh you you know, for your application to get the the best performance out of the out of the gp. You really want to kind of keep the the data on the cheap view and get the most out of that that really high bandwidth memory and avoid moving it back and forth frequently between the cpu and the gpu, which is which is really the the slow part.

A

um So those are really the top two concepts for um for programming a gpu, and uh I think everything else is sort of a second order consideration. I think if you have lots of parallelism- and you realize that uh kind of getting the data onto the gpu and trying to keep it, there is important, then you're, like 90 there in terms of cpu performance.

A

um There are a number of second order considerations. So let me talk about just a few of those right now. So one is that when you're defining your kernels- and I think, you'll kind of hear about this in some of the upcoming presentations- um there is some sort of overhead in launching each of those kernels.

A

So you don't want those kernels to be kind of really really super short, um and so some techniques that you can use are like fusing short kernels together to to kind of have longer execution times, and um you know there's a possibility of doing things like defining cuda graphs. If you have, if you do, have a lot of really short kernels, that kind of depend on each other or need to execute in a certain order.

A

You can um kind of tell the gpu about them ahead of time by defining the kind of a kind of like a graph, and that can help eliminate some of the the overhead of launching these these individual kernels um and then for.

A

uh I think um I mentioned that the high bandwidth memory is fast, but just like on a cpu, it's actually better to keep your data in registers in cache or in what is available on a100 called shared memory to keep the data even closer to the compute units when possible. So you know, I think a lot of our applications at nurse tend to be more dependent in terms of their performance on the ability to move data quickly rather than to just compute.

A

um uh You know as many flops as possible and so keeping the data as as fast as close to the compute units as possible can can help um and, in particular, for many applications. We find that there's it's important to kind of like experiment and find an optimal balance between maximizing the parallelism, which I really highlight a lot in this first point here and minimizing the amount of uh kind of spilling of data out of the register.

A

So the gpu has kind of a fixed number of registers and um in some sense, the more uh parallelism you expose or the more kind of warps you have active, the more likely it is that uh your data might might spill from the registers, and so there is some um experimentation. That's often necessary at the kind of last level of optimizing, your application to find that optimal, that optimal balance.

A

Okay, so I feel, like uh you know, if you get number one and two, if you have that in your head, then I think you understand the most important things about programming for a gpu um and then there's a number of second order considerations. These are really kind of just two big examples of them, um but you know these.

A

These are the type of things that will help you kind of just tune and um really get the the absolutely best performance that you can out of uh out of a gpu out of a gpu system um in terms of which programming models you use to kind of express these concepts.

A

uh You're gonna hear a lot about this in the next presentation right after mine, um and you know, I think the really nice thing about perlmutter is that it kind of supports every different programming model out there for programming for programming gpus.

A

We realize that a lot of our users had codes that already had gpu implementations that are maybe using cuda or open acc, um and we wanted to kind of make sure that the system supported those well and in addition, we are supporting programming models that are maybe the primary choices for people who are targeting those amd gpus like hip, or the intel gpus like dpc, plus plus and sickle, and those will be- uh I mean, those are um enabled and available to uh to kind of execute on the on the perlmutter system as well, um then we had a uh you know, particularly.

A

Important partnership with nvidia to enable openmp offloading support on promoter, so one of the programming models that we really kind of pushed users to adopt when porting and optimizing their code for the corey xeon phi system was openmp because it allowed you to kind of identify and express a another level of parallelism in your application at the at the node level.

A

And we wanted to make sure that that was something that users kind of could continue to use to express parallelism when they're optimizing their applications for perlmutter.

A

And so we worked with nvidia and what was the kind of pgi group to make sure that openmp offloading worked was was kind of supported and worked well in the openmp um pgi and what is now the nvidia hpc compilers, and so um this has led to the release of the openmp production offload compiler as of basically last april, and it continues to improve kind of with every every release of the nvidia hpc sdk.

A

um One of the things I really want to highlight the to the audience here today is um the the ability for essentially anyone out there who has an application that they're wanting to optimize on perlmutter to join one of these community hackathons.

A

um You know, I think hackathons have really proven to be an effective tool for um preparing applications for not just promoter but other other new systems.

A

We use it a lot for corey and I know other centers out there have used hackathons really effectively and they're kind of effective, um not just because of uh oh, like the technical things that you learn at a hackathon, but really just because of the sort of sociology of them, I think, being surrounded by a lot of people who are doing sort of the same thing that that you are is really um kind of a contagious environment.

A

That uh sort of takes you out of your day job for a couple days to really focus on your on your application, with um kind of all the right experts. Looking over your your shoulders so um go check out, wwgpu hackathons.org um nurse staff, I think provided more team mentors than any other institution to these worldwide events in 2020, and it's really allowed us to reach nurse teams kind of all around the country uh and really kind of all around the world.

A

um And you know, during the you know, covid pandemic.

A

uh These hackathons are kind of adapted from what was sort of an in-person event to um uh remote events, uh but I think they've they've managed to really to really be very um useful, very profitable, and um I think that even features of this sort of uh remote hackathon format will end up being incorporated into future future hackathons, even whenever we're kind of at a new normal beyond this pandemic pandemic state, but bottom line go check out, gpu hackathons.org and see if there's an event coming up that that you can attend.

A

There's a number of ways that we are trying to take what we've been learning working with applications. You know partly at hackathons, partly as part of our nisap kind of partnership program and expand and deliver that to the community at large. So um one of the things that we're doing is really working closely with the programming models and languages team, and I think again, you're going to hear a lot about this in the next presentation um to make sure that our community needs are being considered and adapted as the kind of accelerator programming.

A

It's standardized within, like the c-plus plus and the fortran standards, as well as um kind of important frameworks like cocos and openmp, and openacc get developed and expanded.

A

One of the ways that you can take advantage of promutter, even if you don't write your own code- is by utilizing the best possible um the best possible installation or version of the community codes out there that are optimized for for promoter.

A

So there's a number of applications that we provide on promutter, that we've worked with the developers to kind of improve their performance and make sure that uh what we, what we have available, is um really optimized for the for the architecture, and so these are just some of the examples of applications that that we provide um checking out the nurse documentation.

A

Our different training events like like this event today, I think, is a great way um to uh to kind of get the most out of the system, and then I'm going to talk a little bit more about the work that we've been doing with vendor tools and just one one more pitch, because I think this is really one of the takeaway messages that I have for. You is to check out the gpu community hackathons and see if there's one that you could that you could attend.

A

So these are kind of all virtual for the time being, but uh will eventually also be back in in person.

A

um So in terms of wrapping your head around sort of the the two and the the four kind of um concepts I talked about earlier for getting the most out of gpus, um what I think is really important is that you don't do it in a vacuum. Is that you kind of use some tools to help you, and so um you know, as we're thinking about the the optimization challenge that our teams have and porting to and optimizing their applications for for pearl mudder.

A

You know, I think that um we found that they have kind of similar questions and that really what they need is kind of an absolute sense of performance when optimizing applications, and they have questions like how do I know if my performance is good in some overall sense or why am I not getting the peak performance that was advertised on the page, and maybe the most important question is: how do I know when to stop like? When is when is my performance uh good enough that it's not worth investing?

A

You know another. Several months of my time to try to improve it- um and you know I've seen a number of presentations I've, even given a number of presentations where people present a result. That is something like the following, like my application is running two times faster today than it was a year ago um and in some sense that's great.

A

You know it's always better when your application is running faster than it was before, but in another sense it's it's not entirely meaningful because you don't know, you know where you stand in any kind of absolute sense. Like uh was the code. You know terrible to begin with, and now it's a little a little better or uh was it already great and you're? You know you put in some kind of like ninja hacking um uh activity to uh to really get it to perform the the 2x better.

A

So I think, what's really important is to know where you're standing something kind of some absolute sense that can guide your um your next steps um and in particular, as you saw on the gpu there's many potential optimization directions that you can take. So is utilizing the the high bandwidth memory. What's really important for you or is really getting the most um out of the um the the different levels of parallelism available on the gpu?

A

What's really important, how do you know what is the limiting factor in your app's performance and again I think it's it's quite important for productivity to know like when. When is the performance good enough, and when can you? When can you stop, um and so what we found is that uh framing these, these conversations or like framing the answers to the questions in terms of a simple performance model called the roofline model and the gpus is a really good way, a good way to begin thinking about it.

A

um So the roof line basically tells you: what are the performance ceilings on the device based on the characteristics of your application? So you characterize your application on the x-axis in terms of the floating point operations that it does. um You could also think if you don't have an application that does floating point operations, you could also think of it in terms of like integer operations or just other other type of operations.

A

But you want to think about the amount of operations that you're doing per second uh per the amount of data that you need to transfer from some level of the memory uh hierarchy to the to the compute units, um and given that your application has this characteristic there. You then have these different ceilings that limit your performance, based on your ability to utilize different parts of the of the architecture. So this is whether your application really does floating, um or I guess what are called fuse multiply- add operations.

A

So if you have an equal number of multiplies and adds in your in your application- um and the nice thing is that we worked with nvidia to enable this analysis uh directly within the insight performance tool- um and I think you're gonna hear a little bit about this next week um at the training from nvidia about the hpc sdk, and so you can kind of give yourself uh an understanding of where you stand against the kind of potential of the architecture in an absolute sense by running uh by profiling.

A

Your application, with with insight- um and you know one thing I want to highlight about this- is that there's nothing here in this roofline model that you know you couldn't find in uh different profiling tools sort of already.

A

um But we think it's just a really nice clean, easy way to to think about your performance and where you stand in an absolute sense in which directions you might be able to um improve your performance in a in a pretty kind of quick and easy way for for everyone to um to kind of grok or understand.

A

Okay, so uh we've been working with uh pretty closely with a number of applications from a whole bunch of different different science areas and um and working with them to improve their application for the pearl mudder system, and so this is a plot of where some of those applications stand from different algorithmic areas in comparing their performance on the perlmutter system versus the the edison system, honest kind of system per system, um performance uh performance comparison, um and so one of the things I just want to highlight is that uh across all of these different applica application areas or algorithmic areas, we are seeing.

A

Applications that are able to achieve at least sort of a 20x system-wide throughput increase over over edison. Here are some specific applications that we look that we've looked at on a node per node basis.

A

um You can see that application performance is varying anywhere from you know about 20x, for some of the toughest apps to optimize on the gpu, all the way up to like over a thousand a thousand x for some of the machine learning applications that uh are able to take advantage of some of the low precision acceleration that's available on the gpu as well. um So let me I'm just going to end my talk.

A

I know that I'm sort of running a little bit low on time here and so I'm going to end my talk by showing you a few examples of what uh what can be done with uh with with perlmutter. um So the first application I want to talk about is desi.

A

That stands for the dark energy spectroscopic instrument, um and I think this is a particular kind of app to start with, because it's related to the namesake of the of the system promutter himself, who kind of discovered that um not only is the universe expanding, but the rate of expansion is actually accelerating, and um you know the kind of the the the term for that is uh kind of dark, dark energy, and so scientists believe that about 70 of the universe is dark energy.

A

Although we don't really have have a good understanding about about what that is, um and so the desi instrument is going to send nurse data every night for five years, and this data will kind of be used to construct a really detailed map of the universe, to better understand the nature of of dark energy, and so they've been working to accelerate the the kind of key desi data analysis pipeline uh on on perlmutter. um And that's what you see here.

A

So uh they completed kind of a major refactor um and optimized the cpu code, uh with the first gpu port really coming in early 20 2020. And so that's. What you see here is the performance of the the gpu port.

A

They continue to optimize uh the application over a series of kind of types of optimizations targeting different features of the of the gpu and the and the very latest performance of the application on perlmutter uh has a kind of 25x improvement in per node throughput using perlmutter compared to the to the edison baseline or the initial.

A

The initial code on on edison um xfl so x is an application that uses hpc to analyze the data from x-ray free electron lasers, and I think one of the interesting things here is that this is a community who wants to employ kind of hpc to enable real-time data analysis to make decisions and analyze their data, not just after uh kind of their beam time is over or the data collection time is over, but really during the experiment itself or during their data collection time at a at one of the facilities that provides these x-ray free electron lasers.

A

So they really used um uh the gpu systems at nurse to develop now a highly scalable application that uh analyzes these uh x-ray diffraction patterns uh with the runtime. That's really improved by many orders of magnitude, so something that would take 12. You know. 12 hours on edison is now on the order of two minutes on a uh a perlmutter node, um and you can see that they're there they've also been working a lot on the scaling across uh across gpu nodes on the on the system.

A

um Another example is lamp, so lamps does molecular dynamics. Calculations are basically molecules kind of interacting with each other atoms atoms and molecules interacting with each other and.

A

Kind of moving around over time or evolving over time, and so this was an application that had an existing gpu version, but they've been working with nurse and then particularly, I want to highlight the effect of the hackathons that they've attended um in 2019 in 2020 um to improve their their performance um and through those efforts, largely centered around hackathons, uh as well as working with some of the nvidia engineers.

A

They've been able to get a 22x improvement in their gpu performance um to the point where their node versus node speed up uh with a new application on perlmutter, compared to kind of where they were at on edison. To begin with is over 250 50x.

A

So that's a combination of using the new system, as well as the improvements that they put into the into the application, um and after this effort they were really able to achieve uh um some. Some pretty impressive runs um on perlmutter and other gpu systems that they couldn't have really done without all of the gpu acceleration um and improvements that they made to the to the code, um and this was uh recognized as a gordon bell, finalist in um sc uh back in back in november.

A

The super computing conference back in in november- and I think one of the interesting things that they were able to measure is the uh performance in terms of like atom steps or millions of atom steps um per gpu per. I guess per gpu second, and so this sort of takes out the the um the the scale factor um and what you'd ex?

A

What you'd expect or hope to see here would be like a kind of straight line across this graph, that no matter how many gpus you use, you get roughly the same performance in terms of atom steps per gpu, and you can see that they've used three different systems, perlmutter summit and selene. So promutter and selene both have the latest generation a100 gpus, whereas some of that oak ridge was using the previous generation b100 gpus and you can see that they're getting great performance on pro mutter and roughly, like a one.

A

You know 1.6 x, speed up between the previous generation voltage gpus and the pro mutter a100 gpus.

A

um uh I think I'm running a little bit low on time, because I'm supposed to be finishing and maybe taking some questions here, so I'll kind of go through some of these quickly, but this is another comparison between the previous generation v100 gpu from uh summit and the a100 gpus from role moder and again you're, seeing about a factor of 1.6 or in some cases close to a factor of two in in performance improvements.

A

um Let me just end with this last science story here: um accelerating some fluid dynamics, applications with gans on promoter. So I know a number of people in the audience are probably hoping to do some machine learning, um either training or inference on the promoter system. So this is a case where a group is replacing part of a fluid dynamics simulation with a uh with a gan or kind of a trained neural network.

A

um And you know, one of the things I want to to highlight is that again we're seeing big improvements over the previous generation of gpus, the v1, the v100s compared to the to the a100s.

A

In this case a 2.9 x, performance improvement in the machine learning workflow, and I think that that'll be something that a number of you see when.

A

You uh kind of apply your machine learning workflows to to permuter okay, so let me conclude so I think uh the key takeaways is that we've kind of you know nursed as a whole, has been successful in preparing a number of applications for perlmutter.

A

um One of the things I want to highlight to this audience is that we really want to continue working with you to enable the use of pro mudder productively, and I think, really really one of the best ways of doing that is to have you all apply and join these public events that are well. I guess they're virtual for now, but they're kind of being led by institutions all around the country and really all around the world so check out gpu, hackathons.org um and that gpu optimizations.

A

You know that we talked about increasing parallels and understanding and minimizing the the data movement are things that really continue on the themes from corey and are really the most important things that you need to still think about um when optimizing your applications for for promoter um and then, finally, I think you're going to hear a lot about a lot more about this in the next talk.

A

But I really think that you know openmp and then the c plus plus frameworks, for example, are becoming viable options for you to utilize, not just uh pro model productively, but also the upcoming exascale systems at the other doe doe facilities.

A

So I will stop there.