From YouTube: 02 Intro to Perlmutter and GPUs
Description
Part of the Migrating from Cori to Perlmutter Training, December 1, 2022.
Please see https://www.nersc.gov/users/training/events/migrating-from-cori-to-perlmutter-training-dec2022/ for the training day agenda and presentation slides.
Hi everybody. I am the application performance lead at NERSC, and I'm going to give you an introduction to the Perlmutter system, a little bit about programming GPUs, and the different programming models, languages, libraries, and frameworks that are available on Perlmutter. I think we'll close with just a few early science stories, to maybe motivate the audience with what can be, and has already been, done on the system.
I think, as Helen pointed out, if you have questions, put them in the Google Doc. I think we're doing it that way because then the questions stay around after the Zoom meeting is over. Okay, so let me put Perlmutter in perspective.
This is the NERSC system roadmap, and one of the things I want to highlight here is that we're in a transition away from what was a pretty typical HPC system a decade ago: Edison was a multi-core, CPU-based system replicated across thousands of nodes to make an HPC architecture. We've started moving towards energy-efficient, exascale-like architectures in order to meet the demands of the community.

We started that transition with Cori, which was powered by Intel Xeon Phi processors, and then with Perlmutter we have our first ever GPU-accelerated system. We're already beginning the process of procuring the NERSC-10 system, which I think is expected in the 2025-2026 time frame. There's not much we can say about that yet, but we're expecting this trend towards energy-efficient architectures to continue.
If you look at Perlmutter, we have two types of nodes. First, the NVIDIA Ampere GPU-powered nodes, each of which has four GPUs and one CPU per node, and they have a tremendous amount of performance in the node: you can see here over 75, I guess that says teraflops, but I think that's a low number.

And then we have the AMD Milan CPU nodes, which don't have GPUs but do have two CPUs and 256 gigabytes of DDR, so a little bit more DDR per node, but you don't have the high-bandwidth memory that comes with the GPUs. The system as a whole has about 1,500 GPU nodes.
Again, those are nodes with one CPU, the AMD Milan CPU, and four NVIDIA A100 GPUs, and then it has about 3,000 CPU-only nodes. It may seem like there are a lot more CPU nodes than GPU nodes, but keep in mind that most of the actual performance of the system, most of the total available flops, comes from those GPU-powered nodes.

You can see that here in terms of the performance. If you look at the performance of the GPU nodes, you have about 120 total petaflops (I think that's the right number) if you include the capability of the tensor cores within the GPUs, compared to close to, I guess, eight petaflops for the CPU nodes. This top row here, where my mouse is, shows the performance of the CPUs that are within the GPU nodes; so if you ignore the GPUs, you have about another four petaflops of performance.
It's all downstairs in the building that I'm talking to you from, and you can find even more details about the architecture and the different components at the URL here. I'm going to talk a little bit more about the two types of nodes. As I mentioned, most of the performance on the system comes from the GPU nodes, and I mentioned that you have one AMD Milan processor, pictured here, with four NVIDIA A100 GPUs; those are the four Ampere GPUs pictured here.
GPU memory is typically on the order of maybe 16 gigabytes per GPU, and one of the important things about the GPUs is that that memory comes with very, very high bandwidth, so you're able to achieve close to 1,600 gigabytes per second of GPU memory bandwidth. Importantly, as I'm going to discuss in a minute, that bandwidth is much higher than what you get by moving data across the PCI Express bus between the CPU and the GPUs.
The CPU node here looks a little bit simpler: you have two AMD Milan processors. Each one of those is a 64-core part. They support AVX2 instructions, similar to, but not quite as wide a vector width as, what you had on the Intel Xeon Phi processor for Cori, and they do have relatively high memory bandwidth for a CPU. You can see that you have 204 gigabytes per second of memory bandwidth, but of course that's significantly lower than the memory bandwidth that you saw on the GPUs.
The scratch file system can support an aggregate bandwidth of up to five terabytes per second and five million IOPS, and here are some of the other characteristics of the parallel file system, including the number of metadata servers and I/O servers. Unlike on Cori, where you had a spinning-disk-based scratch file system combined with a burst buffer, everything on Perlmutter scratch is flash, which we think is a nice usability improvement.
One of the things I want to highlight here is that we, together with you all and the community, have this common challenge, which is to enable all of these different science users and codes to run efficiently on these advanced architectures, including Cori, now Perlmutter, and eventually the NERSC-10 system. Here is a side-by-side comparison of what that challenge looks like in this generation, moving from the Cori system to Perlmutter.
We're moving up significantly in the total capability of the system, with significantly more memory. One of the really big differences is that the performance per node has gone up dramatically on Perlmutter compared to Cori. These nodes are much, much more dense in terms of their compute power: you go from about three teraflops to 70 teraflops per node, and the processors now include the accelerators.

The number of nodes has actually shrunk, I think, which is consistent with the nodes themselves being more powerful. The other thing I would highlight is the all-flash file system, which I think is actually one of the things that makes running on Perlmutter perhaps even a little bit easier than it was on Cori.
When we were thinking about procuring a GPU system, we started by looking at our workload and asking ourselves what fraction of it could really take advantage of the GPUs. This is where things stood in the 2017-2018 time frame. The good news is that, because GPUs had been around at a number of different facilities and had been used in different places throughout the world, a number of codes had GPU versions available already. But a number of applications were also only partly ported, or in some cases hadn't even started, and so we've been working with a number of these code teams over the last five years or so to make sure that they would be ready for Perlmutter.
Some of what I want to tell you about today is what we've learned from that process, and what some of those lessons are that you can learn from as well. In general, as you're thinking about a CPU-to-GPU transition for an application, one way to think about it is through the analogy I'm showing on this slide. A CPU is something like a Ferrari (I think this is supposed to be a Ferrari in this picture): a car that can go really fast and make really tight turns, but that's really good at doing one task at a time, or taking one person to one place at a time. A GPU is something like this double-decker bus, which is good at taking a lot of different people to one place, not as fast as the CPU would take an individual, but with a higher overall throughput. This is evident if you compare the amount of parallelism that is available in a CPU from our Cori system (this is the Cori Haswell partition) to a GPU on Perlmutter.
If you think about the two sockets of one of the Cori Haswell nodes, you have 64 compute cores, each of which can support up to two threads via Intel's hyper-threading technology, and if you use those AVX2 instructions, you can compute on two 256-bit vectors at a time. That all adds up to about 2,000-way parallelism that a Haswell CPU node is capable of. If you compare that to a single A100 GPU, the equivalent of the 64 cores is basically the 108 SMs, or what they call streaming multiprocessors, on the GPU. Each one of those can support up to 64 warps per SM; only two can be active at a time, but you generally want to oversubscribe the number of warps per SM to keep the GPU busy. And then within each one of those warps you have 32 threads.
So if we do the math here, that adds up, or multiplies up I guess, to about 200,000-way parallelism, which is of course two orders of magnitude bigger than what you see on the CPU node. And this is what I said verbally: you typically want to oversubscribe the GPUs in order to keep them busy and hide any latency.
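As a rough sanity check of those figures (my own arithmetic, assuming single-precision lanes in the AVX2 vectors and the usual 32 threads per warp):

$$\underbrace{64 \text{ cores} \times 2 \text{ threads} \times 2 \times \tfrac{256}{32} \text{ lanes}}_{\text{Haswell node}} \approx 2{,}048\text{-way}, \qquad \underbrace{108 \text{ SMs} \times 64 \text{ warps} \times 32 \text{ threads}}_{\text{A100 GPU}} = 221{,}184\text{-way}.$$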
Another big difference between the CPU and the GPU is the memory bandwidth. If you look at the Haswell CPU node, we had 128 gigabytes of DDR per node and approximately 128 gigabytes per second of memory bandwidth. Now, if you compare that to a single A100 GPU, we have 40 gigabytes of high-bandwidth memory, but significantly higher memory bandwidth, about 1,500 gigabytes per second, so an increase of an order of magnitude.

Again, as I was highlighting when I was talking about the architecture, this should be compared against the speed of moving data between the CPU and GPU across the PCI Express bus, which is about 32 gigabytes per second. So you can move data within the GPU very, very fast, but moving data between the CPU and the GPU can be very slow. The lesson learned here is to try to avoid moving data back and forth frequently.
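To make that lesson concrete, here is a minimal sketch of my own (not from the slides), written with OpenMP target directives, though the same idea applies in CUDA or OpenACC: keep the arrays resident on the GPU across the whole time loop and cross the PCI Express bus only at the beginning and the end. The array names, sizes, and update are hypothetical.

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int n = 1 << 24;
  const int nsteps = 100;
  const double dt = 0.01;
  std::vector<double> u(n, 1.0), v(n, 2.0);
  double *up = u.data(), *vp = v.data();

  // Copy both arrays across PCIe once, up front, and leave them on the GPU.
  #pragma omp target enter data map(to: up[0:n], vp[0:n])

  for (int step = 0; step < nsteps; ++step) {
    // Each launch reads and writes GPU-resident memory at HBM speed;
    // no host/device traffic inside the time loop.
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i)
      up[i] += dt * vp[i];
  }

  // Bring only the result back, once, at the end.
  #pragma omp target exit data map(from: up[0:n]) map(delete: vp[0:n])

  std::printf("u[0] = %f\n", up[0]);
  return 0;
}
```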
In general, the challenge of optimizing an application for the GPU, or bringing an application to the GPU, is that there can be multiple GPU optimization avenues. The two themes I just highlighted are that you need orders of magnitude more parallelism, and that you need to recognize that, while the GPU memory is very fast, moving data back and forth is very slow. Then there are of course other, second-order considerations. Just two examples: there is some overhead in launching kernels (the bits of code that run on the GPU), so you want to make sure that you're giving the GPU enough contiguous work to work on; and even though the memory on the GPU is fast, you still want to take advantage of caches, registers, and shared memory as much as possible.
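As a hedged illustration of the kernel-launch-overhead point (again a sketch of my own, with hypothetical array names): launching one tiny kernel per outer-loop iteration pays the launch cost over and over, whereas collapsing the loops gives the GPU one large launch with enough work to stay busy.

```cpp
#include <vector>

// Poor granularity: one small kernel launch per row, so the per-launch
// overhead is paid nrows times and each launch has little work to do.
void scale_rows(double* a, int nrows, int ncols) {
  for (int i = 0; i < nrows; ++i) {
    #pragma omp target teams distribute parallel for map(tofrom: a[i*ncols:ncols])
    for (int j = 0; j < ncols; ++j)
      a[i * ncols + j] *= 2.0;
  }
}

// Better: a single launch over the whole iteration space via collapse(2),
// giving the GPU enough parallel work to hide latency.
void scale_all(double* a, int nrows, int ncols) {
  #pragma omp target teams distribute parallel for collapse(2) \
      map(tofrom: a[0:nrows*ncols])
  for (int i = 0; i < nrows; ++i)
    for (int j = 0; j < ncols; ++j)
      a[i * ncols + j] *= 2.0;
}

int main() {
  const int nrows = 1 << 10, ncols = 1 << 10;
  std::vector<double> a(static_cast<size_t>(nrows) * ncols, 1.0);
  scale_rows(a.data(), nrows, ncols);
  scale_all(a.data(), nrows, ncols);
  return 0;
}
```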
So we've realized that this can be a multi-axis problem: there are multiple axes along which you need to understand your performance. One of the things we've been working on with our vendor partners is putting together tools that help you quickly profile your application and understand what's limiting your performance. We've worked with NVIDIA in particular on their Nsight profiling tool, and it now integrates what we call the roofline performance model, which plots your application's performance on a two-dimensional plot. That plot considers the characteristics of your algorithm in terms of the data movement versus the compute required, and shows you where you stand against what we would expect for an algorithm with those characteristics. From there you can devise a strategy for improving your overall performance. This is something that's baked into the tools that you can use here at NERSC now.
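For reference, this is the standard form of the roofline bound (not specific to the slides): attainable performance is limited by either the peak compute rate or the product of a kernel's arithmetic intensity $I$ and the memory bandwidth,

$$P_{\text{attainable}} = \min\left(P_{\text{peak}},\ I \times B_{\text{mem}}\right), \qquad I = \frac{\text{floating-point operations}}{\text{bytes moved to and from memory}}.$$

Kernels that sit well below that bound, or far to the left of the ridge point $P_{\text{peak}}/B_{\text{mem}}$, tell you whether to work on data movement or on compute next.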
Our strategy over the last several years, as part of a program called NESAP, which stands for the NERSC Exascale Science Applications Program, has been to partner with a number of application teams to prepare for Perlmutter at a pretty deep level, and then to take what we've learned there and share it with everybody in the community.
This, I just wanted to highlight, was an all-hands-on-deck activity. A number of the people who you see here will be available in the afternoon to work with you and to talk with you about your own applications. And one of the things I really want to highlight here is that these hackathons have been really effective in helping improve various code teams around the country.
We've had two types. The first type is now wrapping up; it was part of the NERSC project itself. The second type is these public GPU hackathons that anybody and everybody can apply to join as a team at this URL. We've actually provided more mentors to these hackathons than, I think, any other institution in the world, and so we'd love to work with you at hackathons all around the country and, to a certain extent, all around the world.
These are organized by NVIDIA, Oak Ridge National Lab, and us here at NERSC, and I think there's probably, on average, about one a month within North America.
So really, one of my take-home messages today is: go check out this URL, and if you think you really want to deep-dive into your application and optimize it for GPUs, I think these can be really good events. Just as an example of how this worked with one of these applications, you can see their performance here over time as they were working on the application and optimizing it for GPUs. One of the things I want to highlight is that the speed-ups they obtained are centered around these different hackathon events: they were able to make a significant difference in just a few days by attending a hackathon, continuing on those improvements, and then attending another hackathon and making significantly more improvements. This actually led the team to do some really large-scale science runs on both Perlmutter and other available GPU systems,
runs that I don't think really could have been done without this new system and the work that the team put into the project. One of the other things I wanted to highlight is that we've been working with teams to do really large-scale science runs on Perlmutter and related GPU systems over the last several years, and one of the outcomes is these really large-scale, state-of-the-art science calculations that are recognized each year at the Supercomputing conference as Gordon Bell Prize finalists or, in some cases, winners.
So, just some observations about this process. I think many applications have been successful in preparing for Perlmutter, and we'd really like to keep engaging with you all in the community to enable you to use the system productively. We really do encourage everyone to join these community GPU hackathons at gpuhackathons.org; I think that's just a great way to get a lot done quickly. We do recognize that optimizing your application for GPUs is not a linear, one-size-fits-all process: there are multiple optimization angles, and profiling, using the roofline tool that I highlighted, is a great way to get started.
The other thing I would note here at the end is that I really think we've seen a lot of energy coming from you all, the community, who are really excited about the potential of Perlmutter. I think that's really great to see, and it's something that we are really excited about as well. So I want to now change gears; I think I'm moving into the second part of this talk, and I'll try not to go over my time too much here. I just want to talk about some of the programming environment that is available on Perlmutter for everybody to use. One of the things I want to highlight about Perlmutter is that, compared to some of the other GPU systems out there that use GPUs from vendors that are maybe not quite as mature as the NVIDIA parts, we have support for essentially every GPU programming model out there on Perlmutter.
So we support Fortran, and we recognize that some applications are written in CUDA Fortran; you can use those on Perlmutter. We realize that a lot of applications out there are written in CUDA, and you can use those on Perlmutter, and also OpenACC; for example, VASP is an important application that's written in OpenACC. We also realize that people have invested a lot in OpenMP in their applications for Cori, and one of the things that we are happy to say about Perlmutter is that you can transition those OpenMP codes and target the GPUs with the new OpenMP 5.x standard. And then for more modern C and C++ applications, you can use frameworks like Kokkos, RAJA, and even DPC++ and SYCL to run on the system as well.
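As one hedged illustration of what such a framework looks like in practice (a minimal Kokkos sketch of my own, not taken from the slides), the same C++ source can target the A100s through Kokkos' CUDA backend or run on the CPU nodes through a host backend:

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    // Views allocate in the default execution space's memory: GPU memory
    // when built with the CUDA backend, host memory otherwise.
    Kokkos::View<double*> x("x", n), y("y", n);

    Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(const int i) {
      x(i) = 1.0;
      y(i) = 2.0;
    });

    // axpy-style update written once, portable across backends.
    Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
      y(i) = 3.0 * x(i) + y(i);
    });

    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}
```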
We also have a pretty robust programming environment around data and analytics. Helen highlighted using Jupyter to log on to the system, and of course Python and the NVIDIA RAPIDS stack are supported. PyTorch and TensorFlow, two of the more important machine learning frameworks, are also really well optimized for the system. We have a set of debuggers and profilers that I highlight here in particular, including the Nsight profiler that I talked about earlier in regard to roofline performance modeling.
We have a really growing segment of our user base that is using Python, and so, through collaboration with NVIDIA and HPE, we've been working to make sure that performant Python acceleration with the GPUs is available on the system. Here are some of the libraries that you can use to do that, including PyTorch and TensorFlow again for AI and machine learning applications. I might just skip this slide, because I think Helen mentioned it in her deck, but really quickly: you can definitely run Jupyter notebooks on Perlmutter in a whole bunch of different configurations, including on a shared CPU node or with exclusive GPU access as well.
So, as I said, we're trying to take a pragmatic approach. We recognize that there are a lot of users out there who have existing GPU codes, and we want to meet them where they are and allow them to run those codes performantly on the system. At the same time, we also want to promote performance-portable programming models that we think will give your code a little bit more longevity, longer legs going forward. Those include OpenMP 4.5 and 5.x support, as well as things like Kokkos, which are big investments within the DOE to allow C++ applications to run on CPUs and GPUs from multiple vendors, now and into the future.
So our strategy has really been, as I said, to support all of those major programming models and languages; to pre-install optimized versions of many of your favorite applications (particularly in the materials science and chemistry space there are a lot of shared application codes, like VASP and LAMMPS, which we talked about, and Quantum ESPRESSO); and to work with the vendors to make the process of understanding your performance on GPUs a little bit more tractable. And I just want to highlight again how useful these GPU hackathons can be.
If you go to gpuhackathons.org, again, you can register for an upcoming event. We in particular have invested some of our own time and resources into enabling a couple of performance-portable programming models. The one I really want to highlight here is OpenMP 4.5 and 5.x, which has gained accelerator support in the NVIDIA HPC compiler stack because of a NERSC collaboration with NVIDIA.
What we did was settle on a well-defined subset of the OpenMP standard for optimized GPU acceleration, and this has now been released in production in the NVIDIA compiler stack that is available on Perlmutter; you can basically use it today.
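To give a flavor of what that looks like in practice, here is a minimal, hedged sketch of an OpenMP target-offload loop of my own (not from the slides). On Perlmutter it could be built with the NVIDIA compilers' GPU-offload flag, for example `nvc++ -mp=gpu daxpy.cpp`, though check the current NERSC documentation for the recommended modules and flags.

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int n = 1 << 24;
  const double a = 3.0;
  std::vector<double> x(n, 1.0), y(n, 2.0);
  double *xp = x.data(), *yp = y.data();

  // One daxpy-style kernel offloaded to the GPU; the map clauses move the
  // arrays across PCIe before and after the launch.
  #pragma omp target teams distribute parallel for \
      map(to: xp[0:n]) map(tofrom: yp[0:n])
  for (int i = 0; i < n; ++i)
    yp[i] = a * xp[i] + yp[i];

  std::printf("y[0] = %f\n", yp[0]);
  return 0;
}
```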
And so I will close here with just maybe a couple of science examples, and I know that I'm just about out of time, so I'll maybe just do one or two. This is actually an application that's near and dear to my heart.
This is a materials science application where the goal is to help design potential future qubits for a quantum computer in complex materials that have some sort of defect in them. In this case the defect is what's called a divacancy, so essentially two atoms are removed from the crystal, and the goal is to understand the quantum states around that defect. To do that, this team needed to simulate unprecedented system sizes, with thousands of atoms, and here you can see some of the results.
You can see the performance improvements as the GPUs evolved into the final GPU part for Perlmutter, and you can see the scaling to essentially a full GPU system like Perlmutter in the plot above. Another example is ExaBiome. This is a metagenomics code where they basically take and analyze the genome of an entire population, which could come from, say, a chunk of dirt or the inside of an animal's gut, where there's a thriving ecosystem of different types of viruses and bacteria. They want to sequence that entire population, and so they have a lot of challenging operations to separate, analyze, and assemble that genome.
And you can see that, even with a set of work that isn't immediately amenable to GPUs, they've been able to make significant improvements. What you're seeing here is a comparison of the CPU versus the GPU performance on Perlmutter for their particular algorithms. I think, let's see, I'm just going to highlight maybe one more here, which is some of the early successes we've had working with different facilities, in particular some of the light sources and the astrophysics or observational facilities.
This includes the LCLS at Stanford and NCEM here at Berkeley Lab, and the Dark Energy Spectroscopic Instrument (DESI) and the LZ projects as well. All of these teams are up and running on Perlmutter and have stories that you can read about on our website.