From YouTube: Intro to GPU: 01 Why GPUs
So, you know, NERSC is the mission HPC center for the Department of Energy's Office of Science, and so what that means is that predominantly our mission is to advance science. And what the scientists continually tell us is that they need more and more cycles, more and more compute resources and storage resources, to stay competitive in sort of a global science.
And so this slide is just kind of a motivation of how and why that change is coming about. It's largely driven by the consumption of power and heat dissipation, which is pushing hardware vendors towards kind of lightweight cores and what I was describing as exascale-like architectures. So this plot here shows a trajectory of energy per flop over time, and you can see that these two flat lines here are basically business as usual, whereas the many-core and heterogeneous computing lines are down here.
As you can see, you make a substantial, several-orders-of-magnitude increase in capability by kind of switching from traditional heavyweight server processors to these lightweight processors. And we started this transition with the Cori system at NERSC, which is largely powered by these Intel Knights Landing many-core processors, and, you know, what we found is that Cori has been a boon to science in the U.S.
So, as further motivation for moving towards GPUs in particular: as we think about replacing the Edison system, which was decommissioned around the middle of last year, with the upcoming Perlmutter system, you can see the potential of GPUs to really increase our energy efficiency and the total capability that we can provide to the users. So this is an example of a code running at different scales on the Edison system
and on the Summit system, which has the current generation of NVIDIA GPUs, for a few different problem sizes. If you compare, for example, these blue squares with the red squares, and the blue circles with the red circles, you can see that we're essentially achieving an order of magnitude in energy efficiency, which is along the diagonal in this plot. The y axis is time, the x axis is power, and so time times power is energy.
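Since the plot itself isn't visible in the transcript, here is a minimal sketch of the relationship being described, with made-up placeholder numbers rather than the measured Edison and Summit values: energy is runtime times average power, so points of equal energy lie along the diagonal of a time-versus-power plot.

```python
# Energy consumed by a run is runtime multiplied by average power draw.
# The numbers below are illustrative placeholders, not measured values
# from the slide.

def energy_kj(runtime_s: float, power_kw: float) -> float:
    """Energy in kilojoules: seconds x kilowatts."""
    return runtime_s * power_kw

# Hypothetical CPU run: slower at comparable power -> more energy.
cpu_energy = energy_kj(runtime_s=1000.0, power_kw=300.0)
# Hypothetical GPU run: ~10x faster at the same power draw.
gpu_energy = energy_kj(runtime_s=100.0, power_kw=300.0)

print(f"CPU energy: {cpu_energy:.0f} kJ")
print(f"GPU energy: {gpu_energy:.0f} kJ")
print(f"Energy-efficiency gain: {cpu_energy / gpu_energy:.0f}x")
```

A 10x reduction in runtime at equal power is exactly a 10x reduction in energy, which is the order-of-magnitude gap the blue and red markers show.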
So, if you look along the diagonal, you get the average energy used for the simulation. This is, I think, really exciting: there's a lot of potential gain here as we go from the Edison system that just retired to the upcoming Perlmutter system. So NERSC users have been demonstrating kind of groundbreaking science on the KNL system,
these large-scale, energy-efficient computers. And I want to say that modernizing codes is possible. I mentioned a couple of slides back that, while Cori has been a boon, it also requires effort on behalf of the code teams, and what we found is that it's definitely possible. We kicked off the NESAP program for Cori and found that, on average, when these teams looked at their application's performance, analyzed it, and improved it,
they ended up with, on average, about a 3x improvement. And one of the other takeaways is that when you improve your application targeting one of these sort of exascale-like architectures, you end up learning things about your performance, learning things about your code, that end up improving it everywhere. So even the code running back on a more traditional HPC system like Edison ended up being about two times faster after the changes that you make.
OK! So let's talk about where this increase in performance is coming from on the exascale architectures. On KNL and GPUs, getting performance kind of relies on you effectively using the increased parallelism that is coming in the processors. So, for example, you have on the order of a hundred cores per processor, per chip (or I think the equivalent might be SMs on a GPU), with many
of what I would call hyperthreads on KNL, or warps on a GPU, to hide any latency. In the case of KNL, each one of those cores had what we call a vector processing unit that could process eight-double-precision-wide vectors at a time. So you could basically, instead of operating on a single number, operate on a vector of numbers.
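As a rough illustration of that idea, here is pure Python standing in for what the hardware does in a single instruction: a vector unit applies one operation to a whole chunk of numbers at once instead of one at a time, and the chunk width of 8 below matches the eight-double-wide vectors just described.

```python
VECTOR_WIDTH = 8  # doubles per vector operation, as on KNL

def scalar_scale(data, factor):
    # One number per "instruction": the traditional scalar view.
    return [x * factor for x in data]

def vector_scale(data, factor):
    # One chunk of VECTOR_WIDTH numbers per "instruction": each slice
    # models a single vector operation over all of its lanes.
    out = []
    for i in range(0, len(data), VECTOR_WIDTH):
        chunk = data[i:i + VECTOR_WIDTH]
        out.extend(x * factor for x in chunk)  # conceptually one vector op
    return out

data = [float(i) for i in range(16)]
assert scalar_scale(data, 2.0) == vector_scale(data, 2.0)
```

Both produce the same answer; the point is that the vectorized loop needs 8x fewer "instructions" to do it, which is where the hardware's speedup comes from.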
When you go to a GPU, that's basically a 32-wide vector that I guess we would typically call a warp, and then there are multiple flops available per vector lane using sort of advanced instructions like FMAs, which stands for fused multiply-adds. So you can do a multiply then an add, essentially in one cycle, as well as tensor instructions on the GPUs.
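To see how these factors multiply into a peak flop rate, here is a back-of-envelope sketch; the chip parameters are made up for illustration and are not Cori's or Perlmutter's actual specs. Counting an FMA as two flops per lane per cycle doubles the peak relative to separate multiply and add instructions.

```python
def peak_gflops(cores, vector_lanes, flops_per_lane_per_cycle, clock_ghz):
    """Back-of-envelope peak rate: cores x lanes x flops/lane/cycle x clock."""
    return cores * vector_lanes * flops_per_lane_per_cycle * clock_ghz

# Hypothetical KNL-like chip: 64 cores, 8-wide double-precision vectors,
# an FMA counted as 2 flops per lane per cycle, running at 1.3 GHz.
print(peak_gflops(64, 8, 2, 1.3))  # ~1331 GFLOP/s
```

Dropping FMA (2 -> 1 flop per lane per cycle) halves that number, which is why compilers and kernels work hard to keep the fused-multiply-add units busy.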
So, as a short way to describe this change from sort of traditional CPUs to GPUs on the throughput extreme: you can see the parallelism increasing across every single one of these lines, going from 64 cores, to hundreds of hardware threads, to potentially 64 warps per SM.
When we thought about the procurement of Perlmutter, we spent a lot of time determining how the workload would benefit from the GPUs, and we kind of did this analysis of the workload's readiness. This pie chart on the left here shows the breakdown of the NERSC workload by cycles across the different codes that are used at the center.
So the good news is that a good fraction of the codes that are at the center's core are already GPU-enabled, and then there are some down here where, you know, there's still work to be done, and in some cases we think it could be a challenge to get the GPUs to work.
This was actually, I think, the first time that we've named a system after somebody who's still alive, but I think that Saul was fairly humble. I think one of the things that he was worried about was whether people would have to type his entire last name every time they ssh to Perlmutter, and so I guess he and our director made a compromise: you would be able to log in with just saul.nersc.gov instead of the entire perlmutter.nersc.gov.
So we designed this from kind of the beginning as a system optimized for science, and as I said, part of our mission is to make sure that we can deliver the capability that the science community relies on. So a large fraction of the system will be GPU-accelerated, but there will be some CPU-only nodes to meet the needs of some of the large-scale simulation and data analysis projects that we think are going to require some time before they can port to the GPUs.
So, to tell you a little bit more about the specs of the system: here is the breakdown of the CPU nodes, and also, essentially, the CPU parts of the GPU nodes will be using the next-generation AMD Milan CPU. These are the specs for the current generation, and so for what you can expect, you can kind of put a greater-than-or-equal sign on them.
Essentially, you know, just assume bigger, better, faster for the next generation. And then here is what we're expecting for the GPU. So we'll have a configuration with one CPU and four GPUs per node. These again are the current-generation Volta specs, and so for the next generation I think you can again kind of put a greater-than-or-equal sign and just expect somewhat bigger, better, faster, but the Volta-next product hasn't been formally announced yet.
Okay, and so, as I said earlier, we've begun this process with NESAP, working with our teams on a number of the applications, getting them ready particularly for the GPU partition of Perlmutter. And so this is some of the early progress that we've been making, and it helps answer sort of: why GPUs?
You know, pretty good progress, in that the overall scientific throughput will go up pretty significantly. There are a couple of cases, you know, a challenging code here, for example ATLAS, where currently they're projecting actually worse performance on the GPUs than the CPUs, but
that's actively being tackled. And if we look at the different categories: we essentially have six categories, or different types of applications, and if we compare their projected GPU-to-CPU node performance on Perlmutter in this plot here, we can see that there's a significant performance increase projected for the GPU nodes over the CPU nodes for at least a representative app in each one of the categories.
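One way to see how per-app speedups like these translate into overall scientific throughput is to weight each app by its share of machine cycles; all the shares and speedups below are invented placeholders, not the actual NERSC workload data from the slide.

```python
# Aggregate workload speedup from per-app speedups, weighted by each
# app's share of machine cycles. All numbers here are hypothetical.

def workload_speedup(shares_and_speedups):
    """If app i uses share s_i of cycles and gets speedup x_i, total
    time shrinks to sum(s_i / x_i); throughput gain is the inverse."""
    remaining = sum(share / speedup for share, speedup in shares_and_speedups)
    return 1.0 / remaining

apps = [
    (0.5, 5.0),   # half the cycles, 5x faster on GPUs (hypothetical)
    (0.3, 9.0),   # e.g. a grids-of-particles-style app at 9x
    (0.2, 1.0),   # not yet ported: no speedup
]
print(f"{workload_speedup(apps):.2f}x overall")
```

Note how the unported 20% dominates: even with 5x and 9x app speedups, the aggregate here is only about 3x, which is why getting every category at least somewhat GPU-ready matters so much.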
A
It's
not
surprising
to
see
machine
learning,
really
high
I
think
everybody
knows
that
machine
learning
runs
really
well
on
the
GPUs.
It's
it's
I
think
it's
really
great
to
see
apps
in
each
of
these
other
categories
high,
you
know,
even
even
the
grids
of
particles,
I,
think
this
number
ends
up
being
about
a
9x
speed-up
and
that's
a
pretty
challenging
category.
That's
where
we
include
like
the
climate,
apps,
the
block,
structured,
great
apps
and
like
the
pic
and
particle
and
cell
codes,
for
example,
as
well
as
one
example
of
early
progress.
I'll just highlight this TomoPy application. So TomoPy is a tomographic reconstruction code that is used at, I think, the Advanced Photon Source at Argonne National Lab. Essentially they have a bunch of 2D images, where they kind of rotate a sample in front of a camera, and they try to reconstruct the 3D volume.
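The core idea of that reconstruction can be sketched in a few lines. This is a toy, unfiltered back-projection on a tiny 2D slice with just two axis-aligned views; it is hypothetical and much simplified relative to what TomoPy actually implements (many angles, interpolation, filtering), but it shows the principle: each projection measures sums of the object along one direction, and back-projection smears those sums back across the grid from every angle.

```python
# Toy unfiltered back-projection of a 2D slice from two axis-aligned
# views (0 and 90 degrees). Real reconstruction codes use many angles
# and filtering; this only illustrates the core idea.

def project_rows(img):   # 0-degree view: sum along each row
    return [sum(row) for row in img]

def project_cols(img):   # 90-degree view: sum along each column
    return [sum(col) for col in zip(*img)]

def back_project(row_sums, col_sums):
    n = len(row_sums)
    # Smear each measured sum back across the pixels its ray crossed.
    return [[row_sums[i] + col_sums[j] for j in range(n)] for i in range(n)]

phantom = [   # a bright 2x2 block in an otherwise empty slice
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]
recon = back_project(project_rows(phantom), project_cols(phantom))
# The bright block reappears as the largest values in the reconstruction.
```

With only two angles the result is blurry, which is why real instruments take projections at many rotation angles, and why the per-angle work parallelizes so naturally onto GPUs.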
And, you know, one of the things I want to highlight is that this wasn't entirely just a straight port of the code, in the sense of "let's slap a few directives here and there": they actually did change the algorithm kind of fundamentally to use the GPUs. And I think, in our experience, we've found that, you know, sometimes you can.
So I'm just going to close here with a few practical notes about how you can go about using the GPUs on Cori, or sorry, on the upcoming Perlmutter system. So we have taken a practical approach here at NERSC. We realize that lots of folks have, you know, existing GPU codes or have thought about porting to GPUs in the past, and I think we're basically ready to engage.
You know, there's a compiler available to test on the Cori GPU system, and on Perlmutter in the near future, for this activity. The other thing that I think is important as you're getting started is thinking about what the optimization concepts are in moving towards, you know, an energy-efficient architecture, and GPUs in particular. And I think that in our conversations with users, we've discovered that users kind of want to know the answers to the following questions. So: what part of my code should I move to the GPU?
I'll just basically conclude here with the answer of "why GPUs?". Well, I think the practical answer is basically because they're coming, but I hope I've kind of convinced you that they're coming for exciting reasons, and that Perlmutter is really a system that is optimized for the scientific community. It'll include both these NVIDIA GPU-accelerated nodes, where a large fraction of the capability will be, as well as the CPU-only nodes. And so, with that, I will conclude, and I guess I could take questions here, or any questions.
Yeah, that's a good point. So, you know, some of the specifics of the architecture we can't quite talk about, because, I guess, not all of the products are completely announced. But one of the reasons why we're advocating for OpenMP is because it is kind of a portable approach between the different vendors.
As we talk here, there are two of these that are clearly vendor-specific, and those are CUDA and CUDA Fortran. You know, OpenACC, Kokkos, and RAJA would be potentially good performance-portable options as well. I think maybe your question is just more like a suggestion of something to do throughout the day, is that right?
You know, I would just comment that I do think it's actually nice, in some sense, that the upcoming systems at Argonne and Oak Ridge will also be GPUs, even if they are from different vendors. I think, for the first time in a while, the architectures at least look similar enough that there's kind of hope that you can portably code for all of them.