Description
Sunita Chandrasekaran of the University of Delaware presents a talk on Using OpenACC to accelerate scientific applications on GPUs. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Muaaz Awan
A: I don't believe I caught all of it, but following up on the question from the previous talk: directives can do only so much; programming languages, or handwritten code using CUDA, can do so much more. But that's the balance we have to strike, right?
A: So this talk is going to focus on OpenACC. First, I want to take a second to thank all the collaborators and the funders of the different projects going on in my group. And just a small plug: I noticed there are a lot of OpenMP talks on the agenda.
A: My group is also working with Oak Ridge National Lab on the OpenMP validation and verification test suite for offloading, covering 4.5 and 5.0 features, so do take a look at it. I have been writing test cases, and it's an open source project: pull requests are very welcome. So do take a look at it while we're talking about the offloading side of OpenMP as well.
A: All right, a quick recognition of all my lab members. Our group meetings have come down to Zoom meetings, so we see a lot of square boxes: the members of CRPL, my lab, the Computational Research and Programming Lab. Without them it would be very difficult to do much of this research, so a shout-out to each and every one of them.

A: I'm starting off by talking about the different ways to program, and if we were in the same room I would probably ask for a show of hands on your favorite programming model. Since we can't do that, I'm sure all of you are wondering why we see boxes of different sizes on the slide; that was unintentional.
A: There is no particular reason why the boxes are growing in size; I'm just trying to fit the text in. So we have libraries, we could program using abstractions, we could use directives, we could use programming frameworks, what have you. So it's a good thing that there is more than one way to program.
A: Of the two projects I'll show, one was a relatively easy project and one was not. I'm just trying to give you a flavor of the different directions directives can take, and of how directives are not magical: we have to do our fair share before expecting directives to do their magic, right?
A: So I believe this development cycle holds good for any programming model you choose: analysis, parallelization, optimization. You first take a code, you try to profile it, and you try to figure out which portions of the program you want to accelerate, and then you parallelize them. If you're happy with what you did, great; if you're not happy with what you did, you go back to re-profile it, then you optimize it, and then you re-analyze your code.
A: So that's obviously the kind of pattern we all seem to be taking.
A: Thank you, okay, so that's the cycle. Going forward, a little bit about the OpenACC programming model. Obviously I cannot fit all of it on one slide, but there are plenty of materials out there that you can go back and take a look at. In a nutshell, it's a directive-based programming model, and if you didn't know, it can also be used to run code on CPUs, just like OpenMP, which is why I did not say "accelerate code on GPUs" but rather "heterogeneous systems". That means you keep the code base as it is, and you retarget and recompile for a CPU as well as a GPU.
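As a minimal sketch of what that retargeting looks like in practice (the loop below is a toy example rather than code from the talk, and the PGI-style flags are quoted from memory, so verify them against your compiler's documentation):

    /* Toy OpenACC kernel: the directive is the only change to the code.
     * Retargeting is a recompile, not a rewrite (PGI-style flags, from
     * memory; check your compiler version):
     *   pgcc -acc -ta=multicore saxpy.c   # run parallel across CPU cores
     *   pgcc -acc -ta=tesla     saxpy.c   # offload to an NVIDIA GPU */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }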
A: PGI Community Editions are also available: licensed, yet free. I love this part about the compiler, because I can get my students to download the PGI OpenACC compiler onto their laptops in class when I'm teaching a course, and get them to program while we're talking. That's the wonderful part of the Community Editions. I believe the latest edition is 19.10, and the link points you to the Community Edition.
A: But it's easy to find if you just look it up online. So we're going to look at two projects: one is a biophysics project, the other is a solar physics project. For the biophysics project, I just wanted to give you a short story about where we started and how we ended. We started this project back in, I think, 2017 or 2018, with two students, Eric and Mauricio, the first two in the pictures, who were my undergrad students. We continued to work on the project, and they loved HPC.
A: They loved working on GPUs, they really enjoyed the project, and guess what: they turned around and said, "Hey, how about we pursue PhDs?" Who can say no to that, right? So they are now PhD students in my lab. This project is like a poster child to me, because it drove undergrad students to work on a GPU project and motivated them to look at bigger problems.
A: Robbie (Robert Searles), who was also my PhD student, and Alex, from the chemistry department at UD, served as mentors. Robbie graduated last year and is now with NVIDIA; Alex continues his PhD in chemistry. This is obviously an interdisciplinary project, with the domain science coming from chemistry with Professor Juan Perilla. And there's a timeline going on here: we did not do this in two weeks, and we did not do this in two months. You can see that it started in 2018 with students who had no background in parallel computing; they started from scratch.
A: They were learning directives on the fly, they were learning about profilers, they were learning to use the systems, and it scaled up: we ended up getting wonderful numbers and speedups, and we were able to publish a project that the undergrad students started. So I just wanted to emphasize here that it takes time, but if you're patient, you can get wonderful results.
A: And that's a perk of working with a domain scientist: you get cool graphics, which also rotate. How cool is that? So this is a biophysics project where the idea is to accelerate the prediction of chemical shifts of protein structures.
A: The code is from Ohio State University. We got a serial code, and the code had never seen a GPU before; it's a chemistry project happening at Ohio State. But then we figured out that this was the kind of function that is often used in large molecular dynamics packages, and it had been taking hours to run. The idea was: let's take a look at what's happening in the code; could we do something about it?
A: That's how we started. So only a serial version was available, and now a parallel version using OpenACC on GPUs is available. There is no CUDA version of the code, not yet. What we started with was serial profiling, and this was very important; I'm going to show you a little bit of why. No offense to any domain scientist on the call, but I do want to point out that scientific code is usually written algorithm-facing, right?
A: Scientific code is not typically written for parallel architectures, because you have an algorithmic mind: you're trying to get the equations in order and write code out of them. But when we jump in, we look at the code from a parallel, GPU standpoint, and we start saying "this data structure could have been designed the other way," and so on and so forth. So we spent about a month or two, maybe even more, just cleaning up the serial version of the code, using profilers and everything.
A: And if you look at this little portion of 23, 4, and 14 percent, you see the different slices in the pie chart. The very first goal was to get rid of badly written code that would not work well on a GPU architecture. Cleaning up the code, we were able to toss out the 23 percent spent in get_select, and we were able to restructure the code in a way that shifted the whole profile, and you can see that another major chunk stood out, which was the get_contact function.
A: So we started going in the order in which the percentages stood out; those were the compute-intensive hotspots, if you like, that we wanted to accelerate and go forward with. So we zoomed in to get_contact, and this is a piece of code using OpenACC. In the OpenACC implementation, you use the parallel directive to handle all the parallelization of the loops; then you have the enter and exit data directives, which manage device memory.
A: You also see a reduction clause to facilitate any kind of parallel scalar reduction, and there's text behind the figure that you see: basically, what goes into the outer loop and what goes into the inner loop. And you want to collect routines that don't necessarily need to be run within a loop for several iterations; rather, you can accumulate them. OpenACC also allows different levels of parallelism: gang, worker, and vector, three levels of parallelism. And a loop that does not benefit from parallelization you can mark as sequential, which is the last loop that you see, loop seq.
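Put together, the pattern described above looks roughly like the sketch below. This is illustrative only, not the actual get_contact source from PPM_One; contact_sum and pair_term are made-up names.

    // Illustrative sketch of the OpenACC pattern described above; not the
    // actual get_contact code from PPM_One, and the names are hypothetical.
    #pragma acc routine seq
    static double pair_term(double a, double b) { return a * b; }

    double contact_sum(const double *x, const double *y, int n, int m)
    {
        // enter/exit data directives manage device memory around the kernel
        #pragma acc enter data copyin(x[0:n], y[0:m])

        double total = 0.0;
        // gang/vector parallelism on the outer loop, with a scalar reduction
        #pragma acc parallel loop gang vector present(x, y) reduction(+:total)
        for (int i = 0; i < n; ++i) {
            double d = 0.0;
            // this inner loop does not benefit from parallelization: keep it sequential
            #pragma acc loop seq
            for (int j = 0; j < m; ++j)
                d += pair_term(x[i], y[j]);
            total += d;
        }

        #pragma acc exit data delete(x[0:n], y[0:m])
        return total;
    }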
A: And of course, this is an open source project and it's on GitHub; it's called PPM_One (ppm as in parts per million). Feel free to drop me a note; the entire code is available online to take a look at. I'm jumping into the results quickly, where we see that the serial, unoptimized version literally took about 14 hours for 11.3 million atoms.
A: We didn't even finish that run; we had to kill it because the system wasn't available for that long, but it was definitely 14 hours. And you can see the scalability from 100,000 atoms through 11.3 million atoms; from the left-hand side the bars are serial unoptimized, serial optimized, multicore, NVIDIA Pascal, and NVIDIA Volta V100.
A: So imagine: the serial version was running for 14 hours, and this was a function that a bigger molecular dynamics package was calling some number of times. So if the routine was called, say, 100 times, that's 14 times 100 hours you're going to be running, as opposed to running that routine for 47 seconds if you accelerate it on a GPU.

A: And this code was made possible because of OpenACC and the availability of its implementation on GPUs, by undergrad students who were able to learn the whole thing and apply it. And like I said, it was close to a two-year project by the time we published. The paper is open access, so feel free to download it and read through; there is a ton more material in it. It's in PLOS Computational Biology.
A: If you look up PPM and University of Delaware and OpenACC and GPU, the paper will pop up. So this is a case study that was smooth, I would say; we didn't run into hiccups, just the usual re-profiling and optimization hiccups, nothing major. The next one I want to take you through is a solar physics application, because I want to tell you that it's not always a hunky-dory process: you earn the results through the effort that you put in. So we put in enough effort,
A: and we got that cool speedup in the biophysics project. So this is a solar physics project. Eric Wright, my PhD student, is working on this; he was also one of the students working on the previous project. And this is in collaboration with Dr. Rich Loft and his team at NCAR, as well as the Max Planck Institute for Solar System Research in Germany.
A
This
was
this
is
a
tough
nut
to
crack.
This
has
been
a
an
interesting
project
and
it
is.
It
has
not
been
smooth,
but
I
do
want
to
tell
you
you
know
how
how
we
have
been
approaching
this
problem,
because,
given
a
domain
science
problem,
it's
we
all
know
does
not
matrix
multiplication
or
a
jacobi
iteration
right,
there's
a
lot
more
going
on
in
the
code.
So
the
the
the
crux
of
this
picture
that
you're
seeing
is
basically
a
a
300
million
dollar
nsf
investment
into
the
telescope.
A
You
know
recently
they
had
invested
so
much
money
and
the
telescope
is
hosted
in
hawaii
and
it's
generating
tons
and
tons
of
data
and
the
data
needs
to
be
processed
in
the
same
or
problem,
hopefully
in
the
speed
at
which
the
data
is
being
received.
So
it's
kind
of
a
big
data
problem
and
muram
is
the
code's
name
which
is
max
blank
university
of
chicago
radiative,
magneto
hydrodynamics,
that's
what
the
mhd
stands
for,
and
the
kernel
of
interest
to
encar
in
this
particular
problem
is
the
radiation
transport,
the
rt.
A
You
know
problem
of
this
muram
code,
so
this
is
probably
you
nurse
or
lbl.
Has
you
know
a
lot
of
tutorials
and
webinars
and
there's
also
a
roofline
analysis
coming
up
on
july
8th
some
of
my
students
have
registered,
so
I'm
just
throwing
in
a
bunch
of
you
know,
tools
that
we
have
used
so
that
you're
aware
these
tools
exist
and
you
can
look
up
them
to
learn
more
about
it.
A: So we used nvprof, CUPTI, and the occupancy calculator, that Excel sheet, which can tell you so much about the GPU that it's just awesome. And we loved the PCAST tool; we have used it extensively. It's also a PGI tool, and it lets you compare code on the CPU and the GPU, which is so important when you're trying to figure out where the bug is, where the error is.
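A minimal sketch of how the autocompare flavor of PCAST is driven; the compile flag and environment variable are quoted from recollection of PGI 19.x, so check the PGI documentation for your release:

    /* Minimal PCAST autocompare sketch (flag and variable names from
     * recollection of PGI 19.x; verify against the PGI docs).
     * Compile so each compute region runs on both CPU and GPU, with the
     * results compared automatically:
     *   pgcc -acc -ta=tesla:autocompare pcast_demo.c
     * Tune reporting and tolerance at run time, e.g.:
     *   PGI_COMPARE=summary,rel=8 ./a.out */
    #include <stdio.h>
    #define N 1000

    int main(void)
    {
        float x[N], y[N];
        for (int i = 0; i < N; ++i) x[i] = (float)i;

        #pragma acc parallel loop copyin(x) copyout(y)
        for (int i = 0; i < N; ++i)
            y[i] = 2.0f * x[i] + 1.0f;   /* outputs checked against the CPU run */

        printf("y[%d] = %f\n", N - 1, y[N - 1]);
        return 0;
    }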
A: So these are some of the tools we have used, and the screenshots are from the project. And what stands out? As you can see, the top three rows are the compute-intensive rows. This is again on a single core, just in seconds, paring the problem down to the bare bones: TVD, magnetohydrodynamics (MHD), and the radiation transport (RT).
A: NCAR is interested in accelerating the RT portion of the code, but if you have profiled and looked at a decent-sized code, you would know that as you profile and clean up, and profile and clean up, the profile shifts: you could merge some kernels, you could split some kernels up, you could move things around when you can and when the dependencies are sorted out. So the numbers that you see are not going to stay the same once you optimize and re-profile; but of interest are the top three rows.
A: Oops, sorry, yes. So what you see is nvprof: the ones on the left-hand side are the profiler pictures. Stepping through the profiler picture, you see the different colors for the different slabs of radiation transport, and how nvprof really tells you where the computation and the data management are going on. And that's a gist of the code: how many lines of code, and the tools we used. We are still debugging a problem, and I could talk for hours about the software engineering.
A: We also created a Jupyter notebook to figure out where the issues are, and you see some discrepancies between our code and the ground truth. We needed that, and PCAST helped a lot here. I think these slides are available, but that's just the experimental setup.
A: So what you see is, again, the speedup of RT on an NVIDIA Volta V100 versus a single core and versus a full node of 32 cores. There is a lot of room for improvement, but that's where we are at the moment. When we ran into the bug, the solar physicists were not happy, so we had to fix the bug first before we moved on, which makes total sense. So this was a tough problem, right? It is still a tough problem.
A
We
don't
have
numbers
to
show
off
here,
like
the
biophysics
code,
so
I
want
to
wrap
up
by
saying
that
directives
are
not
magical.
It's
incremental
improvement
is
what
gets
you
to
where
you
want
to
go
to,
but
the
the
fun
part
of
using
openacc
was
to
be
able
to
recompile
and
retarget
in
the
biophysics
code.
We
have
not
changed
a
single
client
in
the
source
code
and
it
runs
on
the
multi-core
and
the
gpu.
A
I
mean
the
idea
of
using
directives
is
to
be
able
to
maintain
yeah
contact
me
if
you
have
any
questions,
and
I
wanted
to
stop
with
this
slide,
which
has
info
on
gpu,
hackathons
and
boot
camps,
and
if
you
go
to
that
particular
website,
you
have
tons
of
more
information
on
the
gpu
hackathons,
that's
ongoing
over
the
period
of
time.
B: Thank you very much, Sunita. This was very inspiring, considering that you've achieved all this using directives. So we have a question here: did you have any issues with branching while porting the radiation code to the GPU?
A: Yes, we did; that's an excellent question. We did: we had to restructure a kernel because of a branching issue. As we know, branching is detrimental to GPU performance, and this is a radiation transport code, so you can imagine how the waves travel from one corner to the other corner. And I think I can suspect where this question is coming from, or who's asking it.
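As a hypothetical illustration of the kind of restructuring this takes (the functions and index arrays below are placeholders, not MURaM code): one common fix is to partition the work by branch outcome on the host, so that each GPU loop is uniform and branch-free.

    /* Hypothetical sketch of reducing branch divergence; not MURaM code. */
    #pragma acc routine seq
    static double slow_path(double v) { return v * v; }     /* placeholder */
    #pragma acc routine seq
    static double fast_path(double v) { return v + 1.0; }   /* placeholder */

    /* Divergent version: neighboring iterations disagree on the branch,
     * so a warp serializes both sides. */
    void divergent(const double *in, const int *flag, double *out, int n)
    {
        #pragma acc parallel loop copyin(in[0:n], flag[0:n]) copyout(out[0:n])
        for (int i = 0; i < n; ++i)
            out[i] = flag[i] ? slow_path(in[i]) : fast_path(in[i]);
    }

    /* Restructured version: idx_a/idx_b are built on the host, one list
     * per branch outcome, so each loop below is branch-free. */
    void uniform(const double *in, double *out, int n,
                 const int *idx_a, int n_a, const int *idx_b, int n_b)
    {
        #pragma acc parallel loop copyin(in[0:n], idx_a[0:n_a]) copy(out[0:n])
        for (int k = 0; k < n_a; ++k)
            out[idx_a[k]] = slow_path(in[idx_a[k]]);

        #pragma acc parallel loop copyin(in[0:n], idx_b[0:n_b]) copy(out[0:n])
        for (int k = 0; k < n_b; ++k)
            out[idx_b[k]] = fast_path(in[idx_b[k]]);
    }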
A: Hi, Ron. So, did performance improve on the CPU? No, it did not. I was actually glad that we did not lose performance on the CPU, and I have the biophysics code in mind here, because the solar physics code has a ton more work to do. We didn't lose performance, but I wouldn't say we necessarily saw a performance improvement. But that's a very interesting question.
B: Yeah, that's a good question, so that's good to know. I think it's time we move on to the next speaker. Thank you very much, Sunita. It was very interesting to hear about all the interesting work your team is doing.