From YouTube: DESI
Description
Laurie Stephey and Daniel Margala of NERSC present a talk on DESI. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Session Chair: Yan Zhang.
Laurie Stephey: Hey everybody. So, as Yan said, I'm Laurie Stephey. I've been working with the DESI team for about three years, and I'd first like to acknowledge that this has been a team effort: I'll give the first half of this talk and Daniel will give the second, but Rollin Thomas and Stephen Bailey also made significant contributions. What makes this talk different from the rest of the agenda here at GPUs for Science is that DESI is a Python code.
Okay, so DESI, for anyone who's not familiar, is the Dark Energy Spectroscopic Instrument. Its mission is to better understand dark energy, and the best way to do that is by making a 3D map of the universe. DESI will be doing that over the course of a five-year survey, which officially starts next year, although they started taking data in late 2019. The instrument is located at a telescope on Kitt Peak in Arizona, and it will be scanning the sky, taking data, and sending those data to NERSC every night for five years.
As you might imagine, that's a lot of compute time, so it's important to make it as efficient as possible, not just for DESI but for everybody else running at NERSC who also wants to get their jobs through the queue. The initial charge from NESAP was to try to speed their code up on the Knights Landing partition of Cori, but also to keep the code in Python: the DESI developers are astronomers, not necessarily computer scientists, and they really like Python.
So what are your options if you're a Python programmer? Things are changing pretty quickly, so what I'm telling you today will probably be out of date soon, and the landscape was different a year ago when we were starting this effort. At that time, CuPy was kind of the best-documented, most fleshed-out option, so that's what we chose.
A
What
is
coupon,
it's,
basically
a
drop
in
replacement
for
numpy,
so
it
looks
very
much
like
numhai,
but
instead
of
the
back
end
being
c
or
c,
plus
it's
cuda,
so
for
you,
the
python
programmer.
You
mostly
don't
have
to
worry
about
that.
You
just
continue.
Writing
your
code
with
some
caveats,
but
that
that's
what
we
chose
but
in
case
you're
wondering
there's,
there's
quite
a
few
additional
options
out
there.
I don't know if you can see my mouse or not, but on the left is CuPy; in the middle is JAX, a framework out of Google that uses the XLA compiler to generate code, so it's a little more portable; there's also Numba, which I'll talk about in a minute; and if you're a scikit-learn or pandas user, you might consider NVIDIA RAPIDS. So there's really no one answer.
Okay, so if you're curious what CuPy looks like: I told you it was easy, but don't take my word for it. It looks very much like using normal NumPy; you could even import cupy as np if you wanted to. You, as the programmer, just need to make sure that you move your arrays to and from the GPU and keep track of where they are, but otherwise it's pretty straightforward to use.
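As a minimal sketch of the pattern Laurie describes (the array names and the NumPy fallback are my own illustration, not DESI code), the same array expressions can target either backend; `asarray` moves data to the GPU and `asnumpy` copies it back:

```python
import numpy as np

try:
    import cupy as xp   # drop-in replacement for NumPy, CUDA under the hood
    on_gpu = True
except ImportError:
    xp = np             # fall back to NumPy where CuPy/CUDA is unavailable
    on_gpu = False

a_host = np.arange(10.0)       # data starts on the host (CPU)
a_dev = xp.asarray(a_host)     # move to the GPU (a no-op under NumPy)
b_dev = xp.sqrt(a_dev) * 2.0   # reads exactly like ordinary NumPy code
b_host = xp.asnumpy(b_dev) if on_gpu else b_dev   # copy the result back
```

The caveat Laurie mentions is visible here: the programmer, not the library, is responsible for tracking which arrays live on the device.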
The question is: what if you need something that CuPy has not implemented? They've implemented a lot of the NumPy interface, but some more custom, more niche things are not there, so the answer is you need to do it yourself. Okay, fine; there are a lot of different ways you could do that, but the easiest and most Python-like one is a tool called Numba.
A
I
don't
know
if
anybody's
used
number
for
cpus,
it's
a
very
friendly
way
to
jit,
compile
and
really
speed
up
cpu
code
and
it's
similar
for
gpu,
except
that
it
you
really.
You
have
to
think
about
writing
code
for
gpu,
which
is
different
than
for
cpu.
So
what
does
that
mean?
Here's
just
a
very
basic
example
of
what
it
looks
like
to
use
number,
and
this
is
what
we
have
done
for
desi,
so
in
situations
where
they
needed
functions
that
were
not
in
kupai,
we
have
written
kernels
using
number.
You add this decorator, numba.cuda.jit, which tells your code, "hey, this is GPU code." There is this cuda.grid function, which is basically how you communicate with the GPU threads, and one thing you need to double-check is that you're not exceeding your thread block size; that's what the comparison between 0 and 32 is doing. You can't return anything, you can't allocate any memory; you basically can't do most of the things you would like to do in NumPy, so it really starts to look more and more like CUDA and less like Python.
It's still easier and more friendly than some of the other frameworks, but yeah, this is the option we chose for DESI. Okay, so with CuPy and Numba we were able to get their code onto the GPU, but as a main theme of GPUs for Science Day has been, it's not enough just to get it onto the GPU: you want it to run well.
Instead of cutting your task up into lots of small pieces, which is maybe good for a CPU, you want to do the opposite for the GPU, to take advantage of its massively parallel nature.
We kind of saw this coming, and we started a major code refactor in 2019. This was not trivial; we had to rethink how DESI was approaching the problem. From here I'll let Daniel Margala take over and tell you the rest of the story.
Daniel Margala: Okay, yes. I joined this effort a couple of months ago, right around the time the major code refactor was wrapping up, and I helped a lot with implementing tests.
Hopefully this next part will give anyone who has a Python application and is looking to leverage a GPU some good examples of how to get started. One of the first things, as I just mentioned, was implementing a lot of tests to make sure that, as we made these changes, the results were still correct, or at least the same as what the CPU version was producing.
We used some of the tools mentioned in earlier talks, like Nsight Systems, to identify the application bottlenecks and give us clues about where to focus our optimization efforts. As Laurie mentioned, our main strategy was to use CuPy as a drop-in replacement for NumPy, and in some places we also implemented Numba CUDA kernels. I'll also mention the NVIDIA Multi-Process Service, which allowed us to explore saturating GPU utilization as well.
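For context, a typical Nsight Systems invocation for a Python application looks something like the sketch below (the script name is hypothetical, not the actual DESI pipeline entry point); the resulting report can be opened in the Nsight Systems GUI to inspect the timeline:

```shell
# Hypothetical sketch: profile a Python program with Nsight Systems,
# tracing CUDA API calls and NVTX ranges into a report file.
nsys profile -o desi_profile --trace=cuda,nvtx python spectro_pipeline.py
```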
CuPy also supports the NVIDIA Tools Extension (NVTX) markers and ranges, so it's easy to add these little annotations throughout the code to label the timeline produced by the NVIDIA profiling tools, which tells us where the time is being spent while our code is running. There are a couple of different ways of doing this: you have the direct NVTX range push and pop functions, you can add Python decorators to functions so you don't have to modify any part of the function body, and there's also support for the with-statement context manager.
This was the strategy employed on the CPU: basically, to leverage a lot of parallelism by dividing the task into lots of small bits. On the GPU, with these NVTX markers, we can see that most of the time is being spent in this function, which is making a bunch of calls to a Hermitian eigenvalue decomposition, but we also noticed some unexpected performance issues during profiling.
There were some gaps that we intuitively missed when adding these markers, because we didn't expect those regions to be performance-intensive, but by looking at the profile we saw blank spaces we weren't catching. That gave us clues about some unexpected things we could speed up as well.
The main lesson here was that basic NumPy optimizations are still useful: pretty much anywhere there's a for loop, there's probably an opportunity to vectorize some part of the code. This is just one example where a really small change provided a noticeable speedup; because this function is called so many times, even a small change makes a pretty big impact.
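The DESI change itself isn't shown in the transcript, but the pattern is the familiar one; as an illustrative example (the row-norm function is my own, not from the pipeline), a per-element Python loop collapses into a single array expression:

```python
import numpy as np

def row_norms_loop(mat):
    # Loop style: one small NumPy call per row inside a Python for loop.
    out = np.empty(mat.shape[0])
    for i in range(mat.shape[0]):
        out[i] = np.sqrt(np.sum(mat[i] ** 2))
    return out

def row_norms_vectorized(mat):
    # Vectorized style: the same result in one call over the whole array.
    return np.sqrt(np.sum(mat ** 2, axis=1))

mat = np.arange(12.0).reshape(4, 3)
```

On a GPU backend like CuPy the payoff is larger still, since each loop iteration would otherwise launch its own tiny kernel.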
The other topic: because we're running a bunch of these small functions, we wanted to see if we could saturate our GPU usage with NVIDIA's Multi-Process Service (MPS), which essentially lets multiple processes share the GPU by overlapping kernel and memcpy operations. This let us increase the number of processes, dividing all those tasks among more processes making GPU calls, and it gave us better performance just by turning on something completely outside the application.
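Enabling MPS is a matter of environment setup rather than code; a hedged sketch of the usual sequence (directory paths and the 8-rank launch are arbitrary examples, and the launcher at a given site may differ) looks like:

```shell
# Hypothetical sketch: start the MPS control daemon, run several processes
# that share one GPU, then shut the daemon down afterwards.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
nvidia-cuda-mps-control -d            # start the MPS control daemon

srun -n 8 python worker.py            # e.g. 8 ranks sharing the GPU via MPS

echo quit | nvidia-cuda-mps-control   # stop the daemon
```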
For the next part, we were trying to figure out whether there was a way to actually optimize all these calls to the Hermitian eigenvalue decomposition, and we were able to leverage a batched eigenvalue solver in the CUDA cuSolver API to remove this for loop and do all those smaller calls in just one call. There actually wasn't a wrapper for this in CuPy, but it didn't take too much work to add one.
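The cuSolver wrapper itself isn't reproduced here, but the batching idea can be sketched with NumPy's own stacked `eigvalsh`, which decomposes a whole stack of Hermitian matrices in one call instead of a Python loop (the random test matrices are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 4, 4))
batch = a + np.swapaxes(a, -1, -2)   # a stack of 8 symmetric 4x4 matrices

# Loop style: one small decomposition per matrix.
loop_vals = np.stack([np.linalg.eigvalsh(m) for m in batch])

# Batched style: one call decomposes the entire stack at once.
batch_vals = np.linalg.eigvalsh(batch)
```

On the GPU, the batched cuSolver routine plays the same role: one launch processing many small matrices keeps the device busy.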
We looked at how the CuPy wrappers wrap other calls to cuSolver and implemented this ourselves, and this example also demonstrates how we were able to write code that works with both NumPy arrays and CuPy arrays.
CuPy implements helper functions that return the array module your actual data lives in, so you can write code that works with both NumPy arrays and CuPy arrays, which again helps with porting an application to the GPU while staying confident that the results are still the same. This is a demonstration of some of the speedups we've been able to make: by batching the call to the eigenvalue decomposition, we saw about a 1.5x speedup in a variety of different run configurations, and it also demonstrates how MPS helps improve performance as well.
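The helper Daniel refers to is `cupy.get_array_module`; a sketch of the backend-agnostic pattern (the `normalize` function is my own illustration, and the fallback is only there so the sketch runs without CuPy) looks like:

```python
import numpy as np

try:
    import cupy as cp
    get_array_module = cp.get_array_module   # returns numpy or cupy per array
except ImportError:
    def get_array_module(*arrays):
        return np                            # NumPy-only environment

def normalize(arr):
    # xp is numpy for host arrays and cupy for device arrays, so the same
    # function body serves both backends unchanged.
    xp = get_array_module(arr)
    return arr / xp.linalg.norm(arr)

v = normalize(np.array([3.0, 4.0]))
```

Because the same function runs on both backends, CPU results can be compared directly against GPU results in the test suite Daniel describes.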
Yan Zhang (session chair): Not really a question; I think it's very useful, from both of your slides, to see that there are a lot of development tools for the DESI project, and then the profiling and debugging tools. For NESAP projects at NERSC, I think it's very important that NESAP postdocs help identify the development tools for a given science project, and then the matched or best-suited profiling tools, and that part can be transferable to other NESAP projects in the future.