National Energy Research Scientific Computing Center (NERSC) NUG Annual Meeting 2021, 30 Oct 2021

From YouTube: NUG Annual Meeting 2021: CuPy at NERSC, with Daniel Margala

Description

Daniel Margala, a NESAP Postdoc at NERSC, tells of experiences and tips for using CuPy to make use of Perlmutter's GPUs within Python

A

Two topics that I think are very interesting to NERSC users are Python usage and GPUs. Daniel Margala is a NERSC NESAP postdoc who is working with both, and he's going to tell us a bit about CuPy.

A

Hi, can you see my slides okay? Yes, looking good. Okay, thank you for having me. I'm a NESAP postdoc working with the Dark Energy Spectroscopic Instrument (DESI), and today I'm here to help those of you who have Python applications get off the ground on Perlmutter GPUs by telling a little bit of our story of porting the DESI spectral extraction pipeline to use GPUs in preparation for Perlmutter.

A

So, just to give you a quick bit of background on the actual DESI spectral extraction code: it's implemented in Python, uses a lot of special functions like exponentials, Hermite polynomials, and Legendre polynomials, and operates on lots of array-like data structures. They get images from their telescope in New Mexico.

A

They send them over to NERSC and process them nightly, and they also reprocess years' worth of images every few months, so there are a lot of matrix operations and linear algebra going on in their scientific pipeline. As I mentioned, it's all implemented in Python, leveraging the NumPy and SciPy libraries for linear algebra, which often wrap lower-level routines from BLAS or LAPACK.

A

They also leverage Numba for some specialized functions, so that they're compiled and run a little bit faster than native Python code. And for both multi-core and multi-node scaling, their application uses MPI via mpi4py.

A

So we decided to port their application to GPUs using the CuPy and Numba CUDA libraries. Since the developers are already very familiar with NumPy and SciPy, we could use those libraries to translate many pieces of their code directly.

A

And since the APIs are compatible, the developers on DESI would be familiar with the APIs of CuPy and Numba CUDA, so they'd be able to further develop and maintain the application. On the right here, there's the speedup relative to the Edison baseline that we used to track progress during this work.

A

I just want to mention that it wasn't a straight find-and-replace of NumPy with CuPy. It was an iterative process where we would port pieces of the application, test the code, profile the code, and track our progress. So it's not something that we just hammered out in one weekend or one month. It was an iterative process of learning about the GPUs and learning what changes we could make to fully utilize the GPUs that were available to us.

A

At the end of this work so far, we've seen a 25x improvement in per-node data throughput, which is the figure of merit that DESI uses to track progress, using the A100 GPUs on the Perlmutter GPU nodes compared to an Edison CPU baseline.

A

So, getting started with GPUs, you actually have many options. Some of these were mentioned in the previous talk, like TensorFlow, PyTorch, and JAX. There are a lot of different libraries out there that will give you access to the GPUs in Python, and many of these are actually interoperable. They can work well with each other, and there are some standards and efforts in the community to make sure that these different libraries are able to efficiently share array-like objects.

A

So if you want to do something in one of these libraries, you can also bring in a different library, and you don't have to keep changing the data format and all of that.

A

So, as I mentioned earlier, we chose CuPy and Numba CUDA because they're natural extensions of the libraries that DESI was already leveraging in their Python code.

A

So what is CuPy? Essentially, CuPy is trying to implement the NumPy API, which is the foundation of a lot of scientific computing in Python, but it lets you work with array-like objects on the GPU, and it implements many features of NumPy and SciPy. Under the hood, when you call a function, CuPy compiles the CUDA kernel on the fly and then caches that result.

A

So when you reuse that in the future, you don't have to recompile it again. It's a form of just-in-time compilation.

A

So here's a long list. We don't have to go through all of these, but basically most of the functionality of NumPy is available to you using CuPy. There are some differences, and you'd have to go and look at the CuPy documentation for a list of what those are. But for the most part, many of the features in NumPy are implemented in CuPy and let you leverage a lot of things, especially the special CUDA libraries like cuBLAS and cuSOLVER.

A

A lot of the linear algebra routines are wrappers around those, just like NumPy and SciPy wrap the lower-level BLAS and LAPACK libraries. As I mentioned, there's also support for many functions in SciPy.

A

This list is from one of the welcome pages in the CuPy docs. I will mention that not all of the functions in SciPy are actually implemented in all these domains, but they do have a comparison table that shows you what's available and what's not.

A

So how do you use CuPy on Perlmutter? For the most up-to-date information, it's best to check the NERSC docs. There's a page that's being actively maintained, the "Using Python on Perlmutter" page that I've linked to at the bottom. If you have trouble, you can always open a ticket at help.nersc.gov.

A

But it's essentially as simple as logging into Perlmutter, making sure that your CUDA and Python modules are loaded, and creating a new environment to install CuPy into. The main gotcha is making sure that you install a version of CuPy that's compatible with whatever CUDA module is loaded. Right now the default CUDA module on Perlmutter is 11.3.0.

A

So when you pip install CuPy, you should make sure it's cupy-cuda113.

A

And then you can just import cupy as cp, similar to how you might import numpy as np, and then you can start computing on the GPU.
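As a minimal sketch of that first step (not taken from the slides), the snippet below imports CuPy under the usual `cp` alias and runs a small array computation. The fallback to NumPy when no usable GPU is found, and the alias `xp`, are assumptions added here so that the example also runs on a machine without CUDA.

```python
import numpy as np

try:
    import cupy as cp
    # Raises (or reports zero devices) when no usable CUDA device is visible.
    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device visible")
    xp = cp
except Exception:
    # Assumption for this sketch: fall back to NumPy so the code still runs.
    xp = np

# The same array code works with either module bound to xp.
a = xp.arange(6, dtype=xp.float64).reshape(2, 3)
b = xp.ones((2, 3))
c = a + b                # executes on the GPU when xp is CuPy
total = float(c.sum())   # float() brings the scalar result back to the host

print(xp.__name__, total)
```

Writing the computation against a single module alias is one common way to keep the CPU and GPU code paths identical while porting.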

A

So one of the main things to think about when you're first getting started on the GPU and using CuPy is this concept that you have some objects whose memory resides on the host, or CPU, and other objects that reside on the GPU.

A

Is there a question? Sorry, just a five-minute warning. Oh, okay, thank you. So that's one thing to keep in mind as you get started. Another thing to be aware of is that transferring data between the host and the GPU can be expensive performance-wise. So it's important to keep that in mind when you're getting started, and to minimize, if you can, the amount of data that you're transferring back and forth between the host and the device.
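A short sketch of the host/device distinction, assuming the standard CuPy calls `cp.asarray` (host to device) and `cp.asnumpy` (device to host). The variable names and the NumPy fallback branch are illustrative assumptions, not code from the talk. The point is to copy once in, do several operations on the device, and copy once out.

```python
import numpy as np

try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device visible")
except Exception:
    cp = None  # assumption for this sketch: run on the CPU if no GPU is usable

host_in = np.linspace(0.0, 1.0, 5)      # lives in host (CPU) memory

if cp is not None:
    dev = cp.asarray(host_in)           # one host -> device copy
    dev = cp.exp(dev) * 2.0             # intermediate results stay on the GPU
    dev = dev - 1.0
    host_out = cp.asnumpy(dev)          # one device -> host copy at the end
else:
    host_out = np.exp(host_in) * 2.0 - 1.0

print(type(host_out).__name__, host_out)
```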

A

So when should you use CuPy versus NumPy? Here I've done a simple thing where I create two two-dimensional arrays of random numbers, a and b, and then we have a simple function that adds them together element-wise. The blue line shows the amount of time it takes to run this operation in NumPy for various sizes of 2D matrices, and the orange line is the equivalent using CuPy.

A

For smaller sizes, below 100 or a few hundred elements, NumPy is actually much faster, and it isn't until you get to much larger array sizes that CuPy is a few orders of magnitude faster at doing such an operation.

A

So again, it's not as simple as replacing all NumPy operations with CuPy. It's important to be aware of when CuPy will be useful performance-wise, and that's typically at larger array sizes. So it's always important to measure these sorts of things as you start to use the GPU. And then here's just another example.

A

Instead of adding those two matrices, we're doing a matrix multiplication, and you can see that the tradeoff happens at a smaller size because it's a different operation.
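The kind of measurement described above can be sketched roughly as follows. This is not the benchmark from the slides: the helper names `mean_seconds` and `compare`, the array sizes, and the repeat count are all assumptions. GPU work is asynchronous, so the sketch synchronizes the default stream before reading the clock; when no usable GPU is found, only the NumPy timings are produced.

```python
import time
import numpy as np

try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device visible")
except Exception:
    cp = None  # no usable GPU: only NumPy is timed


def mean_seconds(xp, op, n, repeats=3):
    """Average wall time of op(a, b) on two random n x n arrays."""
    a = xp.random.rand(n, n)
    b = xp.random.rand(n, n)
    # Wait for queued GPU work before and after timing; no-op for NumPy.
    sync = cp.cuda.Stream.null.synchronize if xp is cp else (lambda: None)
    op(a, b)  # warm-up call (includes CuPy kernel compilation on first use)
    sync()
    start = time.perf_counter()
    for _ in range(repeats):
        op(a, b)
    sync()
    return (time.perf_counter() - start) / repeats


def compare(sizes=(16, 128, 512)):
    """Time element-wise add and matrix multiply for each backend and size."""
    ops = {"add": lambda a, b: a + b, "matmul": lambda a, b: a @ b}
    backends = [np] + ([cp] if cp is not None else [])
    rows = []
    for name, op in ops.items():
        for n in sizes:
            for xp in backends:
                rows.append((name, n, xp.__name__, mean_seconds(xp, op, n)))
    return rows


for row in compare():
    print("%-6s n=%-5d %-6s %.2e s" % row)
```

The crossover size depends on the operation and the hardware, which is why it is worth measuring on the actual application.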

A

As I mentioned before, you can mix and match and combine these different libraries. Here I have an example of combining code that uses the CuPy API and Numba CUDA just-in-time compilation with a really cool feature of NumPy, the array function protocol (__array_function__).

A

This protocol allows you to use the NumPy API on objects that support NumPy's __array_function__, which the CuPy ndarrays do. On the right here, I'm initializing an array on the GPU device using CuPy, then we're passing that array to a kernel that's compiled using Numba CUDA, and then we're using the NumPy API on the result of the Numba CUDA kernel.
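A sketch in the spirit of that example, not a copy of the slide: the kernel `add_one`, the helper `run`, and the launch configuration are hypothetical. It assumes that CuPy and Numba are both installed and that a GPU is usable. Otherwise it falls back to an equivalent NumPy computation so that the code still runs.

```python
import numpy as np

try:
    import cupy as cp
    from numba import cuda
    if not cuda.is_available():
        raise RuntimeError("no CUDA device visible")
    HAVE_GPU = True
except Exception:
    HAVE_GPU = False  # assumption for this sketch: CPU-only fallback below

if HAVE_GPU:
    @cuda.jit
    def add_one(x, out):
        # One thread per element; guard against threads past the end.
        i = cuda.grid(1)
        if i < x.shape[0]:
            out[i] = x[i] + 1.0


def run(n=1024):
    if HAVE_GPU:
        x = cp.arange(n, dtype=cp.float64)   # array created on the GPU by CuPy
        out = cp.empty_like(x)
        threads = 128
        blocks = (n + threads - 1) // threads
        # Numba accepts CuPy arrays directly (CUDA Array Interface).
        add_one[blocks, threads](x, out)
        # np.sum dispatches to CuPy through __array_function__.
        return float(np.sum(out))
    x = np.arange(n, dtype=np.float64)
    return float(np.sum(x + 1.0))


print(run())
```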

A

And then, as I mentioned, in our DESI work an important piece of porting the application to the GPU was making measurements and profiling the application to figure out strategically what we should work on next to get the most bang for our buck, time-wise.

A

One really helpful way of doing that is to profile the application using NVIDIA's Nsight Systems, and to use the CUDA NVTX markers as well, which make it a lot easier to label the timelines that come out of the profiler.

A

So we can label the regions that are of interest to us, and there's a bunch of different ways to do that. Here I just have an example of how to actually run nsys profile with your Python application, and here's quickly what the Nsight Systems profile looks like. The area marked in the orange box is what we get out of the NVTX markers.
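One possible way to label regions, sketched under the assumption that CuPy's `cupy.cuda.nvtx.RangePush` and `RangePop` are available. The context manager `marked_region` and the toy `pipeline` function are hypothetical names introduced for illustration. The markers become no-ops when CuPy or a GPU is missing. Running a script like this under Nsight Systems (for example `nsys profile python my_script.py`, with a hypothetical script name) should show the named ranges on the timeline.

```python
import contextlib
import numpy as np

try:
    import cupy as cp
    from cupy.cuda import nvtx
    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device visible")
    xp = cp
except Exception:
    nvtx = None  # assumption for this sketch: markers become no-ops
    xp = np


@contextlib.contextmanager
def marked_region(name):
    """Wrap a block of code in a named NVTX range (no-op without CuPy)."""
    if nvtx is not None:
        nvtx.RangePush(name)
    try:
        yield
    finally:
        if nvtx is not None:
            nvtx.RangePop()


def pipeline(n=256):
    with marked_region("make_inputs"):
        a = xp.random.rand(n, n)
    with marked_region("matmul"):
        b = a @ a.T
    with marked_region("reduce"):
        return float(b.trace())


print(pipeline())
```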

A

And so we can go and study that and figure out what we want to do next: after profiling our application, we try to figure out where we should focus our development efforts. There are many more topics in CuPy. You can go to the CuPy documentation to dive into some of those more advanced topics. And that's all I have.

A