From YouTube: Python on GPUs JAX
Description
Part of Data Day 2022, October 26-27, 2022.
Please see https://www.nersc.gov/users/training/data-day/data-day-2022/ for the training agenda and presentation slides.
So, my name is Nestor. I'm currently a postdoc at NERSC, and I'm working on porting the TOAST software framework to Perlmutter, and in particular to GPU, which led me to think about how we port Python code to GPU efficiently. The thing I'm going to show you today is JAX. To give you a slight teaser right now: we ported a bunch of kernels and we got things like a times-16 speedup fairly easily, which is nice. But first, how do you port Python code to GPU?
What are the various approaches that are available? The easiest one is to use off-the-shelf kernels. Maybe you have some NumPy or SciPy code: use CuPy. If you're using pandas and scikit-learn, you can use RAPIDS, and that's going to be super easy to use because basically it's plug and play. It's perfect if there is a kernel that already does the thing you want to solve in your system: you can use an existing kernel and your problem is solved, so that's very good.
The problem comes when you have something more specific in mind, when you want to write something more tailored to your particular scientific problem. Then you can try to combine those: you can try to use CuPy as you would use NumPy to build your application, your particular algorithm. But then you are allocating a lot more intermediate values, you have more data transfers to the GPU, and the performance starts to degrade very quickly.
Another idea is to use a deep-learning library like PyTorch or TensorFlow, which works great if you're doing deep learning, obviously. They are very easy to use, well documented, and there are thousands of users, so that's very nice, and clearly they are doing things on GPU in Python, so something is working for them. They have the most useful numerical building blocks: fast Fourier transform, linear algebra, random number generators, all the things we want in numerical applications. But then you try to use them to actually do GPU computing.
What you realize is that they tend to have a very large overhead, because they're optimized for a different use case. Most of the time, the thing that's expensive for them is going to be the gradient computation during training, so they're going to make that as cheap as they can. For example, if you're using PyTorch, even if you tell PyTorch that you don't care about the gradients, it's still going to be more expensive than if you were able to cut out the parts of the code that are used for the gradients.
So there is some overhead coming from that, which is non-trivial to eliminate. Maybe you know what you're doing with a GPU and you're thinking: okay, let's write the kernel in CUDA, OpenCL, or SYCL, something like that. It's going to be very low level but also very fast, and then you link it into Python using something like PyOpenCL or PyCUDA, and that gives you great performance, if you know what you're doing.
But it's going to make it much harder to use those numerical building blocks, things like random numbers, fast Fourier transforms, linear algebra. There are linear algebra CUDA libraries, for example, but if you need to use a linear solver inside a loop inside a loop inside a loop in your kernel, then you're in for a world of pain. It also requires a lot of expertise in high-performance computing, because while these languages let you write code that is very performant, actually reaching that performance takes a lot of time and experience.
If you don't know what shared memory is on a GPU, same thing: you need to learn a lot of things before you start achieving interesting performance. It's also harder to get the code correct, because you're working at a lower level. And once you've done all of that, you still need to manage to compile your code with something other than Python and link it into your Python.
Staying in Python, you could use Numba. It has a low-level, CUDA-like syntax, and I'm saying low level because all of the CUDA constructs, shared memory and the rest, are going to be there. The nice thing is you're going to be in Python, so you don't have to care about compiling and linking, that is taken care of for you, but you're still very low level, and it's still very hard to call all of those numerical building blocks.
So my question was: is there a way to have good GPU performance, portability, and productivity in Python? Is there a solution? And the thing I found that worked really nicely for our particular use case is JAX. So what is JAX? JAX is a Python library that lets you write code in Python and then run it on whatever hardware you have, which could be your CPU, a GPU from NVIDIA or AMD, a TPU if you are at Google, or something else.
It also works on specialized hardware for deep learning. At first it was developed by Google as a building block for their deep-learning frameworks, as something they would be able to use to write Python and run it on GPU, but it is seeing wide use in numerical applications, things like molecular dynamics or computational fluid dynamics.
So what does it look like? It looks a lot like NumPy. Here we have a code sample, and you see we are importing something called numpy from jax. We are generating some random numbers (we will come back to that later), and then we are calling dot on the array and array.T, and people used to NumPy are going to recognize that this is exactly what they would write in NumPy, except this is going to run on GPU, which is very nice.
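A minimal sketch of that kind of code (the exact variable names from the slide are not reproduced here, so the names below are illustrative):

```python
# Sketch of the NumPy-like JAX code described above (names illustrative).
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)               # JAX uses explicit PRNG keys
x = jax.random.normal(key, (1000, 1000))  # random matrix
y = jnp.dot(x, x.T)                       # the same call you would write in NumPy,
print(y.shape)                            # but it runs on GPU/TPU if available
```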
What's happening under the hood is that you have a just-in-time compiler. Whenever you call a JAX function, it's going to be traced: JAX is going to look at the shape of your inputs, then the shape of the output of each computation. Say you have a sum: it looks at the shape of the input and the shape of the output of the sum, and it builds the computational graph like that. Once there is a computational graph, it passes it to the XLA compiler, which is going to actually compile it for your current hardware.
So what you have is a just-in-time compiler: compilation happens at runtime, which has a price. In C++, if your code takes 15 minutes to compile, that's fine. In JAX, if it took 15 minutes to compile, your code would be 15 minutes slower. In practice you're below one second, unless you have a problem somewhere. It also means the input sizes must be known to the JIT, and not only the input sizes, but the sizes of everything inside the computation.
So you cannot have the size of an intermediate result be a function of the data, which sounded very restrictive when we started, and then we realized that with padding, masking, and compiling for bucketed sizes, you can work around that. Also, the fact that you cannot have control flow that depends on the data, at least for most things, means that loops and tests are restricted, and that also felt very restrictive when we started, until we realized JAX provides a bunch of functions you can work with.
JAX also doesn't have side effects or in-place modifications, and it focuses on TPU optimization, meaning the JAX compiler is really good on TPUs and very nice on GPU, but on CPU we found we get basically single-core C++ performance, which is better than Python but not acceptable if you're running on Perlmutter's CPUs. So it depends on your use case. Given all of that, how can we actually use JAX, because that's a long list of limitations?
And is it worth it to actually use it? So I told you JAX looks a lot like NumPy: here you have JAX code that mirrors the NumPy equivalent, and that's really nice, because if you already know NumPy you're 90% of the way there. And if you don't know how to write something in JAX, very often you can just search on Google for how to do it in NumPy, and the answer is going to be very close to the JAX one, which is very useful because it jump-starts you into using the library very, very quickly.
Now let's look at where it diverges. I told you you're not allowed to mutate things. So if you have an array and you really want to update it, say add one to all of its values, you have to create a new array, and you can name it like the previous one. That way it's functionally identical to mutation, at least inside your function.
If you want to modify individual indices, JAX provides some functions to deal with that. There is an add function: you can modify the array at an index, or at a number of indices, and this is going to produce a new array; there is no actual mutability happening. If you want to update something, all the usual increment operations are there. And something that's interesting is that JAX makes parallelism transparent. If you were to write a CUDA kernel and you wanted to modify a bunch of indices, some of those indices could be identical.
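As a small sketch of this functional update API (the array values here are illustrative):

```python
import jax.numpy as jnp

x = jnp.zeros(5)
y = x.at[0].set(1.0)                      # "write" at index 0: returns a NEW array
z = y.at[jnp.array([1, 1, 2])].add(1.0)   # duplicate indices are accumulated safely
print(z)                                  # [1. 2. 1. 0. 0.]
```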
In a CUDA kernel, you would need to think: oh, do I need synchronization here, maybe I should use an atomic add. In JAX, parallelism is transparent and JAX is going to take care of that for you: this operation is going to behave atomically, and if the compiler decides these indices might cause problems, it's going to put safety measures on top of it for you. You don't have to think about it at any point.
Also, I told you that JAX code is compiled, and that's important, because if you run JAX just like that, interpreted, it's going to be very slow; you need to compile it for it to be fast. To compile it, you have a function called jit. So here we have a demo function called f, and to get a compiled version of that we call jit(f); we get a jitted function, and whenever we call it on something, it's going to trace the function by running it on an abstract value.
That abstract value is basically a shape that doesn't contain any data, so tracing is going to trigger our print statement. Once the function is traced, it's going to be sent to the compiler, the compiled version is going to be created and run, and the next time we call our function on inputs with the same size, JAX is going to detect: oh, we have already compiled for that given size, so let's reuse it. If you pass new inputs with different sizes to that function, it is going to be recompiled.
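A small sketch of that tracing-and-caching behavior (the function f is a made-up example, and the counter is only there to make the number of traces observable):

```python
import jax
import jax.numpy as jnp

trace_count = []                 # Python side effects only happen while tracing

def f(x):
    trace_count.append(1)        # runs during tracing, not on cached calls
    return jnp.sum(x * x)

f_jit = jax.jit(f)
f_jit(jnp.ones(3))               # traces and compiles for shape (3,)
f_jit(jnp.zeros(3))              # same shape: reuses the compiled version
f_jit(jnp.ones(4))               # new shape: traced and compiled again
print(len(trace_count))          # 2
```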
So that's something you have to be aware of, but on the plus side, the compiler, knowing the sizes, can do some very clever things. It can detect: oh, that loop is going to be tiny, so we're not going to bother parallelizing that one, let's parallelize this other one. And when you pass other inputs it can say: oh, in that case this is the loop, the outer loop, where we should be focusing. So it lets the compiler be very clever, which is nice.
Also, I told you you cannot depend on values inside a jitted function. One exception to that is static values. You can tell the compiler: this value is always going to be the same, use it when you're tracing. So here, for example, you pass a Boolean and we declare that it is going to be static, and so you can test whatever you want on that Boolean; things work, and that test is going to be evaluated at tracing time, which also helps the optimizer take care of things, to simplify things.
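A sketch of a static argument (the function and argument names are illustrative):

```python
import jax
import jax.numpy as jnp
from functools import partial

@partial(jax.jit, static_argnums=1)  # argument 1 is declared static
def scale(x, double):
    if double:                       # a plain Python test, resolved at trace time
        return x * 2.0
    return x

print(scale(jnp.ones(3), True))      # [2. 2. 2.]
```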
You can also do something which is called donating inputs. That's not often useful, but when it is useful, it's going to be a performance benefit, sometimes a significant one. By default, a jitted function takes an input, returns an output, and that output is going to be a new array. But if you know your input is never going to be reused, you can tell the compiler: I don't care about my inputs, feel free to reuse them, to recycle them.
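A sketch of input donation via jit's donate_argnums option (the function is illustrative; note that donation is honored on GPU/TPU and may be ignored with a warning on CPU):

```python
import jax
import jax.numpy as jnp

# donate_argnums tells the compiler it may recycle the input's buffer.
step = jax.jit(lambda x: x + 1.0, donate_argnums=0)

x = jnp.ones(1000)
y = step(x)            # x's buffer may be reused to store y
print(float(y[0]))     # 2.0
```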
Then the compiler is going to be able to say: oh, this should be an in-place operation. That's important, because here you saw we created a new array for every update, and some people might be thinking: oh, that's going to be terrible, we're using a lot of memory for nothing. If you do that inside the compiled section, the compiler is going to be able to say: oh, we can do an in-place modification here; or sometimes: it's going to be useful to have both the previous version and the new version.
Given what we're doing, let's keep both versions. So the compiler has the leeway to be more clever than we are, which is nice, and we can donate inputs. To deal with tests, to actually test on the value of things, you have two main ways. You can call where, as in NumPy, where you give it a mask, a Boolean, something to return in the true case and something to return in the false case, and that's useful when the computation of both branches is fairly inexpensive, which on GPU most computations are.
If they are actually expensive, there is a lax.cond function: you pass it a Boolean, the function to run in the true case, the function to run in the false case, and the inputs to give to either of those functions, and that deals with tests. There is also a way to deal with loops, same thing: you have while-loop and for-loop primitives. And, more importantly, there are some vectorization primitives.
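A sketch of the two approaches to data-dependent tests (the values are illustrative):

```python
import jax.numpy as jnp
from jax import lax

x = jnp.array([-2.0, 3.0])

# Cheap branches: evaluate both sides, select element-wise.
y = jnp.where(x > 0, x, -x)              # absolute value: [2. 3.]

# Expensive branches: lax.cond only runs the selected function.
z = lax.cond(jnp.all(x > 0),
             lambda v: v * 10.0,         # true-case function
             lambda v: v,                # false-case function
             x)                          # operand passed to either branch
print(y, z)
```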
To give you an example: here we have a double loop on i and j, and in res[i, j] of the result we apply the body function to a slice of x and a slice of y. If we call xmap, which is something that's going to vectorize our kernel, that body function, we can tell it: on the first input, here is how we are going to slice it.
We slice the second input like this, and the output is going to be organized like that. xmap is going to take our body function and turn it into a function that can process the full block of indices at once. That's a pattern you find surprisingly often in code, and the more you use it, the more you see it everywhere. The GPU really loves that, because basically you're able to run your loop as a single block on the GPU, and that's very performant. Here I'm using xmap.
There is also vmap, which is a less powerful version of xmap that works on a single index. And another thing is pmap: if you have your own node with, say, four GPUs, you can use pmap and it's going to run in parallel on your four GPUs. So that's something that you can do in JAX.
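A sketch of replacing an explicit double loop with vmap (the body function and the inputs are illustrative, not the kernel from the talk):

```python
import jax
import jax.numpy as jnp

def body(xi, yj):
    return jnp.dot(xi, yj)     # scalar result for one (i, j) pair

x = jnp.ones((3, 4))
y = jnp.ones((5, 4))

# res[i, j] = body(x[i], y[j]) with no Python loop: two nested vmaps.
res = jax.vmap(lambda xi: jax.vmap(lambda yj: body(xi, yj))(y))(x)
print(res.shape)               # (3, 5)
```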
Then there is automatic differentiation, which is where JAX differs from most of the deep-learning frameworks, where, as I told you, a lot of the overhead comes just from all the infrastructure to compute gradients. JAX does gradient computation by code transformation, meaning there is no overhead for code that does not care about the gradient, because nothing is there to deal with the gradients. When you want the gradient of a function, you call grad on that function, and it's going to transform your code and produce something that's about as fast as an analytic solution.
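A minimal sketch of grad (the function is a made-up example):

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(x ** 2)     # f(x) = sum of x_i^2, so grad f(x) = 2x

grad_f = jax.grad(f)           # a code transformation: no cost if never called
print(grad_f(jnp.array([1.0, 2.0, 3.0])))   # [2. 4. 6.]
```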
So you get very, very fast gradient computation, and when you don't care about gradients there is no overhead: it's a zero-cost abstraction, which is very nice. Obviously, some functions cannot be differentiated, because some things just cannot be done. And to end the how-to-use-JAX section, there are some very simple performance tricks that are worth thinking about.
If we are not mutating data, we can have compilers that are much more clever, and the idea behind the JAX compiler is to say: okay, let's do that, let's make our life a bit harder so that the compiler can be more clever. Very often you can add maybe two lines inside the compiled section, and suddenly the compiler realizes it can reuse something you don't care about.
That's going to make your code five or ten percent faster, which is always worth getting. And as always when you're dealing with GPU computing, keeping the data on the GPU as long as you can is very worthwhile; it brings a lot of benefits. So here are some libraries that are worth looking into, and you will find a lot more in the awesome-jax GitHub repository.
There is an mpi4jax library that introduces MPI primitives as JAX primitives that you can use in a jitted section, and this is going to use your CUDA-aware MPI if it is there, so that's very nice. There is also something to help you test: chex, which checks code and makes sure it runs similarly on CPU and GPU, in a jitted section or not.
What we did is we took TOAST, which is a fairly large application, about 200,000 lines of Python and C++ code, that's used to study the cosmic microwave background. It is made of several pipelines that are distributed with MPI; it's able to use the Perlmutter supercomputer at full capacity, and all of those pipelines are composed of C++ kernels that are parallelized. This code contains pretty much everything you can think of: there are random number generators.
There is some fast Fourier transform going on, linear algebra obviously, sparse matrices: all of those things are somewhere in there. What we did is we took two of those pipelines and ported all of their kernels, to answer two questions: is it doable, given all the restrictions shown before, and is it worth it, is it actually performant?
So we ported all of those kernels, first from C++ to NumPy and then to JAX, trying to keep the interface identical, to be able to use our unit tests to make sure that our port actually functions the way it should. The first thing we found is that we had a bunch of kernels with loops on irregular intervals, intervals whose sizes are dynamic, a function of the data, which is something I told you JAX does not like. Then we realized that with just some padding and masking, we could deal with it.
We can work with that: we introduced a type for it, following the ideas I've talked about, and our life was nice. Then, a bunch of our kernels mutate output parameters, which is something JAX also says you should not be doing. So we introduced a mutable JAX array, which is just a box around a JAX array: whenever you mutate it, it replaces the content with a new array. That's a thin abstraction, but the JIT compiler does a good job with it.
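A sketch of the padding-and-masking idea for data-dependent interval sizes (the kernel here is a made-up sum, not one of the TOAST kernels):

```python
import jax
import jax.numpy as jnp
from functools import partial

# Pad every interval to a fixed maximum size and mask the padded tail,
# so the jitted function only ever sees one static shape.
@partial(jax.jit, static_argnums=2)
def masked_sum(padded, length, max_len):
    mask = jnp.arange(max_len) < length   # True for the real entries only
    return jnp.sum(jnp.where(mask, padded, 0.0))

data = jnp.array([1.0, 2.0, 3.0, 0.0, 0.0])  # interval of length 3, padded to 5
print(float(masked_sum(data, 3, 5)))          # 6.0
```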
So we are happy about that. And we worked a lot on reducing data movement; that's something we're still working on, because we have a pipeline abstraction and there are ways to improve on that. Doing all of this, we got a seven-times reduction in lines of code: for all of the code that is used inside these pipelines, our code is now seven times shorter, going from the C++-plus-Python version to the JAX version. And on the plot, those bars are all of the kernels that have been ported.
Looking at them, you could tell me: okay, none of those bars is seven times shorter; some of them maybe five times, but nothing is seven times shorter. That's because a lot of the reduction was not in the actual kernels but in the utility functions. Since we're using JAX, which gives us access to NumPy, SciPy, and a bunch of other things, we don't have to write our own binding to the MKL, we don't have to hand-code things like how to normalize a vector, all of those things.
So that was a lot of reinventing the wheel that was cut out of the code, and overall the code reduction means that the code is much easier to keep in your head, which is a nice win. Was it worth it? Here we have a bunch of timings for various kernels. The first two lines are the time it took to move data to the GPU and back from the GPU, and the finalize step also does a little bit of cleanup.
That's why it's not at zero for OpenMP. We are comparing OpenMP C++ kernels that have been optimized, running on four threads, which for this particular problem is the fastest we can go (if you add more threads, bad things happen), against JAX running on one GPU. Those first kernels are slower in JAX, and that's because they spend their time moving their data to and from the GPU, but that's something we're working on, and it's probably going to be fixed in one week.
The other kernels are nicer, and here we have a times-16 speedup. In the first version of this slide we had a times-62 speedup on one kernel, the noise-weighting kernel, and that was such a good speedup that we went to look at the code thinking: okay, there is a problem somewhere, that's not normal. And we found the problem: we had a performance bug in our C++ code. We had a critical section inside the parallel loop that was making the code slower than the sequential version.
We fixed it, and now things are much better in the C++ code. Something that's interesting with JAX here: parallelism is transparent in JAX, you don't have to think about parallelism, meaning you cannot introduce performance bugs by making mistakes in the way you parallelize your code. That's really useful for people who are primarily domain experts: they know their science, in our case cosmology, but writing high-performance code is not their main field of study.
So this is a proof of concept: we have ported a bunch of kernels in two pipelines. It's a work in progress, but we found that it's doable, which is very nice, and that it's worth doing, because the code is easier to understand, shorter, and faster. What we could do to go further is to reduce data movement; that's ongoing work, and there is still a lot of data movement that is avoidable.
Also, a lot of our code complexity comes from the fact that we're trying to preserve the interface of the C++ kernels, but we could instead move to JAX-style pure functions, and then our code would be significantly simpler. And if we did all of that, we could redesign our pipelines to JIT all of their kernels into a single huge GPU kernel on the fly, and that, we expect, would bring lots of performance benefits.
To conclude, I would say JAX is a sweet spot on the design Pareto front for people who are doing research on complex numerical code, because it lets you write code that is easier to make correct and fast, and that very quickly, so you're being very productive. It lets you focus on the semantics of the code: it separates neatly between the semantics, the thing you're dealing with as a domain expert, and the optimization, which is left to the compiler, and with all of these restrictions the compiler can be as clever as it can be.
Also, it lets you have a single code base with both CPU and GPU code, and that's really nice, because you can test your code in your GitHub CI and then run it on GPU on Perlmutter, and it's going to be the same code behaving the same way, which is very practical. Also, the immutability limitations are actually very nice for correctness. What we found, after having ported the code, is bugs in the C++ part of the code that were not present in the JAX port, because it was immutable.
We were not allowed to do some of the things that we had done in the C++, and JAX makes it really, really easy to use numerical building blocks, because you're using the things you're used to using in Python, like NumPy and SciPy. So if you need to solve a linear system inside a loop inside a loop inside a loop inside your kernel, which is a real thing I did like two weeks ago, you can do it, it's going to be very easy, and that is very practical, very important.
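As a hedged sketch of how easy that is, here is a batch of small linear systems solved inside a vectorized "loop" (the matrices are illustrative):

```python
import jax
import jax.numpy as jnp

# Four 3x3 systems A[i] @ x[i] = b[i], solved in one vectorized call.
A = jnp.stack([jnp.eye(3) * (i + 1.0) for i in range(4)])
b = jnp.ones((4, 3))
x = jax.vmap(jnp.linalg.solve)(A, b)   # x[i] = 1 / (i + 1)
print(x.shape)                         # (4, 3)
```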