From YouTube: Scaling Python Applications
Description
Part of Data Day 2022, October 26-27, 2022
Please see https://www.nersc.gov/users/training/data-day/data-day-2022/ for the training agenda and presentation slides.
I'm going to talk about scaling Python applications. I'm from NERSC, where I work in the Data Analytics and Services Group, and one of the primary things I do is help our Python users use Python on the NERSC supercomputers, and in particular think about scaling challenges and using the GPUs on Perlmutter. When I started at NERSC, actually as a postdoc, I was working with DESI, porting their science code. Their science code processes data from a telescope in Arizona, and it's all implemented in Python.
The DESI pipeline code, which is all implemented in Python, scaled up to use the entire system of Perlmutter. I'll add a little asterisk there, because again there are a lot of caveats, but these slides just demonstrate that this can be done: you can move your code over to the GPUs and you can run at the entire scale of Perlmutter. So, to get started:
We're actually going to take a lot of steps back and look at a very simple problem, and use it as an example to think about parallelism in Python and consider different options. This is a very common example: using a Monte Carlo method to estimate the value of pi. On the left here we have a Python implementation of that function: we draw random samples as x and y positions, and if those x and y positions fall within the quarter circle, then we count them.
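The slide itself isn't reproduced in this transcript, so here is a minimal sketch of what that function might look like; the function and file names (estimate_pi, library.py) follow the talk's description, and the details are assumptions.

```python
# library.py: a sketch of the Monte Carlo pi estimator described above
# (illustrative; the talk's actual slide code is not reproduced here).
import random

def estimate_pi(num_samples):
    """Estimate pi by drawing points in the unit square and counting
    how many fall inside the quarter circle of radius 1."""
    count = 0
    for _ in range(num_samples):
        x = random.random()
        y = random.random()
        if x * x + y * y <= 1.0:
            count += 1
    # The quarter circle covers pi/4 of the unit square.
    return 4.0 * count / num_samples
```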
There are some terms in the next few slides that I'll probably use many times, so I just want to throw them out there. When I say a program, a program is a collection of instructions that a computer will execute. So we can think of that file there, library.py, as our program.
Here's a version of this code; this is just a single-threaded version, so there's no parallelism in here. We have our serial version, pi-serial.py, and we import our function from our library, say we want to generate 20 million samples, and run the pi estimation code. We also measure the time that it takes.
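A sketch of that driver script (the file name pi-serial.py comes from the talk; the rest is illustrative):

```python
# pi-serial.py: a sketch of the single-threaded driver described in the talk.
import time
from library import estimate_pi  # the function sketched above

num_samples = 20_000_000

start = time.time()
pi = estimate_pi(num_samples)
print(f"pi ~= {pi:.6f}, elapsed: {time.time() - start:.2f} s")
```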
If we run this simple example, it takes about three and a half seconds. What's happening here is that when we run python plus the file name, we start up the Python interpreter. The Python interpreter is the real program: it takes in our file, translates it into Python bytecode, and then those bytecode instructions get executed at runtime.
One thing to point out here is that because Python is interpreted at runtime, it's slower than compiled languages like C, C++, or Fortran. But it's a very popular language, and developers like it because they feel more productive, and it's easier to use than some of those compiled languages.
But on the bright side, people are still working on Python the language, and so Python 3.11 was released just a few days ago, and it's getting faster. I noticed in the release notes for the new Python version that they say it's about 10 to 60 percent faster than the previous version. I tested this on our simple code, and it is a lot faster; as my colleague Laurie Stephey likes to say, a free speedup is the best speedup.
Okay, so the first parallelism example that we want to look at is multi-threading. One issue with parallelism in Python is that multi-threading is not really helpful for compute-bound tasks like our simple pi estimation, and that's because of this thing called the global interpreter lock. We can't really get into the details of that here.
But this example just shows a case where we create multiple threads. Here we're creating four threads, and I give each of those threads a portion of the work to do: each thread gets a quarter of the number of samples to generate. Then we start our benchmark, we say start = time.time(), and the threads actually launch when that start method is called on each thread, while the main thread keeps going as it launches each of the other threads.
The main thread won't wait for those threads to finish until you call the join method on them. And we notice when we run this program that it's actually slightly slower than the completely serial version, and that's again because of the global interpreter lock. So it doesn't help us; multi-threading is typically not going to help you in Python.
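A minimal sketch of that threaded version (illustrative, reusing the assumed estimate_pi from above):

```python
# pi-threads.py: a sketch of the multi-threaded version from the talk.
# Because of the GIL, this is no faster than the serial version for a
# compute-bound function like estimate_pi.
import time
import threading
from library import estimate_pi

num_samples = 20_000_000
num_threads = 4

# Each thread gets a quarter of the samples to generate.
threads = [threading.Thread(target=estimate_pi,
                            args=(num_samples // num_threads,))
           for _ in range(num_threads)]

start = time.time()
for t in threads:
    t.start()   # launch; the main thread keeps going
for t in threads:
    t.join()    # wait here for each thread to finish
print(f"elapsed: {time.time() - start:.2f} s")
```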
There are some cases, though, where it does help. If you have non-compute-bound things, things like I/O, if you're waiting on file system I/O operations or something like that, multi-threading can help. This is just a quick example showing a case where multi-threading can actually help, but most people aren't just calling sleep in their scientific data processing code.
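A sketch of that I/O-flavored case (illustrative), where the threads spend their time waiting rather than computing, so they genuinely overlap:

```python
# Four threads each "wait" for one second; because sleeping releases the
# GIL, the total elapsed time is about 1 second, not 4.
import time
import threading

threads = [threading.Thread(target=time.sleep, args=(1,)) for _ in range(4)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"elapsed: {time.time() - start:.2f} s")  # ~1 s
```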
For things like web servers, there are plenty of valid use cases out there in the wild where multi-threading is helpful. And looking further, beyond the current release of Python, I noticed that one of the goals for the next version of Python is actually developing some work around this multi-threaded parallelism.
In multiprocessing, we bypass the GIL, the global interpreter lock, by spawning separate processes, and those processes can run in parallel and make progress. So here I have a simple example, again a version of our program, where we start up four new processes using the multiprocessing Pool and then again pass each of them a quarter of the work.
And now we do see a good speedup here: not quite a factor of four, but close to a factor of four. I also just wanted to highlight that the way those processes start up can vary. I'm demonstrating it using the spawn start method, which is not the default method on Linux systems, because it's a little more composable with MPI on HPC systems.
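A sketch of that multiprocessing version (illustrative; the explicit spawn context and the quarter-of-the-work split follow the talk):

```python
# pi-multiprocessing.py: a sketch of the multiprocessing version.
import time
import multiprocessing as mp
from library import estimate_pi

if __name__ == "__main__":
    num_samples = 20_000_000
    num_procs = 4

    # "spawn" is not the Linux default ("fork"), but composes better
    # with MPI on HPC systems.
    ctx = mp.get_context("spawn")

    start = time.time()
    with ctx.Pool(num_procs) as pool:
        # Each process estimates pi from a quarter of the samples.
        parts = pool.map(estimate_pi, [num_samples // num_procs] * num_procs)
    print(f"pi ~= {sum(parts) / num_procs:.6f}")
    print(f"elapsed: {time.time() - start:.2f} s")
```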
Speaking of MPI: MPI stands for the Message Passing Interface, and it's really just a standard which defines a set of library functions that facilitate inter-process communication. One way I was kind of cheating in those last two examples is that I didn't really collect the results from each of those separate processes or threads and try to combine them; I did something very simple,
just telling each thread or process how much work to do. I didn't pass a lot of data around. MPI gives the user a common set of functions that they can use for sharing data between processes, and in Python we can use mpi4py, which builds on top of that specification and provides an interface where you can pass picklable Python objects and things like NumPy arrays to those collective and point-to-point communication functions.
So here's an example using mpi4py in Python. One thing that's different about this: if you notice our execution command here, we have srun -n 4 python.
Now we have something external to the Python interpreter that we use to launch our program. That launcher launches four processes, and those processes sync up during the MPI initialization, once each of them is executing. Here that happens at the line from mpi4py import MPI: all of those processes sync up and figure out how they're going to communicate with each other.
So now you have this communicator object which you can use. Here again we're not doing very much, it's a pretty simple example, but those comm.Barrier() calls are making sure each of those processes is in sync before they move on to the next instructions in their process.
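A sketch of that mpi4py version (illustrative; launched with something like srun -n 4 python pi-mpi.py, and the final reduce is my addition to show the communication functions mentioned above):

```python
# pi-mpi.py: a sketch of the mpi4py version. MPI initialization happens
# implicitly on "from mpi4py import MPI".
import time
from mpi4py import MPI
from library import estimate_pi

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

num_samples = 20_000_000

comm.Barrier()                    # all ranks in sync before timing starts
start = time.time()
local = estimate_pi(num_samples // size)
comm.Barrier()                    # all ranks done before timing stops
elapsed = time.time() - start

# This time, actually combine the per-rank estimates on rank 0.
total = comm.reduce(local, op=MPI.SUM, root=0)
if rank == 0:
    print(f"pi ~= {total / size:.6f}, elapsed: {elapsed:.2f} s")
```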
Another very popular parallelism framework for Python is Dask. Dask is a very popular tool in the Python community, and I won't go into a lot of details here; I just wanted to share it because it's popular. Another nice thing about it, which I didn't mention about MPI: MPI gives you a way of scaling out not just within the server or the node, but beyond,
using multi-node parallelism, and Dask also gives you a good way to scale out to multiple nodes as well. There's also a lot of documentation with examples: there are a lot of different ways to use Dask, and the documentation is pretty good, with a lot of examples and tips for performance.
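The talk doesn't show a Dask slide here, but as a minimal hedged sketch, the same pi estimate could be phrased with dask.delayed, where swapping the scheduler (or pointing at a dask.distributed cluster) changes where the tasks run:

```python
# A minimal Dask sketch (not from the talk's slides): four delayed tasks,
# each estimating pi from a quarter of the samples.
import dask
from library import estimate_pi

num_samples = 20_000_000
num_tasks = 4

tasks = [dask.delayed(estimate_pi)(num_samples // num_tasks)
         for _ in range(num_tasks)]
# scheduler="processes" runs locally; a dask.distributed Client would
# scale the same task graph out across multiple nodes.
parts = dask.compute(*tasks, scheduler="processes")
print(f"pi ~= {sum(parts) / num_tasks:.6f}")
```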
Another thing you should strongly consider, and you probably already are if you're doing science with Python, is array programming. I wanted to call it out because it really is the foundation of a lot of scientific data processing with Python. I'll just show the next example here, where I'm demonstrating what array programming is.
Here's a version of our simple example using array programming. I've redefined our estimate-pi function here, but now, instead of using a for loop, we're creating an array of random uniform samples, and then, instead of looping through each of those samples, we perform array operations on them using the NumPy API. This gives us a way to bypass that global interpreter lock, because NumPy is built on top of C and Fortran libraries.
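A sketch of that array-programming version (illustrative):

```python
# A vectorized estimate_pi: the loop over samples becomes whole-array
# operations that execute in NumPy's compiled C/Fortran code.
import numpy as np

def estimate_pi(num_samples):
    x = np.random.uniform(size=num_samples)
    y = np.random.uniform(size=num_samples)
    # One vectorized comparison counts points inside the quarter circle.
    count = np.count_nonzero(x * x + y * y <= 1.0)
    return 4.0 * count / num_samples
```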
When you call these vectorized array operations, you're actually calling into code that's implemented in a lower-level language like C or C++, and so you get C-like or Fortran-like performance when you use NumPy. This is incredibly useful, and you can see this version is actually faster than all of the previous examples.
But this also adds a little bit of a challenge, because now we actually have multiple levels of parallelism to think about. Here's a different example using NumPy, where we're creating a thousand-by-a-thousand random matrix, turning it into a symmetric positive-definite matrix, and then I just want to benchmark
this eigenvalue decomposition function in the NumPy API. At the bottom here I'm just using a Python module that helps with some benchmarking; it's not that big of a deal, but it takes about half a second to perform this eigenvalue decomposition on this thousand-by-a-thousand matrix. The NumPy code is calling into this lower-level LAPACK back end, which is typically something like OpenBLAS or Intel's MKL, and those libraries are multi-threaded.
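A sketch of that benchmark (illustrative; using the standard-library timeit module in place of whatever benchmarking helper the slide used):

```python
import timeit
import numpy as np

n = 1000
a = np.random.rand(n, n)
# Make the random matrix symmetric positive definite.
spd = a @ a.T + n * np.eye(n)

# np.linalg.eigh dispatches to the multi-threaded LAPACK back end
# (e.g. OpenBLAS or Intel MKL).
t = timeit.timeit(lambda: np.linalg.eigh(spd), number=5) / 5
print(f"eigh on a {n}x{n} matrix: {t:.3f} s per call")
```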
But the OpenMP runtime that's used by the back end in NumPy chooses at runtime how many threads it should use. So this is something to think about when you're composing parallelism in Python: how many resources you're using, and how composable those different layers of parallelism are.
Here I highlight in orange that the optimal number of threads for that runtime library to use for this operation is not the default value. What I'm doing here is a scaling study where I explicitly limit how many threads the OpenMP runtime should use and then run the benchmark to measure the performance. This lets me build an intuition for the optimal number of threads I should specify for that piece of code. And I'll just point out that, by default,
the OpenMP runtime will typically choose one thread per core, so on a Perlmutter CPU node, for example, that would be 128 threads, and that gives us a value close to what I showed on the previous slide.
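One way to run that kind of scaling study (a hedged sketch: the talk likely limited threads via an environment variable such as OMP_NUM_THREADS, while this version uses the threadpoolctl package to apply the limit from inside Python):

```python
# Thread-scaling study for the eigh benchmark (illustrative).
import timeit
import numpy as np
from threadpoolctl import threadpool_limits

n = 1000
a = np.random.rand(n, n)
spd = a @ a.T + n * np.eye(n)

for nthreads in (1, 2, 4, 8, 16, 32, 64, 128):
    # Cap the number of threads the BLAS/LAPACK OpenMP runtime may use.
    with threadpool_limits(limits=nthreads):
        t = timeit.timeit(lambda: np.linalg.eigh(spd), number=3) / 3
    print(f"{nthreads:4d} threads: {t:.3f} s per call")
```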
That type of scaling measurement is a powerful tool for understanding performance. We started off with a single-threaded example, and then I showed a couple of different ways of doing some sort of parallelism, multi-threading or multiprocessing, but I showed everything with an example using four threads or four processes.
Any time you want to understand the performance of your code, it's good to do what we refer to as a scaling analysis, where you vary the number of processes or threads and look at the behavior. In this case I'm showing an example from code that I've worked on moving over to the GPU, and the blue line
shows the original CPU implementation. I measure the runtime of the whole program as I increase the number of tasks, and it goes down almost perfectly for a little bit, but then it starts to flatten out. That's typically what we see: as you increase the number of processes, there's usually some overhead or communication or something that slows down performance, so you don't keep getting these perfect speedups.
I think this is a really powerful tool while you're developing or moving things over to the GPU or something like that, because when you make changes to your code, you might change the performance at different scales of parallelism. This is just an example illustrating that. And here's another example, where I'm capturing not just the total execution time of the program but also measuring the specific steps that make up that program, and I just want to highlight
one thing here: the black Total line. It goes down for a little bit, bottoms out around 32 or 64 tasks, and then at 128 it starts to tick back up. And you can see there's one line that's really increasing throughout that whole time, and that's the import section. So here's an example where the import step in Python, where you're just importing all the libraries you're going to need at the top of your program, takes longer and longer as you add more and more tasks.
Okay, so I'm going to switch gears a little bit, because I want to cover using GPUs as well. I mentioned GPUs, and I'll also point out that just out of the box, you can't use the GPUs with NumPy or SciPy; they're not set up to do any sort of computation on the GPU.
Some libraries give you a drop-in replacement for NumPy or SciPy, or for pandas or scikit-learn. Something called CuPy gives you the NumPy API but lets you use arrays that are stored in GPU memory, and RAPIDS provides things like pandas and scikit-learn but runs that stuff on the GPU. There are also the machine learning libraries, and a lot of those do more than just machine learning: they also provide array-like APIs and support general GPU computing.
That's things like PyTorch, TensorFlow, and JAX, and we have some talks later this afternoon about those. And if you want to get into more lower-level GPU programming, there are a lot of options too: Numba, PyOpenCL, PyCUDA, and CUDA Python also give you ways to dig a little deeper.
If you want to use more than one GPU, all four of those on a node, it's a similar challenge to scaling out to multiple nodes in Python: you typically need to combine some distributed-memory parallelism with your GPU library of choice. In the work that I've done, I've used MPI plus CuPy, for example, for multi-GPU, multi-node programming.
You can also achieve something like that with Dask, and even with multiprocessing, but with multiprocessing it's a little bit more work. And then there are other options that are maybe a little more experimental, like cuNumeric, but it could be something to look at and keep in mind for the future.
I'll also just point out here at the bottom that I'm demonstrating that there are so many of these frameworks that it's almost a little messy trying to compose them or mix and match. But there is a recognition of this issue, and the community is trying to standardize around a common Python array API.
Just to give you a little example of that: you can combine writing a low-level CUDA kernel using numba.cuda, that's on the left, with using the CuPy API on the right to create an array on the GPU device.
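A sketch of that interoperability (illustrative): CuPy arrays expose the CUDA array interface, so a Numba CUDA kernel can operate on them directly.

```python
# A Numba CUDA kernel applied to an array allocated with CuPy.
import cupy as cp
from numba import cuda

@cuda.jit
def double(x):
    i = cuda.grid(1)            # global thread index
    if i < x.size:
        x[i] *= 2.0

a = cp.ones(1_000_000)          # array lives in GPU memory
threads_per_block = 128
blocks = (a.size + threads_per_block - 1) // threads_per_block
double[blocks, threads_per_block](a)   # works via __cuda_array_interface__
print(a[:4])                    # -> [2. 2. 2. 2.]
```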
Then, when you're thinking about which code you should move over to the GPU: right now, I imagine you already have a code or application that's running on the CPU. You don't want to just move the whole thing over to the GPU in one go. You want to understand where the performance bottlenecks of your current application are, and then figure out if you can get a speedup or some performance benefit by moving those over to the GPU.
You don't want to move everything over, because there's an overhead to launching GPU kernels. Here on the right I have an example where we're just doing a dot product, really a matrix multiplication of two-dimensional arrays. I have this xp.random because I'm using either the NumPy API or the CuPy API. In blue, you can see that for small matrix sizes NumPy is very fast;
it's a very fast operation until about a size of 20 or 30 or so, and then it jumps up to taking about a millisecond or beyond as we keep increasing the scale. If you compare that to CuPy, you can see that its performance is relatively flat, and it stays flat for a lot longer than the NumPy version. So beyond a matrix size of about 20 or 30 or whatever it is, it's actually beneficial to do this operation on the GPU.
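A sketch of that xp pattern (illustrative): binding either module to the name xp lets the same code run on the CPU or the GPU.

```python
# The same matrix multiplication on CPU (NumPy) or GPU (CuPy).
import numpy
import cupy

for xp in (numpy, cupy):
    a = xp.random.rand(1000, 1000)
    b = xp.random.rand(1000, 1000)
    c = xp.dot(a, b)   # CPU BLAS call or GPU kernel, depending on xp
    # Note: timing the GPU version fairly would also require a device
    # synchronization, e.g. cupy.cuda.Device().synchronize().
    print(xp.__name__, c.shape)
```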
So if your algorithm or your code is working with large arrays or matrices or images, there's a good chance that you could see a speedup on the GPU. And another thing: if your application is already I/O bound, if that's already the bottleneck, then moving to the GPU is not going to fix your I/O issue.
And then just a thing to keep in mind: CPUs are great, they're super fast for doing a few things in parallel. GPUs are maybe a bit slower for a single thread, but they have so many threads that you get higher throughput with the GPUs.
Just to wrap up: I think one of the most powerful things you can do to improve the performance of your code is to really learn and become an expert at array programming with NumPy, and eliminate as many for loops in your program as possible using the vectorization, broadcasting, and indexing features of NumPy. I've helped a number of users
with, you know, how do we move stuff over to the GPU, and I've seen so many times that even just removing Python for loops and using the NumPy API is already a huge benefit. And then with things like CuPy, which lets you use the same exact API but now on the GPU, you get almost all of that work of moving over to the GPU for free. Another thing to keep in mind when you start running at scale
is that Python is a file-system-intensive language, and we see this a lot: as you increase your process count and the number of nodes, that file system startup becomes an issue.
For this weak-scaling study, where I ran the DESI pipeline all the way up to 1500 nodes on Perlmutter, you can see that the performance was pretty flat all the way out to about 300-ish nodes, and then it starts to pick up. When I ran this, I was running 32 tasks per node, so this is starting up tens of thousands of processes at the same time, and I did not actually use a container for this.
At the time, I was able to get pretty far without having to do that, but that tick-up at the end really is just due to the startup time. Beyond 300-ish nodes, the Python startup is taking, you know, 15 minutes just for the application to start.