From YouTube: 7. Introduction to Dask + GPUs
Description
From the NERSC NVIDIA RAPIDS Workshop on April 14, 2020. Please see https://www.nersc.gov/users/training/events/rapids-hackathon/ for all course materials.
So for those who have done the notebooks in advance, this will look fairly familiar, but there will be one difference: as you can see right at the top here, I'm not running this on the Cori system. I'm going to be running this on one of our internal servers at NVIDIA, and as a result I'm also going to show and highlight how this works not just with a single GPU, but with multiple GPUs, so you can get a visual sense of that.
That should help if you did the notebooks but don't currently have access to multiple GPUs. So anyway, diving in: as Vibhu mentioned, Dask is a flexible library for parallel computing, and it makes scaling out easy. We've put a lot of work into supporting the Dask community and contributing to Dask development for GPUs, in particular support for cuDF and enhanced support for CuPy arrays.
There are a couple of things I want to preface this with. Dask is great: it can scale up and scale out to many machines, but Dask does introduce a small amount of overhead. Any distributed computing framework will introduce overhead if your workload fits on a single machine; that's just the nature of distributed computing. It has to have overhead, and Dask is really efficient about it.
So it's great, but if your workflow is fast enough on a single GPU, or your data comfortably fits in memory on a single GPU, you wouldn't want to use Dask unless you expected it to scale; you would want to just stay with the single-machine libraries, cuDF or CuPy. That applies to both CPUs and GPUs, and the same applies to using pandas and NumPy. With that said, there's a little bit of a benefit that doesn't come through on the GPU in the same way: when you use Dask with pandas or NumPy on your laptop, you get to use all of your cores, something that might not already be happening, and that particular benefit doesn't carry over to the GPU in the same way.
I'm going to create a Dask cluster with a couple of commands. There are some things here that are Dask-CUDA specific; Dask-CUDA is the set of add-ons that allow Dask to work well with GPUs, which we've been working on upstreaming, and some of them have been upstreamed. So we're going to import this LocalCUDACluster, which Vibhu alluded to in the presentation. He mentioned how the general pattern for Dask is that you create a cluster and you scale it up, and that is the general pattern. In this case, though, we don't need to scale it up, because it's just going to use the entire local machine unless we tell it to only use GPU zero. We're also going to use a Client from distributed; this client is how we connect to and interact with our cluster's scheduler. So I'm going to fire this up right now. I'm going to use a single GPU, and note that I'm also setting a memory limit here. This is a memory limit for the GPU, which means Dask is going to target keeping memory below 4 gigabytes on this GPU.
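A minimal sketch of what that cell might look like, assuming Dask-CUDA is installed (the exact arguments in the course notebook may differ):

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# One worker pinned to GPU 0, with a soft 4 GB device-memory target:
# going past it causes spilling rather than failure.
cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES="0",
    device_memory_limit="4GB",
)
client = Client(cluster)  # connect to the cluster's scheduler
```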
Now, if work is happening and it goes above 4 gigabytes, that work won't be eliminated. What will happen is that data will be spilled to host (CPU) memory, or other work will be spilled to make room for it, and then it'll be brought back onto the GPU as appropriate. So you can see here that I have a client. In the notebooks, this configuration on yours has not been commented out; it's there to help you access the dashboard.
This is the status page, and by default it's showing me that a little bit of activity is happening. In this case I don't have much going on on the GPUs, but it could show that I had memory already allocated if I had been doing other work, or if someone else had been using the machine; it would tell me how much memory is being allocated on the GPUs.
In this case the cluster is running and there's one worker attached to it. As Vibhu mentioned, we operate with a one-worker-per-GPU model, and I've assigned it to use one GPU, so I have one worker. We can see that in the Workers tab, where I have one worker, and this gives me metrics about CPU utilization, what's going on, and all sorts of things like that. I can also look at a task stream, and not just the running task stream; this is going to be a live version.
So we'll create some random data. In this case we're going to create a distributed array using the Dask array RandomState. We're going to do this on the GPU, so we can create a random GPU array by using CuPy here. If we wanted a CPU array, we could call it with NumPy (I'd have to import NumPy, but we could do the same thing). You'll notice here that with this generator I can create a fairly large array of 100,000 by 1,000, and I'm choosing the chunk size.
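A sketch of that cell; the key trick is handing Dask's RandomState the cupy.random.RandomState class so every chunk is generated as a CuPy array on the GPU:

```python
import cupy
import dask.array as da

# Swap in numpy.random.RandomState instead to get a CPU-backed array.
rs = da.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.random_sample(size=(100_000, 1_000), chunks=(1_000, 1_000))
```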
It takes a call to persist for us to actually execute this, just like Apache Spark: it's lazy execution, which is very common in parallel processing, because it lets us do optimizations once we know what the full task graph is. So let's run this.
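The cell itself is essentially one line, continuing the sketch above:

```python
x = x.persist()  # kick off execution of the 100 chunk tasks on the cluster
```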
This is going to do what we just said: it's just going to create some random data. Now let's take a look. So what happened? Our GPU ran the code, and it's already finished, so we don't get to see it really screaming.
It took, you know, half a second, and you can see it; I can zoom in if it's not big enough. So let me zoom in. This is the random_sample task: there were 100 tasks, all of them succeeded, they're all done. And now a couple of things have happened: we've got a task history, and we should have a profile, which tells us where time is being spent. So we can see that, as expected, we spent our time doing random_sample; no surprises there.
We had a hundred tasks because this 100,000 by 1,000 array, in chunks of 1,000 by 1,000, is naturally going to have 100 chunks, and we can see that right here. Dask provides a really nice string representation that actually uses HTML in notebook cells to present this, and you can see that it gives you the shape of the array. This is a tall and skinny array, and the array is 800 megabytes.
Now, that visual representation is great, but let's do some work. Let's actually schedule some work with the same operation Vibhu mentioned in the slides: singular value decomposition, which is a matrix decomposition. That ran instantly because it didn't actually do the compute. What we've done is schedule work, and we can see this very explicitly: notice that we now have 708 tasks to do on these objects, whereas before we only had a hundred tasks.
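Scheduling the SVD is a single lazy call, continuing the sketch above:

```python
# Lazy: adds the SVD tasks to the graph but does no GPU work yet.
u, s, v = da.linalg.svd(x)
```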
Dask maintains a task per chunk or per partition, and this makes sense: it has to organize and orchestrate them. So this actually added 608 tasks to our graph, but we haven't computed anything yet. So let's do something: we'll call persist to run all of these at once, and then we'll use this wait command.
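A sketch of the persist-and-wait pattern, assuming the names above:

```python
import dask
from dask.distributed import wait

# Persist all three results in one call so shared intermediates are
# computed once, then block until the asynchronous work has finished.
u, s, v = dask.persist(u, s, v)
_ = wait([u, s, v])
```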
Now this is optional, but sometimes we want to wait for the results of all of these asynchronous operations before we go forward, so we call wait. It's not actually necessary; it's a nice convenience function, at least at this point. Sometimes it's important for workflows, but in this case it's more of a convenience. So I'm going to launch this and go back to the scheduler dashboard page. A lot of stuff is happening; a lot of stuff just happened. Let's take a look.
There were a bunch of different operations that we had to do in order to make this singular value decomposition actually happen. We called SVD, we called dot products, and we had a QR decomposition. Why do we have a QR decomposition? Well, it turns out that a distributed algorithm for SVD does rely on QR decompositions; that goes into the internals of Dask, but we can see each of these tasks, and we can see in our task stream that they all took different amounts of time.
And so on and so forth. I can then just reset with a single click, which is very convenient, and the profile, of course, was updated too. Now time has been spent in multiple places: the time we spent on the random sampling is now much less than the time we spent doing things like the QR decomposition from the linear algebra library, or the actual SVD array function, and so on and so forth. So that's sort of how Dask works. But now the results are still distributed.
We can't really look at these results directly. If I look at u, it's a distributed array; there are no more tasks waiting to complete, and we're back to the 100 tasks for the 100 chunks, but I can't see anything. Well, I can call compute right now to make this array essentially one partition and just grab the underlying CuPy array. In this case I'm just going to slice in and grab a 5x5 slice, and there it is. This is my actual CuPy array.
That is these 25 values, and you can see that if I call type(), it's actually my CuPy array. Dask manages this, and it knows what to do to give me that result. This tiny compute call right here was my getitem call: I was accessing those elements from the distributed array. And that's really all there is to it.
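A sketch of pulling a small slice back as a concrete CuPy array, continuing from the persisted u above:

```python
u_small = u[:5, :5].compute()  # getitem on the distributed array, then compute
type(u_small)                  # cupy.ndarray: the underlying GPU data
```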
We can do the same thing now, not with arrays but with dataframes, and we'll go through a couple of other dataframe examples showing some fairly complex operations. This is going to generate some random dataframe data, and you'll see that we're calling functions that are executed later: this right here didn't actually do any compute until I called head.
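The notebook generates this data for you; a hypothetical stand-in with the shape described below (two and a half million rows, 30 partitions) might look like:

```python
import cupy
import cudf
import dask_cudf

# Hypothetical stand-in for the notebook's generated data:
# an integer key column and a float value column.
n = 2_500_000
gdf = cudf.DataFrame({
    "id": cupy.random.randint(0, 1_000, size=n),
    "x": cupy.random.normal(size=n),
})
ddf = dask_cudf.from_cudf(gdf, npartitions=30)
ddf.head()  # eager: triggers computation of the first rows
```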
Why did that happen? Getting the beginning, in this case the first few rows of a distributed dataframe, is not a lazy operation. When I call head on this object, it becomes eager, and in order to actually give me these values it had to do computation. That's why we saw this computation actually happen: this purple bar right here, where I called head, actually triggered the data generation itself. So hopefully this is beginning to make sense. Most operations are lazy, but we can explicitly force computation with persist, and we can force computation by calling head to inspect things. It's nice that we can do things lazily, because it lets us optimize.
To make a more complicated example, we can take this dataframe, which we can see how long it is. It's actually a fairly large dataframe, and we can call len on it, and we get a bunch of length computations: there's one task for each of the partitions. In this case there were 60 partitions, which we can see right here; actually, sorry, there were 30 partitions, which we can also get right here.
With 30 partitions, when we call len we get 30 tasks, and we see that there are two and a half million rows. So let's do a groupby. This is a fairly large groupby, and notice, most importantly, that it's the same API as we saw in the cuDF notebooks; these APIs don't change at all. Also notice that I'm calling head here, which is going to force the computation to actually happen, so we'll watch the dashboard while it's happening. You see that we're doing aggregations across different chunks; we're doing lots of operations.
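A hedged sketch of that kind of groupby, using the hypothetical columns from the stand-in above (the notebook's column names and aggregations may differ):

```python
# Multi-aggregation groupby on the GPU; the same API as pandas/cuDF.
res = ddf.groupby("id").agg({"x": ["mean", "std", "count"]})
res.head()  # forces the computation and pulls back the first few rows
```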
Okay, maybe I have to zoom in a little more. Okay. So this is an example of how using the profiler can be really valuable. We saw the tasks happen, but the tasks have gone away, and now our profile is a little complex. If we want to understand how long we're spending in this workflow, where the groupby time is spent, we can dig into these profiles and see.
Oh wait, okay: I called groupby, and that aggregation took 1.56 seconds to do all the steps it needed to do to make that happen. There we go, there's our answer. And of course we had our results fairly quickly because this is on a GPU: we can do hierarchical, multi-column groupbys with multiple aggregations on millions of rows in about a second, which is great. But every time we run this, we're creating the data, right?
Well, in theory it runs quite a bit faster. I didn't measure the aggregation on its own, but it would have been faster, because it didn't have to wait for the data generation. Of course, the results are going to be the same, and it's the same API as cuDF but working across multiple GPUs. We're doing it with one GPU here, but in a second I'll show it with a lot of GPUs. Before we do that, I want to highlight another example of functionality that's fairly complex but is supported as well: rolling windows.
We saw an example of that very briefly in the cuDF notebook with the user-defined functions. We can do rolling window operations in Dask as well, and just like before, it's a lazy operation: if I don't call head here, it's not actually going to execute. But I will call head so we can see it, and in this case it's incredibly fast, because this is a very efficient operation.
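A sketch of a rolling window on the same hypothetical dataframe:

```python
# Lazy rolling mean over a 3-row window; nothing runs until head() is called.
rolled = ddf["x"].rolling(3).mean()
rolled.head()
```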
Question: so, no, Dask will not do that; it's a design choice. In this case I've already persisted this ddf, which is why this was particularly fast, but Dask will not cache that computation if I'm doing it in separate cells. If I did this all within one graph in a Python script, Dask would actually not recompute the calculation, but if I'm explicitly calling persist and compute and things like that, it's not going to cache the results in between. That's a design choice.
Question: because we're using Dask-cuDF (sorry, because we're using Dask-CUDA and cuDF) and we call persist, we're explicitly putting data in GPU memory. And actually, that question is a great segue, because the next thing I'm going to do is show that we've used up GPU memory. This dataframe, this ddf, as well as the arrays we created above, are using this much data in GPU memory; I'd say about six gigabytes.
That was my estimate from before; perhaps I have another process running that has about a gigabyte in use. Roughly speaking, when we run operations with cuDF, everything is being persisted into GPU memory by default. In order to use CPU memory, we have to explicitly spill to CPU memory, which is something we enabled in the beginning, and is actually what I'm going to show right now.
Great question. I chose 4 gigabytes arbitrarily, and there are a couple of things in play here. What Dask is using to assess when it should spill is multifaceted. At the high level, it's not using the total available GPU memory to decide "oh, I only have twenty-seven gigabytes left, therefore I should spill"; it's instead using the size of the objects that it has visibility into in memory. So we've done a bunch of compute here, and Dask has visibility into the things we're doing.
Data that's not in memory from this workflow but is on the GPU is not visible to Dask, so it's not using that information when thinking about spilling. That's the high level. A little more nitty-gritty: Dask is going to schedule compute in a way that it thinks it can execute efficiently, and so Dask will go over the memory limit that we set as long as it thinks it would not be efficient to spill.
So right now, the next thing I was going to show is about spilling. You can see in this case there are about seven gigs in use on the GPU, and I think we've used about six of them, so we should start spilling if we do more operations that are very compute-intensive, such as this one right here. Notice that this operation is very similar to the one before, except it's bigger.
Instead of creating a 100,000 by 1,000 array, we're going to create a 500,000 by 1,000 array with larger chunks. Again, right here we've not actually created the array; we've just made the tasks to create the array. When we run this, we'll see a couple of things: I'm going to switch to the dashboard, and you should see a few examples of things that look like disk reads or disk writes.
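A sketch of the larger job, reusing rs and the imports from the sketches above; the chunk size here is illustrative. At roughly 4 GB of float64 data plus intermediates, it pushes past our 4 GB device-memory target:

```python
# 500,000 x 1,000 float64 is ~4 GB, so working memory exceeds the soft
# target and Dask-CUDA spills chunks to host memory (or disk) and back.
x_big = rs.random_sample(size=(500_000, 1_000), chunks=(10_000, 1_000))
u, s, v = dask.persist(*da.linalg.svd(x_big))
```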
Those disk reads and writes are going to be examples of when we're spilling, so I'm going to run this right now and go back to the dashboard. This is a more complex workflow, but you'll notice that suddenly there are these yellowish, kind of goldish-tan bars; I'll go with gold. These gold bars are our examples of when we're spilling. We were scheduling work to be executed that was too intensive for our soft target of four gigabytes, and so the scheduler said:
"Okay, I need to do this computation, so I have to temporarily spill some of the memory that I'm holding into CPU memory, or perhaps to disk, depending on how we configure it." In this case it spilled to disk, and then, once it's done, it's going to read that back. So you see right here there are a bunch of different things happening. I'm just going to quickly zoom in on a portion of it: you notice that, okay, I'm doing a dot product, and so I need to do a write step.
Now I'm going to use a way larger value, so you can see a significant difference in compute time compared with one GPU. You could have done all of this on a single machine or a single GPU and it would have been fine, but now, instead, I'm going to use all the GPUs and demonstrate that we can do a very complex calculation. You can see that I actually have 16 GPUs in this machine; this is a DGX-2.
It's got 16 32-gigabyte GPUs that are connected with NVLink, and all of them are actually connected through an NVSwitch. I'm going to create a cluster using all of them. I'm not going to use a memory limit, I'm just going to ignore that, and I'm going to set a scratch space directory, which is just good practice. This is not necessary, but I'm going to do it to be nice to this machine: I don't want scratch space to be on a shared file system.
So I'm going to go to the Workers tab, and you'll see that it's going to spin up 16 workers; we'll eventually see the number 16 and all the different workers coming up. I'm going to skip this right here and just go all the way back down to the very bottom, where we do this large SVD again. Actually, I have to import the libraries; actually, I think we're good. I'll skip down to this once it's ready, and you'll see that we've got 16 GPUs here.
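A sketch of the large SVD; the shape is an inference, since 500,000 by 10,000 in float64 is the 40 gigabytes mentioned below:

```python
import cupy
import dask
import dask.array as da

rs = da.random.RandomState(RandomState=cupy.random.RandomState)

# ~40 GB of float64: too big for one GPU, comfortable across 16.
x = rs.random_sample(size=(500_000, 10_000), chunks=(10_000, 10_000))
u, s, v = dask.persist(*da.linalg.svd(x))
```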
You'll see here this is a large array. It's not enormous, since I have a lot of GPUs, but it's still 40 gigabytes. That's too much data for a single GPU right now, unless you're on one of the very, very large Quadro GPUs, and these are big chunks. So let's see what happens if I actually run this; we'll get a sense of what's really going on here. You can see that we've got a lot more tasks.
We have 700 tasks for this operation. Dask is scheduling these tasks, then it's going to run them, and it's going to run them across the 16 GPUs. Actually, this one is too fast to watch closely, but you can see it did this whole thing in about 10 to 15 seconds. Notice that there were transfers at the end. This is exactly what Vibhu was mentioning when he was showing that example of the Unified Communication X protocol, UCX.
With UCX, these transfers can be much faster. But you can see that we were able to use all of the GPUs to calculate a very, very large matrix decomposition in about 10 to 15 seconds, which is great, and if we wanted the results, we could just grab them right there. The same code runs on one GPU or on 16 GPUs, and the same code goes multi-node
if we had multiple of these machines. That's the power of Dask. So hopefully this has been a good introduction for those of you who had a chance to go through the notebooks, and hopefully it was nice to see it running on multiple GPUs. There's also a lot of value in the profile; again, it's out of scope and we don't have enough time to go in depth on the profile here, but in general, the profile is the first place to look when you're looking at workflows with Dask and trying to understand where time is being spent.
Where should I spend my time optimizing things? Where am I perhaps doing things inefficiently? In this case, if all of us developers came together and said that the most important thing we can do in the next Dask release is to make our SVD computation better and more efficient, and that was our goal, the first thing we'd want to do is understand where time is being spent. And the SVD computation is all happening right here.
I hope you can read this; I apologize if not. It's happening in these dot products and these wrapped QR decompositions, and it looks like about 20% of the total time in this workflow was spent in one and 38 percent in the other. The rest was on data generation; let's not worry about that. It's pretty clear that we spent twice as much time doing the QR decomposition
as anything else in the workflow. And it turns out that a lot of the time we spent in the QR decomposition was not just the actual QR operation; it was spent doing other things, perhaps serializing and deserializing data and things like that. These are things we can look at in different parts of the profile, in particular in the administrative profile, which I think is more of an advanced usage, so I'm not going to go into it now, but I'm happy to take questions and talk about that later.
One last thing: Dask provides the task graph, and this was the task graph. Notice that, as Vibhu mentioned earlier, when a task is released from memory it goes blue, and when it's held in memory it's red. In this case we only have that final result in memory, because we finished our tasks. But if I were to run this again, you could actually see this update live, and we'd see how things are going. So this one is being held, presumably because this is the stage where we're communicating results, and okay, now it's finished. But that's the concept.
Question: actually, let me add that I shouldn't have gotten rid of Chrome. We actually don't have a break coming up anyway, so that's okay. But the answer to that question is: very little.
So Dask right now has 16 workers because I set up a cluster that has 16 workers. In this case I could instead set up a cluster that has 8 workers, and I can do this manually by typing out the GPUs I want: maybe I don't want GPU 3, maybe I only want GPUs 0, 1, 2, and then 7, 8, 9, etcetera. This is how you would set that up.
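A sketch, with illustrative device indices:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Only these six GPUs get workers; the device list is illustrative.
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1,2,7,8,9")
client = Client(cluster)
```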
Good question. Currently, LocalCUDACluster is going to use the CPUs available for spilling, but it's not going to use the CPUs for compute. So if you're thinking about a world in which we have these eight GPUs and we have various CPUs as well, this single cluster is not going to be able to use, for example, the 40 CPU threads as well as the eight GPUs for compute; that's a more advanced setup beyond this baseline.
Good question. Perhaps there's a memory error because there was slightly more data than could fit in GPU memory, or the computation spiked memory slightly more than expected, or the GPU was shared, or there were already objects in memory. I can free this memory explicitly. I'm working in Jupyter notebooks here, and many of you have probably experienced this: it's a little bit more difficult to free memory, because Jupyter holds onto references. But in general, I can free the memory associated with this.
Sorry, I killed my kernel, but I can free the memory associated with this u array, which we know is fairly large, or the x array that was 40 gigabytes, by simply calling del x, just like you normally would. This will trigger garbage collection for Python and for Dask. Now, because it's Jupyter, you might actually also need to delete some of the hanging references, but in general, that's how you'd do it.
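A sketch of freeing that memory:

```python
# Dropping the last references lets Python garbage-collect the objects and
# lets Dask release the corresponding chunks of GPU memory on the workers.
del x, u, s, v
```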