From YouTube: 6. Introduction to Dask
Description
From the NERSC NVIDIA RAPIDS Workshop on April 14, 2020. Please see https://www.nersc.gov/users/training/events/rapids-hackathon/ for all course materials.
A
So the idea of Dask is that it provides us with a single API which allows us to enable parallel processing of our workloads, which is especially important when you are working with large datasets, where essentially we become memory-bound or compute-bound, and to enable the power of using multiple nodes and multiple GPUs, we use Dask. So what is Dask? Dask is essentially a distributed compute scheduler, which is built to scale Python.
A
The idea is that you can write your code on your laptop and then simply scale it to a supercomputing cluster. The idea is that you want to use multiple workers, and this really helps, especially with RAPIDS, where you have one worker per GPU, so you will have the number of workers set to the number of GPUs you have. And why Dask? What are the advantages that Dask offers?
A
A major advantage that Dask offers is PyData nativity. What that means is that the APIs we are already familiar with, whether it's NumPy or scikit-learn, can easily be translated across multiple nodes, as the API itself remains almost the same, so the user becomes agnostic to the fact that the code is right now running on a single GPU, or a single CPU, or across N nodes.
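As a minimal sketch of that API parity (assuming NumPy and the `dask` package are installed), the same call names work in both libraries:

```python
import numpy as np
import dask.array as da

# The NumPy version: eager, fully in memory.
np_result = np.arange(10).sum()

# The Dask version: the same method names, but distributed-ready;
# compute() triggers the actual execution.
da_result = da.arange(10, chunks=5).sum().compute()

assert np_result == da_result == 45
```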
A
Now let's look at what a NumPy array looks like when you're working with Dask. Essentially, we again will have multiple partitions here. The example is a 2D NumPy array, but it can be an N-dimensional NumPy array, which will be partitioned, and Dask essentially becomes a wrapper around those partitions. And when we want to go from NumPy to CuPy, so that we can get better parallelism with our GPUs, we simply translate it.
A
The idea here is that we split our execution part from our task-graph creation part. In this example, you can see that if you wanted to create a NumPy array of ones of size 50, you would simply call np.ones; if you want to do the same thing in Dask, you simply call da.ones with size equal to 50, but there's another argument here that's important, which is chunks. So here we set our chunk size to 5.
A
What this translates to is that we want to chunk the NumPy array into chunks each of length 5. But at this point, when we create x = da.ones(...), we essentially haven't created anything; this is just a task graph that we create. Only when we execute the task graph will these tasks actually be executed, and then the data will come into memory and the processing will happen. This essentially is a lazy operation. So at this point we only have this task graph.
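The example above can be sketched as follows (a minimal illustration assuming the `dask` package is installed):

```python
import numpy as np
import dask.array as da

# Eager NumPy: the array of ones exists in memory immediately.
np_ones = np.ones(50)

# Lazy Dask: this only records a task graph of 10 chunks of length 5;
# nothing is materialized yet.
x = da.ones(50, chunks=5)
print(x.numblocks)   # (10,) -> ten chunks

# Only .compute() executes the graph and brings the result into memory.
result = x.compute()
```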
A
Now let's look at a small computation that we do on the NumPy array here. Essentially, we call sum on the previous array that we created. How the task graph for this changes is that we now have a reduction operation here, so for those of you who have experience with MPI reduce, this will look familiar; it is similar to the reduction that's happening here in Dask.
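A sketch of the reduction described above, assuming the `dask` package is installed:

```python
import dask.array as da

x = da.ones(50, chunks=5)

# .sum() appends a tree reduction to the task graph: each of the 10
# chunks is summed locally, then the partial sums are combined,
# much like an MPI reduce.
total = x.sum()

# Still lazy; compute() runs the graph.
value = total.compute()
print(value)   # 50.0
```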
A
Now, as I said, if you look at this, the API is fairly simple and it will be similar to what you would write in NumPy, but under the hood Dask will create a complex task graph for you. So this is an example of a statistical computation, and you can see how that task graph gets created.
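The exact statistic from the slide isn't reproduced here, but as an illustrative sketch (assuming the `dask` package is installed), a hand-written variance is enough to see a non-trivial graph build up behind a few lines of NumPy-style code:

```python
import dask.array as da

# A hand-written variance as a stand-in for the slide's computation.
rng = da.random.RandomState(42)
x = rng.normal(0, 1, size=(1000,), chunks=100)
y = ((x - x.mean()) ** 2).sum() / x.size

# Nothing has run yet; y is a task graph Dask built for us, already
# far larger than the three lines above suggest.
n_tasks = len(dict(y.__dask_graph__()))

result = y.compute()
```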
A
Dask offers us really nice visual debugging tools, which allow us to actually see how the scheduling is happening. In this task graph, which is essentially the graph of a machine learning workflow, we can see that the red here signifies the tasks that we have to hold in memory, and the blue here signifies the tasks that have already been scheduled and whose memory is now freed up. So the Dask scheduler essentially handles all the garbage collection for us.
A
So all the task orchestration as well as the memory management is taken care of by the Dask distributed scheduler. The scheduler essentially will say: okay, on this nth worker, please schedule this task; and once the task is completed and the memory is no longer needed, it will free up the task. Also, the scheduler is a single-threaded scheduler, and it is essentially responsible for deciding which task goes to which worker.
A
Okay, now let's see how we actually create a Dask cluster. Creating a cluster is essentially just what we see on the screen, and this API should remain constant whether we are doing it on a local machine or we are creating one really big cluster. Those of you who go into the lab after this will experiment and will see how we can create a LocalCUDACluster, and that essentially should translate as well to when you are doing it on a multi-node system.
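The GPU version in the lab comes from the `dask_cuda` package; as a minimal sketch here, the CPU-only `LocalCluster` follows the same create-cluster-then-Client shape (assuming `dask.distributed` is installed; `LocalCUDACluster` additionally needs GPUs):

```python
from dask.distributed import Client, LocalCluster
# GPU analogue, one worker per GPU (requires dask-cuda and GPUs):
#   from dask_cuda import LocalCUDACluster
#   cluster = LocalCUDACluster()

cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)

# Any work submitted through `client` now runs on the cluster's workers.
result = client.submit(lambda a: a + 1, 41).result()
print(result)   # 42

client.close()
cluster.close()
```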
A
In essence, you can switch out your local CUDA cluster for a different cluster. Say, if you are doing this on Kubernetes, you can simply switch the cluster by using this statement. An important thing to note here is that the rest of the code remains the same. So you really don't have to worry whether you wrote the code on your laptop, or somewhere else, or whether it's being deployed somewhere else, except for the cluster-creation step, essentially these four lines; everything else downstream remains the same.
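A hedged sketch of that swap, using the `dask_kubernetes` package's `KubeCluster` as the alternative backend (the pod specification is deployment-specific and omitted, so treat this as a shape, not a runnable recipe):

```python
from dask.distributed import Client

# Local development:
#   from dask_cuda import LocalCUDACluster
#   cluster = LocalCUDACluster()

# Same workflow on Kubernetes: only the cluster line changes.
from dask_kubernetes import KubeCluster

cluster = KubeCluster()   # pod spec / worker template omitted here
client = Client(cluster)

# Everything downstream of Client(...) stays exactly the same.
```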
A
The other thing to note is that a lot of times, when you're working on multi-node workflows, communication often becomes a bottleneck, and Dask natively uses TCP sockets for communicating data between workers or from the scheduler to a worker. What UCX, unified communication, provides us is that it allows us to access any hardware acceleration we have for communication, so it may use specialized hardware like NVLink or InfiniBand. And to enable using UCX with Dask, the RAPIDS team wrote Python bindings for UCX: UCX natively is in C++, but we wrote Python bindings, and what this unlocks for us is that we can now use UCX with Dask for Python-based communication.
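Per `dask-cuda`'s documented options, switching the cluster from TCP to UCX is roughly one keyword change (a sketch; it needs GPUs plus the UCX-Py bindings installed):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Ask the cluster to talk UCX instead of the default TCP sockets,
# so NVLink and InfiniBand can carry worker-to-worker traffic.
cluster = LocalCUDACluster(
    protocol="ucx",
    enable_nvlink=True,
    enable_infiniband=True,
)
client = Client(cluster)
```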
Now let's look at an example where this becomes important. Everything that you see here, again, will be available on the Dask dashboard you'll see in the lab, where you can actually see how your tasks are being scheduled as well as their timing. So in this example, we have four workers and we're doing an SVD with CuPy arrays lying underneath. If you're doing it just over TCP, the teal-colored tasks are computation Dask is doing, and the red here is communication. As you can see, after our initial computation, the communication becomes a bottleneck. Once we switch out our TCP for UCX, that bottleneck goes away, and we bring our task time down from 25 seconds to 17 seconds. Why this is important is that a lot of times, our workflows become really communication-bottlenecked, and with UCX you can accelerate even that part.
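The benchmark runs SVD on GPU (CuPy-backed) Dask arrays; a small CPU sketch of the same call, assuming the `dask` package is installed, looks like this (`da.linalg.svd` expects tall-and-skinny chunking):

```python
import dask.array as da

# CPU stand-in for the benchmark's CuPy-backed arrays: a tall-and-skinny
# matrix, chunked along the long axis only.
x = da.random.random((1000, 20), chunks=(250, 20))

# Lazy SVD; u, s, v are themselves Dask arrays until computed.
u, s, v = da.linalg.svd(x)

singular_values = s.compute()
print(singular_values.shape)   # (20,)
```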
A
So the idea of this presentation is: we saw how we can get speedups on single nodes, either by using CuPy, cuDF, cuML, or Numba. And now, when you want to unlock bigger workflows that either (a) don't fit on a single machine, or (b) are communication-bound or require multiple GPUs, we can do them with Dask without a lot of code changes.
A
Okay, now, how do we get the software? I think this should have come at the start, but we have our own GitHub page, because RAPIDS is an open-source project, so you can look at what's in development there. And installing any of our libraries is as simple as installing the conda packages. So you can look at the Anaconda page itself for our releases. That's all I have for you right now, and I'm open to any questions about the workflows.
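The install commands look roughly like the following; the exact package list, channels, and version pins change per release, so check the RAPIDS release selector rather than copying this verbatim:

```shell
# Illustrative shape of a RAPIDS conda install (version pins omitted).
conda install -c rapidsai -c nvidia -c conda-forge cudf cuml dask-cuda
```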
C
Just to add one additional input there. So, you know, there's mpi4py, which is useful, but it has its pain points; it can be limiting and more difficult. One particular difference is really what I would describe as loose versus tight coupling. The MPI parallelism model is pretty tightly defined, and that's fantastic; it provides a lot of benefits through its guarantees.
C
Dask has a very flexible relationship with parallelism, and as a result of that, you can use Dask in areas where you might not be able to as cleanly use MPI; and where you can use both of them, in certain areas it can still perform extremely well. So there's a lot of flexibility that Dask provides out of the box that would be very difficult to consistently and cleanly do with a different paradigm.
C
Yes, we do. Rather than talk about that here, because I think that's a really deep conversation that might be better as a follow-up, perhaps later, I can instead say that there is actually an example of this: there's a library in the Dask repository on GitHub called Dask-MPI, which is something that I encourage you to look into, and perhaps we can chat about that later. Okay.
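For context, the `dask_mpi` library referenced here exposes an `initialize()` entry point; a hedged sketch of the usual pattern (the script must be launched under an MPI runner, so this is a fragment, not standalone code):

```python
# Launched as: mpirun -np 4 python this_script.py
from dask_mpi import initialize
from dask.distributed import Client

initialize()        # rank 0 runs the scheduler, rank 1 this client code,
                    # and the remaining ranks become workers
client = Client()   # connects to the scheduler initialize() started
```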