From YouTube: 8. Demo: Accelerating a Real Workflow
Description
From the NERSC NVIDIA RAPIDS Workshop on April 14, 2020. Please see https://www.nersc.gov/users/training/events/rapids-hackathon/ for all course materials.
A
So in this segment we're going to talk about a couple of things: we're going to actually accelerate a real workflow, and I believe we have Taylor Groves in attendance. We're going to walk through the workflow that he's generously allowed us to use as a guinea pig for how we can take a CPU workflow, evaluate it, understand it, convert it to run on the GPU with RAPIDS, and then get some serious speedups. So thank you, Taylor.
B
Yeah, yes. For this workflow: at NERSC we collect a lot of counters on our systems, running every second. There are probably one or two thousand counters per switch, which we have on the system collecting data every second. That gives us information about how the network performed, and my background is looking at our high-speed network performance and trying to improve it; so loading in all this data on the CPU takes a lot of time.
A
Yeah, thanks so much. Okay, so with that, we're going to go through this workflow and then actually do the port live, because everyone always says live coding is always a good idea. So we're going to do that, and then hopefully at some brief moments during this we're going to take some stock and say: this is why this makes sense, this is the way we're thinking about this, and here's how to think about structuring workflows for the GPU, because often it's a little bit different than thinking about the CPU.
A

You can see that it's got some standard imports: pandas, multiprocessing, timing, NumPy, et cetera. He's also written some custom modules and four functions that he's using to do that analytical work, the processing of these counters to make them human-informative that he was talking about, and that's something we're going to have to look at as well. So let's just actually go through this and take a look at what's happening. Taylor has also created a timer object to help us understand how long things take, which is very helpful, and this workflow is being distributed across cores in the system.
A

This is called combine_query_dataframes, and we want to parallelize it; we want to parallelize it with the multiprocessing Pool API, which is the standard, canonical API for spreading Python work across multiple processes, each one using a different core. You map your functions with the pool's mapper, and then you have to make sure that you join and close at the end, before returning your data, to synchronize. And this parallelize function essentially just takes an arbitrary function, it looks like, and it's going to take this function.
A

We're going to create a pool of that many processes and map each chunk to one process, then map the function to that data, run it in separate processes, and bring it back together; makes perfect sense. And then this function run_on_subset looks like a way to actually run the function, in particular running a function in a way that looks like it's going to be a row-wise operation on a pandas DataFrame. This axis=1 is a giveaway, for those familiar with the pandas API, that this is probably going to be a pandas data set, this data_subset.
We don't know that yet, because we haven't seen it, but it's kind of a giveaway. And then there's this wrapper function, which is a wrapper around parallelize to allow passing arguments to it; it's using the functools.partial API to pass different arguments and things like that. So that's it. So let's actually take a look at this workflow. There's a sample of data here; this is not terabytes of data, which would take too long for us to go through. This is a sample of data.
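The run_on_subset plus functools.partial pattern described above can be sketched like this; the names follow the talk, but the implementation details are assumptions:

```python
from functools import partial

import pandas as pd

def run_on_subset(func, data_subset):
    """Apply `func` to every row of a chunk; axis=1 is the row-wise
    giveaway discussed above."""
    return data_subset.apply(func, axis=1)

# functools.partial freezes the first argument, leaving a one-argument
# callable that a pool mapper (or plain map) can call on each chunk.
row_sum = partial(run_on_subset, lambda row: row["a"] + row["b"])

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})
result = row_sum(df)
```

This is how extra arguments reach the worker function even though `Pool.map` only passes one argument per item.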
I think it's a few gigabytes; I forget how much. But you can see that I haven't cleared my cell: it's going to be three and a half million rows, and we're going to read this into memory, so it's going to take a little time. So, while this is going, I'm going to explain what's going to come next in this workflow. We looked through this, and we saw that the first thing Taylor does is sort the data by time.
A

Okay, makes sense: sorting by time. Then we can take a look at the data. This data looks like this: it's got a time column, and it's got this column that, anecdotally, I know is going to be about these different systems, getting counters and using these things; but I'm not well versed in this. We'll work with Taylor to get a better understanding, and then we get an understanding of what's actually going on. So in this case the key aspect here is that there are 800 columns and three and a half million rows.
A

So this is a large amount of data to process. We're going to do this kind of counting for every row, and we have millions of rows, we have hundreds of columns, and the counting logic is actually fairly complex. But we also see that there's an identifier for which router this came from; it just looks like a hash or something. So Taylor has provided some examples of doing this.
A

So with one process we're going to test this with different numbers of rows: for i in range(1, 3), so i is 1 or 2, we're going to do either 10 rows or 100 rows, because it's 10 to the i; then take a sample and try this to see how it scales, and the rest of this code is about estimating that. So we'll run this, and we'll see that one process could do this aggregate_vcs function, the one being parallelized, on 10 rows in about 1.6 seconds.
A

So naively, if we have one process doing it in 1.6 seconds, we're already really concerned, because we have millions of rows. And we see that we have linear scaling, because when we scaled up to 100 rows, it was roughly 10 times slower than 1.6 seconds. Let's call the first one one and a half seconds and this one 15; we're scaling linearly, in rows and in processes.
A

If we have four processes, 10 rows took maybe half a second, and 100 rows four seconds. The estimated time on this machine to process all the rows at this rate is two and a half hours. In this case, that is too long, because this is only a sample of the data; we can't wait two and a half hours. We have to speed this up; we know that's the goal. So let's see what happens next. This is the second part of the workflow. We had this aggregate_vcs function, which we don't really understand.
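The back-of-the-envelope estimate described here can be written down directly. This is a sketch; the helper name and the linear-scaling assumption are mine, not code from the notebook:

```python
import time

import pandas as pd

def estimate_total_seconds(func, df, n_sample, n_total):
    """Time a row-wise apply on a small head() sample and extrapolate
    linearly to the full row count, mirroring the estimate in the demo."""
    sample = df.head(n_sample)
    start = time.perf_counter()
    sample.apply(func, axis=1)
    elapsed = time.perf_counter() - start
    return elapsed * (n_total / n_sample)
```

With ~1.6 s for 10 rows on one process, this kind of extrapolation is exactly what yields the "two and a half hours for 3.5 million rows" figure.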
A

We have this other function, which we also know does something on a row-by-row basis, because we have the parallelize-on-rows wrapper, and we can see how long it's going to take. I think we have some intuition that this might be fairly time-consuming, but we can see that in general we're getting some good speedup when we go to more processes: a thousand rows only took four and a half seconds, then a thousand rows took one and a half seconds, and even one second with eight processes.
A
So
this
actually
runs
a
lot
faster,
but
it
still
might
benefit
from
speeding
up.
So
at
this
point,
we've
got
a
sense
of
the
workflow.
We
see
that
there's
a
clear
bottleneck
right
here,
but
we
don't
really
understand
it.
So
now
our
job
is
to
say:
what's
going
on
in
these
functions,
we've
gotten
a
sense
of
what's
happening.
We
know
that
the
output
we
want
is
this
temp
color
data
frame,
and
it's
created
a
bunch
of
new
information
columns
in
this
data
frame.
We
had
800
columns
before
roughly
400
43.
A
Now
we
have
almost
1100
columns,
so
we've
created
a
lot
of
new
information
at
the
end
of
this
data
frame,
which
is
the
goal
lots
of
good
stuff.
So
at
this
point,
we're
ready
to
say
well,
what's
going
on
in
these
functions,
we
noticed
that
sorry
lost
my
place.
All
of
these
functions
were
using
come
from
this
ldms
PP
module.
A
It's
the
same
one,
it's
the
same
one.
Let's
take
a
look
at
this
ldms
pp
module,
we're
importing
it.
Let's
take
a
look,
and
please
let
me
know
if
you'd
like
me,
to
make
the
font
larger
I'm
happy
to
do
that,
because
I
can
see
it's
a
little.
Maybe
let's
do
it
anyway,
so
this
is
a
module
with
a
lot
of
functions,
figuring
their
own
module,
your
insular
yes,
I
am
thank
you
thanks
Taylor.
So
this
is
a
function.
This
is
a
module.
A

Let's see what goes into this. We'll find this function, and here we go. In fact there are actually two versions of this; this one was ten percent faster, so this is the one that was part of the workflow. There's a lot of stuff going on here; it's a fairly large function. So let's try to unpack it.
A

We know it's operating on pandas DataFrames, so we know that it has some implicit structure about how it does processing, and that processing is on a row-by-row basis. So this row is the unit of account, essentially. This row is going to have attached to it all of the columns in the DataFrame for that given row; that's just the way that we apply functions on a pandas DataFrame. So we know that, and so we can see...
A

There are some loops here, actually a nested loop: we're doing something five times, and then for every one of those things we're doing another thing eight times, so we're doing something 40 times. What are we doing? Well, we're creating some variables and initializing them to zero; these are the flits.
A

We're also creating some strings, and it turns out we're going to create the strings based on where we are in these loops. Okay, so we know that the looping logic is important for some kind of strings we're creating. Then we see the same thing, but it's now for stalls rather than flits, so it looks like F and S are indicators that are prepended on these names to tell us what we're working on, and it's the same kind of information.
A

You know, there are different things being created, and these look like incoming packets versus incoming flits or something; I'm not well versed in this domain, but it seems like we've got a sense of what's happening: we're creating some labels, and we'll probably do something with them. So then we get to some more logic: for every one of these iterations, for every set of these things that we've created, we're going to loop through and grab the value from this column.
A

So this column is presumably in the DataFrame, and we're going to grab the value (it's numeric, because this DataFrame was numeric) and add it to this total, which we have up here. So we're doing a sum across several columns, in this case four columns, and we're defining which columns we're using based on this r value, this c value, and this vc value in these loops.
A
Okay.
So
this
is
a
binary
operation,
we're
doing
a
binary
operation
and
then
we're
doing
a
reduction,
we're
doing
a
sum.
Essentially
we're
just
crab.
Sorry
we're
doing
sorry
we're
doing
a
reduction
operation.
It's
this!
It's
a
something,
we're
grabbing
it
and
then
we're
doing
a
sum.
The
total,
after
the
by
after
the
binary
operation
of
the
addition-
and
so
we
know
we're
just
doing
addition
and
summation.
So
that's
pretty
good.
So
far,
we've
got
a
sense
of
what's
happening.
We
see
it's
happening
again
here.
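That per-row accumulate pattern might look like the following minimal sketch; the column-naming scheme here is invented for illustration, and the real ldms_pp code differs:

```python
import pandas as pd

def aggregate_row(row, r=0, vc=0):
    """Sketch of the pattern described above: build column names from
    the loop indices, look each one up in the row, and accumulate a
    sum -- a binary op (+) feeding a reduction (the total)."""
    total = 0
    for c in range(4):  # four columns feed each total
        total += row[f"f_r{r}_c{c}_vc{vc}"]
    return total

row = pd.Series({f"f_r0_c{c}_vc0": c + 1 for c in range(4)})
```

Calling `aggregate_row(row)` sums the four matching entries of the row.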
A

The same thing: addition, summation, binary ops, reductions. Those are great; GPUs love them. Then we're doing some more addition work. It looks like, after we've gone through these, we're going to combine the vc_req and the vc_resp to create the overall flits_vcx, and then the stalls_vcx, and we're going to do something with these, presumably. Okay, here's the answer: those labels we created up here. We created them...
A
We're
gonna,
make
a
new
column
in
the
data
frame
and
we're
gonna
put
that
sum
that
we
calculated
up
here
or
we
initialized
up
here,
and
we
add
a
two
right
here:
we're
gonna
make
a
new
column
and
put
that
sum
in
there
and
we're
gonna.
Do
it
a
bunch
of
times?
Actually
we're
gonna.
Do
it
for
that
one
for
this
one
and
we're
gonna
do
per
then
we're
gonna
do
the
same
thing
for
the
stalls.
We're
gonna
take
this
stall
some
this
one
in
this
one.
A

Here we go: get_per_router_counters_by_color. We're going to take in a row, just like before, and we're going to initialize some things, just like before: some counters, presumably. And then we're going to loop through this index of the row, and in this case the index of the row is going to be all the different identifiers associated with it. So what this is really saying is that we want to loop through and evaluate... oh, I don't have it handy here, but it'll be on this one.
A

We want to evaluate this router ID here, this c7 vc1 and so on; that's what this index is. Then, actually, sorry: that's what the name is, I think, and the index is going to be the actual columns themselves. So we're going to evaluate all of the columns (that's the index; I'm getting my pandas logic mixed up), we're going to evaluate all the columns and loop through them, and then for every column we're going to run this get_tile_number function on it. So that's going to give us something.
A

This is a fairly clear pattern. We're going to capture a pattern of a C followed by some number of digits, then a dash, then some arbitrary number of digits, then (excuse me) a C and more digits, and an S, and so on. And then we're going to see if we can match this pattern to that name, and the name is this: this is the thing we're trying to match against, right here, these things. And it makes sense.
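As a rough illustration of the kind of match being described (the real expression in ldms_pp has eight capture groups and a different shape), a regex like this captures a trailing "slot" number from a counter name:

```python
import re

# Hypothetical pattern in the spirit of get_tile_number: a letter and
# digits, a dash, more digits, then c<digits>s<digits>; the final
# captured group is treated as the slot.
TILE_RE = re.compile(r"r(\d+)-(\d+)c(\d+)s(\d+)")

def get_tile_number(name):
    m = TILE_RE.search(name)
    return int(m.group(4)) if m else None  # last group = slot
```

The real function captures the eighth group of its pattern; the idea is the same, just with a longer pattern.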
A
We
see
that
there's
the
letter,
the
number
the
dash
and
so
on,
so
that
checks
out
we're
gonna.
Do
this
and
we're
capture
that
last
match
the
8th
group
1
2
2
3
4
5
6
7
8.
So
we
want
that
final
one,
we're
gonna
call
that
slot.
Ok,
so
you
can
see
our
already.
This
is
a
fairly
you
know:
complex
amount
of
stuff
happening
it
taking
a
little
time
to
understand.
A
So
now
we
get
to
some
branching
logic:
we've
got
our
slot
and
if
these
teams
that
we
created-
which
we
don't
understand
yet-
but
if
this
tile
number
is
within
certain
conditions,
there's
some
branching
logic.
We're
gonna
do
things.
If
this
is
the
case
and
if
this
column
has
flit
vcx
in
it,
you
know
this
one
doesn't
have
flit
vcx.
Neither
does
this
one.
Neither
does
this
one,
but
some
of
them
do.
Presumably
if
it
has
flit
vcx
in
it,
we're
gonna
do
some
counting,
we're
gonna.
C
A

We can see the actual numbers, but they're not super important from the GPU perspective; we can just keep the same branching logic. But we're going to have a new condition right now: this is again a condition that looks like it's got another sub-condition, so it's a little more complicated. It's using that slot that we just had, so we're going to say: if slot is less than eight and flits are in the counter (if flit_vcx is in the column name), add it to green. Okay, so we're getting a sense.
A

The color here is important. Same thing with stalls. And then we're going to add it to black if slot is greater than or equal to eight, because that's the else statement. So we're getting a sense of what's happening. This logic then continues, with this further branching: if this is not true, then we go to the else, and then we do the same logic, but for different things. At this point, then, we're going to return.
A

It looks like some new columns that we're calling router_flits_black, router_flits_blue, flits_green, etc., and these are just the sums that we've calculated. So at this point I think we have a pretty good handle on this workflow: we're creating new columns that sum up the counters from all these different sensors or routers that are coming through the system, and spitting out the results. This is actually just like Taylor explained; it's great when it works out like that. And so the next step we would have is: well...
A

We know it's slow, but why is it slow? It's almost always the right approach to start by profiling the workflow that exists already, before doing anything else. So I'm going to do that. I'm going to use a tool (and I'm zooming out just for a moment so I can do this cleanly); I'm going to create a Python script to help me profile this, and I'm going to use a tool called SnakeViz, which I highly recommend for those who are not familiar with it.
A

It's a tool for visual profiling, and it does work in Jupyter notebooks, but it will be cleaner to do it in a Python script, so I'm going to do it here. I'm actually not going to use any of the parallelism, because it doesn't change what we learn from the profile; it would just capture the overhead, and it would make profiling very hard because it's multi-process. So I'm not going to run the multiprocessing; I'm going to use this to do the same sort (obviously I want to make sure we're on the same workflow). Instead, what I'm just going to do is run the function directly; that's fine with me. So I'm going to take these right here and just run them, not through the wrapper: I'm literally just going to apply this to my DataFrame and say temp.apply(...).
D

Hey Nick, sorry to interrupt; we lost your sound. Can you hear me?
A

Sorry, my headphones are acting up; apologies for that. I'm not sure when I went out, but what I'm going to do here is profile the workflow using something called cProfile, and then visualize it in SnakeViz, which is a library for visual profiling. The cProfile bit is baked into Python. I'm going to save the result as this file, I'm going to run this, and... I have to activate my environment.
A

Now I'm going to run this, now that I have pandas available; this is, I guess, a good lesson in making sure you're in the conda environment you think you're in. So this is going to run, and it's going to take a little bit of time, not too much, but we know that there's a lot of data being read by the pandas data reader right now, so maybe it'll take 30 or 45 seconds. While this is running, I'm going to show you what SnakeViz is.
A

So this is what I'm actually going to do now; I don't need this dashboard anymore. This is going to be done in a moment. What I'm going to do once this is finished is basically run this exactly, and so I'm going to take the results of this profile. Okay, so it's still going; it's taking a bit of time to sort. Then it's just going to run this apply, then this apply, and it should be done shortly, hopefully. I guess this speaks to the point that we need to...
D

[Inaudible question.]

A

A great question. So it is informative, though not necessarily for telling you exactly where to go on the GPU, or where you're going to spend the same amount of time, because to your point you might not. But it's very informative for thinking about how to attack the problem. It's possible this profile will tell us that certain things are not necessary for a first pass to put on the GPU; for example, right now we are time-boxed.
A

Awesome. This is probably my fault: I should have made this more like ten or a hundred rows, because this is why it's taking so long. We saw that this thing scaled linearly (and we saw that the other one doesn't), so this is going to take a thousand seconds; I didn't think this through, sorry. I'm going to kick this off again using ten records, so we'll have to wait a little bit longer. But while this is happening, I'll go back and get set up to do a RAPIDS version.
A

While this is being profiled: in the RAPIDS version we're going to want most of the same things, because we're probably going to need these functions, but I'm not going to grab them from the same libraries. Instead I'm going to copy them from this script and bring them up. We know we need this get_per_router_counters_by_color, so I'm just going to put it here, where it's easier for us to see. We know we need that, and we know we need this other function, get_tile_number, and there it is.
A

We'll call these the original functions. We know we need these, and we also know we need that aggregate_vcs function, which is right here. So these are our functions; these are the things we're operating with. Okay, there: this finished, great. So you can see now that I have a profile right here, an initial_workflow.prof. This is a cProfile output, and it's fairly scary to look at if you're not experienced with it, but it actually has a very specific structure.
A

It's very straightforward once you understand it: every line is essentially part of the call stack of everything that's happening, and it's just measuring time in the call stack. So what I'm going to do now is use SnakeViz to visualize this profile, this initial_workflow.prof. Sorry, I'm going to...
A

...not do that; I'm going to pass in some configurations that determine what ports to use. This is basically saying localhost on port 8080, and it's going to give me a web server. So I'm going to go to this web server, and it's coming up right here for me. This is our profile.
A

So SnakeViz is going to make that icicle plot for us. It ran for about 50 seconds, and most of the time was reading data. That makes sense: this was a very large data set, and we didn't do that much compute. But this is also a fixed cost, so I'm not that concerned about it right now; I want to see where we did things. Okay.
A

That's consistent with what we expected from the pandas version, where we saw this function scale very, very slowly. But now we can see clearly that it's scaling slowly because of all these __setitem__ calls, and in particular we're using these __setitem__ calls to create new columns. So creating columns is what's taking a really long time here, and that gives us insight into this workflow in general. But really the key thing here is: for the next 30 minutes, let's not focus on get_per_router_counters.
A

We know that eventually we can handle that one, because we've actually done it before; we've done this in the past. But let's focus now on aggregate_vcs, because 95 or 98 percent of the time in this tiny example is spent in this function, and the fraction would be even larger in the real example, probably 99.5 percent if we did this at scale.
A

So now we know; that's great. We can kill this profile; we're good to go, we're ready to start. So in this case we're still going to import this data set with pandas, because it's a fixed cost, and the pandas HDF5 reader is what we're going to use, because cuDF doesn't yet support reading HDF5 directly into GPU memory. So we're going to read the data in just like before, and we can get a little more set up while that's happening. Again, we could sort this on the GPU; we saw that sorting is very fast on the GPU.
A

But again, it's not super important, because these are fixed costs, and so we can just run them. I'm going to switch this to not be an in-place operation, but still run the same sort. And when we're developing this workflow, we don't want to work on this total data set; this is a large data set, and we saw it was three and a half million rows. We want to work on a sample, and so I'm going to call this query_sample, and I'm going to just take the first maybe ten rows.
A

I'm going to make a copy; that way we don't end up actually mutating the original data set. It's very often the case that we accidentally mutate datasets, and that's what causes that SettingWithCopy warning that I mentioned earlier in the day. So I'm going to take a sample of this, and I'm also going to have to import cuDF, because I want to do this on the GPU, and maybe I'll call it query_sample_gpu = cudf.from_pandas(...).
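The sample-and-copy pattern looks roughly like this; the cuDF line is shown as a comment, since it assumes a GPU machine with RAPIDS installed:

```python
import pandas as pd

df = pd.DataFrame({"x": range(1_000)})

# Develop against a small copy so the full data set is never mutated;
# writing through a slice of the original is what triggers pandas'
# SettingWithCopyWarning.
query_sample = df.head(10).copy()
query_sample["y"] = query_sample["x"] * 2  # safe: writes to the copy

# On a GPU machine the same frame would move over with:
#   import cudf
#   query_sample_gpu = cudf.from_pandas(query_sample)
```

The original `df` is untouched, which is the whole point of the `.copy()`.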
A

There we go. So this is, again, going to be quick... oh, right, okay: we don't support this timedelta timestamp, which is the way that pandas read in this data. We do support datetimes; we don't support this timedelta. So we can find that column, this timedelta. We weren't using this duration column in the workflow, so we can temporarily drop it. We could, of course, just cast it to a different type; we could do different things with this.
A

We could change the structure to be, instead of a timedelta, a start and an end. But for now I'm just going to get rid of it and say .drop('duration', axis=1). Now, this is the same way you'd drop a column in pandas, so it should look fairly familiar, and now we can put this on the GPU, and I'll call this query_sample_gpu.
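A small sketch of the dtype issue and the fix; the column names are stand-ins for the real data:

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(["2020-04-14", "2020-04-15"]),
    "duration": pd.to_timedelta(["1s", "2s"]),  # timedelta64 column
    "counter": [10, 20],
})

# At the time of this talk cuDF handled datetimes but not timedeltas,
# and `duration` isn't used downstream, so drop it before transferring.
gpu_ready = df.drop("duration", axis=1)
```

`drop(..., axis=1)` returns a new frame, so the original keeps its `duration` column if you need it later.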
A

These are independent binary ops and reductions that could be done in parallel. So I suspect there's a way we can take this logic and, instead of having it work on a row basis, have it work on a column basis, and I suspect that would actually help both on the CPU and on the GPU. That's the suspicion from looking at this function; that's sort of the first way in.
A

So let's define a function just like aggregate_vcs, but let's call it columnar_aggregate_vcs, and we don't want it to work on the row; we want it to work on the entire DataFrame. That's what we want, so we kind of have a sense of what we want to do: let's operate on columns, not rows. That's the loose sense that we have. We want to sum, across each row, over specific columns; actually, that's what we want to do, right?
A

We want to put into this new column a bunch of different information, first coming from these four columns, then coming from these four columns, as defined by these loops. So we want to operate on columns; we want to sum across specific columns in each row, and we can probably do this in a couple of stages.
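The columnar rewrite boils down to one vectorized call; the column names here are invented stand-ins for the flits_vc_req family:

```python
import pandas as pd

df = pd.DataFrame({
    "f_vc0": [1, 2], "f_vc1": [10, 20],
    "f_vc2": [100, 200], "f_vc3": [1000, 2000],
})

# Instead of looping per row, select the columns of interest and do a
# single row-wise reduction across them: one vectorized call that
# pandas executes in bulk, and that cuDF runs unchanged on the GPU.
cols_of_interest = ["f_vc0", "f_vc1", "f_vc2", "f_vc3"]
df["flits_vc_req"] = df[cols_of_interest].sum(axis=1)
```

This replaces the per-row Python loop-and-add with one `sum(axis=1)` over the whole frame.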
A

A single row-wise binary op and reduction: addition plus a sum. In theory it seems like this will avoid a lot of the pain that's coming from all these repeated calls, so let's try to do that. So, with that as the plan: we still need to generate the same output columns. We need to do that, so we probably can't avoid this loop.
A

The first thing about this: it's common, for myself and probably for others, to say loops are bad, loops are time-consuming, can we avoid loops? And of course we want to avoid loops where we can, but sometimes we just need loops. So when I'm looking at these workflows and thinking about how to profile and how to port them: we don't necessarily have to move the waterfall all at once; we can move the waterfall inch by inch. And so we can start by saying we probably need these same output columns.
A

So we might still need these loops; let's keep them. And maybe we need this try: we can still wrap these in a try/except in case there's an error. Perhaps that's not necessary, but it might be. So we probably need to do a lot of the same stuff. We can start with the flits, so let's start with the flits, and we might need all the same information; let's find out. Well, we know we want to create these output columns.
A

So we probably still need these; that's what we're creating, so these make sense. But do we need these counters? Do we need to initialize a counter to zero if we're going to do a single row-wise binary op? For this binary op we can use the pandas API, or the cuDF API, to take df[columns_of_interest]...
A

...and do that sum row-wise. That's what we can do, so we probably don't need to initialize these; we probably instead want to think about it a little differently. We definitely need these columns, I feel like, but we want to get the columns we need instead of initializing one at a time. Let's get the original columns we need, for flits and stalls. These kind of separate out; again, you have to do them separately for all these different things, right? Like, we need this req label to be this vc_req thing.
A

So we know that we probably still want this loop, but we want to do it differently: we need the columns of interest to create these columns. So maybe what we can do is get all the columns at once and add them to this list, and then we'll do the same thing for this one, the stalls_vc_req; we'll do the same thing there. And also, please stop me if there are any questions; or, Lori, if you think you can save them, that's fine too.
A

Append this column... and I guess I probably want to actually append them rather than just assign. So this is going to do the same thing, except it's not going to do the computation; it's just going to collect all the columns we care about. So, okay, that makes sense. We've done something good there, I think. We probably need to do this again, though, because we had a second loop.
A

So we have this other loop over here, and we probably have to do it again, because we need all the information to be the same. So we probably should create some more columns: we should probably create the flits_vc_resp columns, and then do the same thing for stalls_vc_resp, and we probably again will want to do these as appends instead. So we'll probably do .append and put this in.
A

And then do this for the stalls_vc_resp. So all I'm doing (not that, sorry), all I'm doing is just adding these to a list to keep track of them. So at this point we actually haven't done any computation; again, we've collected the things that, for a given iteration of this loop, we want to sum across. So that sounds good; that's actually really good. And so now...
A

Oops, sorry; excuse the language. Anyway: we're starting with aggregate_vcs, our original function, which I accidentally just deleted. And we can just use append, which is in-place, so that was even simpler. Now we can probably do the sums, but we don't want to do them inside these loops; we want to do them only where we need them, which is at the same level as this try. So let's do the sums, and maybe we'll put in the except and just pass, or whatever.
A

This is original information that we kind of have to capture, and we want to add it to these rows. F is flits, so flits_vc_req is going to be this req label. So we can probably say df[req_label], because we've defined the req label up here; we still have this. So we can do all the rows at once and say that df[req_label] is probably just going to be...
A

...the sum of all the columns for flits_vc_req, row-wise. That looks pretty good, so we can probably do the same thing for the flits resp label, except of course we'll have to use the correct columns, the flits_vc_resp ones. So that looks pretty good to me. We also have this new label, though: this new label is the combination of flits_vc_req and flits_vc_resp.
A

Now, presumably I could actually just compute this by summing the thing I've created with this other thing, because I've done that work already. In this case I'm going to make it a little more explicit (we can probably optimize this later) and say this is just the sum of these, because flits_vcx is the sum of both of these, and that's what this is going to be. So this is the sum of both of these.
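Making the combined column explicit, as described, is a single elementwise addition (the names here are assumed stand-ins for the req/resp sums):

```python
import pandas as pd

df = pd.DataFrame({
    "flits_vc_req":  [3, 5],
    "flits_vc_resp": [7, 9],
})

# The combined column is just the elementwise sum of the two sums we
# already computed -- one vectorized binary op over whole columns.
df["flits_vcx"] = df["flits_vc_req"] + df["flits_vc_resp"]
```

Reusing the already-computed req/resp columns like this is the optimization mentioned above; summing the raw columns again gives the same numbers at more cost.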
A

Now again, we could optimize this by using these already-computed sums, but for the sake of it we'll just keep it for now. And so now we have to do the stalls. We've got these, so we can probably do the same thing, because it's consistent logic. So we know we need this stalls req label, which we've created up here, just like before.
A

We know that, again, we're going to have to go to the stalls_vc_resp for the second one, and then the combined one is probably going to be the same as before as well, except we're going to use the stalls versions: we're going to put in the stalls_vc_req and the stalls_vc_resp. So this is looking pretty good.
A
This is the same computation, but we've done it in a way that is not operating row-wise; we're operating column by column, so we're using the entire data frame. This takes advantage of pandas', and really NumPy's, built-in vectorization. And the reason I started like this is, you might notice that nothing about this looks like it's on the GPU. The beauty of RAPIDS is that this isn't code that is specific to the GPU; I'm writing generic PyData code. And so hopefully, at the end of this,
A
at the end of this double for loop, which I am now at the same level of, I'm gonna return my data frame, and so I hope that when I run this I'm gonna get the same results as when I do this. So let's take a look and see what happens. I'll just make an even smaller sample; maybe I'll take two rows. This is our original function, and we can see that we added a bunch of things. Let's take our new columnar version. It's very likely
A
we made a mistake, and it's very possible we made a mistake, because when you port things interactively and iteratively, you often make mistakes. This function just takes in a data frame; that's all it takes in, and the docstring sort of explains the logic that we're trying to do. We'll take in the data frame, but in this case we'll just take the first two rows to be consistent, and there's our df.
A
So we can see that, okay, it looks like we might have done something incorrectly: we've got the same rows, but we've got different values. But wait a minute, these are different columns. The ordering of columns might be slightly different depending on how we did things; it's possible that we have changed the ordering of columns unintentionally. So let's actually make sure we're taking a look at the same columns. We'll save this as res_columnar, and we'll save this as res_original.
A
So this is the process of debugging. We'll go back to the original code and take a look, and we'll see. Okay; sorry, I have the original code open in a new tab and I'll look at it here. Maybe there is something going on that we missed. Is it possible that we don't have these strings formatted correctly, and that's why we're missing a column? We wouldn't expect to be missing a column.
A
So perhaps we have a logic error in the column operations. Maybe we are doing a sum that is not actually creating the column. Or maybe we're getting an error. So let's see if we got an error; perhaps some of these actually errored out. Looks like a bunch of them errored out. So let's actually see why these errored out. This is really important.
A
It's just part of the process. When you're working with someone else's code and trying to port it, in a live session or any session, it's not always clear what's happening. So we got some errors. Let's catch the error: let's just catch the generic exception as e, and let's actually print e and see what's going on.
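As a rough sketch of that catch-and-print pattern (the label-to-columns mapping here is made up, not the real counter labels), wrapping each per-label sum lets the loop keep going while telling us exactly which lookups failed:

```python
import pandas as pd

df = pd.DataFrame({"a_req": [1, 2], "a_resp": [3, 4]})

# Hypothetical label -> column-list mapping; the "b_*" columns
# deliberately don't exist, to mimic the failing lookups in the demo.
label_groups = {"a_total": ["a_req", "a_resp"],
                "b_total": ["b_req", "b_resp"]}

for label, cols in label_groups.items():
    try:
        df[label] = df[cols].sum(axis=1)
    except Exception as e:
        # Print the label and the exception instead of passing silently,
        # so we can see which column lookups went wrong.
        print(label, "->", e)
```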
A
We seemingly aren't able to index into this for really any of these rows; if we print out the exception, it looks like for all of these rows we're not able to index in. So why is that? Let's sort of try to understand that. So we've got these columns, and we know that this is going to be part of; sorry, not that. We've got these columns, and we know it's going to be something that comes from our data frame, and so let's see what this would be.
A
Okay, so this seems like it should have this, right? It seems like we should have this, so something is off. So what are the ones we don't have? We don't have this set. So why do we think this is the case? I wonder if perhaps it's because we're building the lists incorrectly. That seems like a very likely candidate; list handling is something that's easy to screw up. And there we go. So what have we just learned?
A
What I just did was look, just to make sure we had two columns. I saw that we did, so that eliminated all of the work that wasn't operating on combinations of columns. And then I double-checked that by getting rid of the lists. So in this case, presumably, instead of using a comma here, which gives us doubly nested lists, I just need to actually combine the lists directly. These are our lists.
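In other words (with hypothetical column names), the bug was that a comma builds a list of lists, while + concatenates them into the flat list of column names that DataFrame indexing expects:

```python
req_cols = ["flits_vc0_req", "flits_vc1_req"]      # hypothetical names
resp_cols = ["flits_vc0_resp", "flits_vc1_resp"]

# Buggy version from the demo: the comma nests the two lists, and
# indexing a DataFrame with a nested list raises a KeyError.
nested = [req_cols, resp_cols]

# Fix: concatenate with +, giving one flat list of column names.
combined = req_cols + resp_cols
```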
A
I can probably just combine the lists like this, and I suspect now we will not get any errors, but, you know, fool me twice. And there we go: success. So obviously, in the real workflow, we would not just use two rows to verify that this is correct, but it's nice to see that it looks correct, and the logic made sense, so it should be correct. We would, of course, verify this properly. And so at this point we'd say: okay, well, why did we do this?
A
We didn't just do this to improve the CPU code, which is nice; I mean, hopefully this has improved the CPU code. So let's actually take a look at the CPU code's speed, and then we'll actually run this on the GPU. So we have this query sample, and, let's say, we know it works now. So let's just take this again; let's take this farther down, where we have some space. What we're gonna do is maybe look at 100 rows, and run this, the original one, with a hundred rows.
A
And as expected, it's much faster. That's great, because we are also going to run this on the GPU in a second. So with 100 rows it took this long. I'm not gonna run the original anymore, because it's scaling linearly, so with a thousand rows we'd be here forever; but with a thousand rows, this new one should be pretty quick, less than a second. Plus, on the GPU: now, this code will run on the GPU. All of these APIs exist on the GPU; we can run them. So let's do it.
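A minimal sketch of that "same code, different library" point: if cuDF (the RAPIDS GPU DataFrame) is installed, the very same columnar lines run on the GPU; otherwise pandas runs them on the CPU. The column names here are hypothetical.

```python
try:
    import cudf as xdf      # GPU-backed DataFrame, if RAPIDS is available
except ImportError:
    import pandas as xdf    # CPU fallback; the code below is unchanged

df = xdf.DataFrame({"flits_vc0_req": [1, 2],
                    "flits_vc0_resp": [10, 20]})
# The exact same columnar sum as the pandas version.
df["flits_x"] = df[["flits_vc0_req", "flits_vc0_resp"]].sum(axis=1)
```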
A
So the first time we run this, it can take a little bit longer, because it's gonna get compiled. So we'll run it again to avoid the JIT compilation. So it's 1.8 seconds; it's actually a little slow. With a thousand rows, the pandas version was faster, and so why is that? Well, it's the way we ported this: we are doing 40 iterations, and within each iteration we're making a call to sum six times.
A
So we're doing 240 separate kernel calls, no matter how many rows we have. If we do one row, we're making 240 separate kernel calls; if we do 1 million rows, it's the same thing. So there's an overhead to those kernel calls. So let's see what that means. Let's go to 10,000 rows. The pandas version is using vectorization, so it's gonna be much faster than the original.
A
In fact, it will hopefully do ten thousand rows in about ten seconds or so. The pandas one took less than a second for a thousand, and now it took about ten seconds. So it looks like when we scaled up from a thousand rows to 10,000, scaling up by a factor of ten, our time scaled roughly linearly, by a factor of ten. What about on the GPU? We scaled significantly, significantly better: we can keep this time way down, roughly, in this case, 1.8 seconds again.
A
What about a hundred thousand? I'm not gonna run this on the CPU, because I can tell you that it scales linearly; we know that, and it's gonna take 120 seconds, so I'm not gonna waste the time. But with a hundred thousand rows the GPU version will still take two to three seconds, and with two hundred thousand rows the GPU version will still take very little time. And so the amount of time it's actually gonna take depends on how you ported it.
A
We know from above that this is not the optimal port, and we could probably optimize it. We've already done some of these calculations; we don't need a kernel call here. We can actually do a binary operation between these two sums that we've created. We could also potentially even unroll some of these loops, if we're clever about it. But in general, we just went from a thousand rows in 12 seconds with pandas, with the improved CPU version, to 200,000 rows in four seconds with the GPU.
A
Now, obviously, that's great: this whole workflow that we were estimating was gonna take two and a half hours, we can actually do in like 30 seconds, which is just awesome. But it's also great because this code scales. I'm not gonna pull it up on a big cluster, but I'm just gonna show you, to demonstrate, because we're almost out of time, that we can put this in a Dask data frame. So I'm using the Dask data frame API with the data frame that we just used.
A
It's gonna ask me to set a number of partitions. This is a fairly sizable data set; it's got 200,000 rows, so it's not too big. I'll put this in ten partitions; it's not super important, we just need to do it. So this is a Dask data frame. We can run this same code on that Dask data frame, because none of these APIs are anything unique; this is Dask compatible. So, you know, Taylor mentioned earlier that his actual workload is in the terabytes of data.
A
This will run on the data frame. Now, the data frame has ten partitions, and because we're not using any parallelism, it's going partition by partition. So in this case, using Dask like this is actually gonna be slower. But if we had a lot of GPUs, we could use this and split up this work incredibly efficiently. If I just used one partition, it would be as if we were using the cuDF data frame rather than a Dask
A
cuDF data frame, and it would be very quick. But I wanted to sort of show that this same code will run on the GPU, with both cuDF and with Dask. This is taking too long and I'm being impatient, so I'm gonna actually recreate this as a single-partition data frame and run this again, because I'm impatient. But in general, Dask adds some overhead, because it has to do orchestration, and it's gonna do these sums a little bit differently; but it's gonna run, and it's actually
A
It seems to be adding a decent bit of overhead in this case; it's running, and it's taking a little bit more time because of the overhead. But we could also run this with the map_partitions API and just pass this function. That's a way to pass functions down to the underlying objects in memory, and that was just as fast. And so the result of this is a data frame, but now it's a data frame that has our results in it, if we call persist.
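Conceptually, Dask's map_partitions hands each partition's underlying pandas or cuDF object to the function. Here is a rough pandas-only sketch of that split-apply-concat idea, with made-up column names and without Dask itself:

```python
import pandas as pd

def add_totals(pdf):
    # The ported columnar function: it sees one whole partition at a time.
    pdf = pdf.copy()
    pdf["total"] = pdf[["req", "resp"]].sum(axis=1)
    return pdf

df = pd.DataFrame({"req": range(6), "resp": range(6)})

# Mimic three partitions: slice, apply the function to each, re-concatenate.
partitions = [df.iloc[i:i + 2] for i in range(0, len(df), 2)]
result = pd.concat(add_totals(p) for p in partitions)
```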
A
Eventually it will finish; well, there's the risk of live coding. But in general this pattern will work, and you can see that Dask is adding some overhead, but it's allowing it to succeed. And so, yeah, that's how we would take this workflow. We didn't port the second portion yet, because we ran out of time, but we ported the most important part, the one that took up 95 to 99 percent of the time, and we made it go from two and a half hours, or, you know, hours, to seconds.
A
So I think, in general, we always want to operate on columns, not rows. If we can distill the logic down, I would say something like: try to rethink your operations to operate on columns rather than rows. Doing that lets you use existing APIs, and that is often gonna help your CPU code too. We saw down here that by restructuring this, we can improve the pandas version quite a bit.
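That takeaway in miniature, on toy data rather than the real counters: the row-wise loop and the columnar operation compute the same thing, but only the second one uses the vectorized fast path.

```python
import pandas as pd

df = pd.DataFrame({"req": [1, 2, 3], "resp": [10, 20, 30]})

# Row-wise, like the original workflow: one Python-level pass per row.
totals = []
for _, row in df.iterrows():
    totals.append(row["req"] + row["resp"])
df["total_rowwise"] = totals

# Columnar, like the port: a single vectorized operation over whole columns.
df["total_columnar"] = df[["req", "resp"]].sum(axis=1)
```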
A
It doesn't completely solve the problem, because it's still gonna take too long; it took ten seconds for ten thousand rows, versus four seconds for hundreds of thousands on the GPU, and that's not gonna cut it. But it still made an improvement, and with GPUs that improvement is a thousandfold. So that's why it's so important; that's number one. Number two: don't try to over-optimize before you get some wins. You know, you don't move a waterfall all at once.
A
It's inch by inch: every year the waterfall moves backward like five inches. That's what happens. You can still get big speedups by making the quick changes that make sense. We did this all in the span of, realistically, about 30 to 40 minutes. We could, of course, optimize this further, and in fact, if I had actually done that first, I probably wouldn't have made the mistake of trying to combine two lists incorrectly.
A
But little syntax errors happen. Don't over-optimize. Maybe eventually we could say: well, do we even need these? Do we have to do all these separate sums? Can't we do this in some other way that combines these kernel calls and fuses them? Maybe yes, maybe no. We don't know that until we try, but we don't need to try unless it's necessary: this port already makes the entire workflow, on like three, three and a half million rows, run in less than a minute-ish.
A
So maybe that's fine. Maybe we don't need to get faster than that; going from two and a half hours to a minute might just be enough. And so don't try to over-optimize. Number three would be: actually measure correctness. We didn't do that here; we kind of just eyeballed it, because we're doing a demo. We were just eyeballing it, but it's important to do more than just say: oh yeah, these are the same, these are the same.
A
These are the same. Actually measure correctness. We actually put out a blog semi-recently on the RAPIDS Medium page about measuring correctness in workflows. I'll pull it up quickly if I can find it, but if not, no issue. Yeah, so here's an example for doing this, for verifying correctness. There's a bunch of things you're gonna want to check: things like, are the types the same? Is the index actually the same? Are the columns in different orders? Are you perhaps using different precision?
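One sketch of what "actually measure correctness" can look like in pandas: pandas.testing.assert_frame_equal checks values, dtypes, and index in one call, and aligning the column order first catches the reordering issue from earlier. The tiny frames here are made-up stand-ins for res_original and res_columnar.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

res_original = pd.DataFrame({"req": [1, 2], "resp": [3, 4]})
res_columnar = pd.DataFrame({"resp": [3, 4], "req": [1, 2]})  # reordered cols

# Align the column order first, then compare values, dtypes, and index
# in one call instead of eyeballing printed output.
aligned = res_columnar[res_original.columns]
assert_frame_equal(aligned, res_original)
```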
A
Now, it's important to think about that, and so we put together a checklist; I'll share this in the chat as well. But make sure that you really do measure correctness, because if you're trying to actually solve a research problem, or an infrastructure problem in this case, there's nothing worse than thinking you solved it faster when you actually haven't. And then one thing I would say is: come back with a fresh mind. That's probably the fourth thing. Right now, we've done a good job.
A
Hopefully. We made a couple of mistakes, and we found them pretty quickly. Eventually we might find a new way to come back to this, but we don't want to belabor the point. If we can get a win, we got a win, and that's great. This workflow is now fully on the GPU. Obviously, we didn't port the second function here, but we did port the second function with Taylor originally, and again, the way we ported the second function looks quite similar to this.
C
We're also going to email that to everybody, but I do realize that, for a lot of people on the call, it's getting late. It's getting late for Nick; it's getting late for Zahra, if she's still on. But if you do plan to drop off here at the break, just watch for that email. We would really encourage people to stick around for the last part, especially because there are probably a lot of questions and answers. But yeah, during the break, be thinking about those, and I'll reconvene us maybe at 3:25.
A
And just a note, while another break has just started: I just ran this on a million rows, just for the sake of it. We did a million rows. I'm sorry, actually, that was only a hundred thousand; excuse me, never mind, I am incorrect. But I will run this on a million rows and we'll see how long it takes. But
A
stay tuned. So in a different scenario, we would obviously not be doing all these copies; we wouldn't be doing it like this, we would just read it in. But in this case: 1 million rows, columnar, not perfect. We could have optimized it more. As we said, this is a waste of computation; it is objectively a waste. We could just do this with addition, because we've already done both of these sums. But that's okay, even without optimizing.
A
Ten thousand rows is gonna take ten seconds on the CPU; with a million rows, the GPU takes half that time. So we did it in, you know, fifty percent of the time; or, in one third of the time, we could do a hundred times as much compute. So that's, loosely, a three-hundred-fold speedup, which is awesome. That's just really exciting, and that's all I got.