From YouTube: Optimizing Data Preprocessing in Cosmo-3D: Developing Efficient Pipelines with DALI & Nsight Systems
Description
Thorsten Kurth from NVIDIA presents a talk on Optimizing Data Preprocessing in Cosmo-3D: Developing Efficient Pipelines with DALI and Nsight Systems. Recorded live via Zoom at GPUs for Science 2020. https://www.nersc.gov/users/training/gpus-for-science/gpus-for-science-2020/ Due to some data loss, this recording is missing the start of the talk. Session Chair: Michael Rowan.
A
Here, basically, you do a lot of convolutions that extract feature maps, and those you then decode using inverse convolutions or some other upscaling. These green lines tell you that you extract the outputs of certain levels of this encoder stage and feed them back into the decoder.
A
So this is a very generic setup, and they tweak it for their specific purpose. The important part is that it is all 3D, and 3D usually means you have more memory movement, and that is what makes it challenging. The input data consists of these simulation outputs, and what they use is a big cube of data: 1024 cubed grid sites times the number of features. For the n-body part the features will be the density and the velocity in 3D.
A
For the target it will be these four plus the temperature, so the total size of this thing is 38 gigabytes. What they now want to do is crop out sub-volumes of 256 cubed, which they essentially feed to the neural network, because the whole thing doesn't fit, and they also hope to get some effect through the sampling, like better convergence, and that the network learns features like translational invariance and things like that.
A
They also want to do data augmentation, which means that they don't only want to crop out this block, they also want to rotate it. The symmetry group they have is the octahedral group, so all sorts of cubic rotations and essentially also reflections they can apply to virtually increase their data set size, and then they load this onto the GPU. Okay, so that sounds pretty easy, and you can readily go and code this up, for example, in NumPy and PyTorch.
A
PyTorch itself has a nice data loader framework, and the rest you can do in NumPy. So what do they do? They basically load all the data into memory, which is a full copy per process, so it's 38 gigabytes per process. If you are on a box with eight GPUs, you have eight times that in memory. Then you apply a random crop and rotation and then copy the buffer, the cropped buffer, into a PyTorch tensor.
A
This is important, because when you slice data in NumPy, it will create a handle to the original data and modify the strides; it will not create a copy. But PyTorch doesn't have all the information NumPy has, and what PyTorch wants is a contiguous buffer with canonical strides.
A
So what you need to do is create an additional data copy and return it as a PyTorch tensor, and then PyTorch can use it.
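As a rough sketch of that naive CPU-side loader (not the authors' actual code; the array name `full_cube`, the channel ordering, and the dataset length are made up), the crop is only a NumPy view, and np.ascontiguousarray is what forces the extra copy PyTorch needs:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class NaiveCropDataset(Dataset):
    """CPU-side crop + rotate, roughly as described in the talk (hypothetical names)."""
    def __init__(self, full_cube, crop_size=256, length=1000):
        self.data = full_cube          # shape (C, 1024, 1024, 1024), one full copy per process
        self.crop = crop_size
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        c = self.crop
        # random crop origin
        x, y, z = (np.random.randint(0, s - c + 1) for s in self.data.shape[1:])
        view = self.data[:, x:x + c, y:y + c, z:z + c]   # a strided view, not a copy
        # random octahedral-group element: 90-degree rotation in a random plane + optional flip
        axes = tuple(np.random.choice([1, 2, 3], size=2, replace=False))
        view = np.rot90(view, k=np.random.randint(4), axes=axes)
        if np.random.rand() < 0.5:
            view = np.flip(view, axis=np.random.randint(1, 4))
        # PyTorch needs a contiguous buffer with canonical strides -> explicit copy
        return torch.from_numpy(np.ascontiguousarray(view))
```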
So that is what they had, and I looked at the I/O performance of this thing. This is how it looks, measured in samples per second, when you scale out to 16 GPUs on a DGX-2. That's important here.
A
I just want to see how the intra-node scaling is, because when you scale out, you share certain resources, such as the CPU resources. As you can see, it falls over at some point, and the potential reasons for that could be that you saturate the memory bandwidth from the CPU to the GPU and that you have some cache contention, because the random rotations they apply basically consist of reflections in memory and transpositions, so they're pretty expensive.
A
So I think what you have here is two things: you maybe run out of thread parallelism, and you share a lot of memory resources between these processes. So that is not very good. What you of course have, when you scale out in that way on the box, is more GPUs. So why not put this stuff on the GPU? Then you don't have these issues, right?
A
So how do you do that? It is pretty simple. You can create GPU-enabled pipelines with a piece of software called DALI. It's NVIDIA software which allows you to build pre-processing pipelines for any data analytics workflow, not only deep learning, though it mainly targets deep-learning-like applications. What you can do is build the pipeline for CPU, GPU, or mixed, so you can run part of it on the CPU and part of it on the GPU, whatever you think works best.
A
It supports various input formats, like certain image formats, videos, sound, what have you, and you have broad framework support. The DALI pipeline you build is pretty much framework agnostic; they provide wrapper functions to feed it into PyTorch or TensorFlow and things like that, and it comes pre-installed with the NGC containers you find on NVIDIA GPU Cloud.
A
It is also open source, so they have a GitHub page. If you have a feature request or a bug report or something, you can file it there and they might look at it at some point; they are pretty responsive, so it's pretty nice. Okay, so I did that, I just used it, and the first thing you need to do is fairly boilerplate.
A
You first look at what kind of input readers DALI offers, and you see that for this specific purpose there isn't any, so they offer something called an external source, which allows you to feed arbitrary data into your DALI pipeline. What you need to do is create an iterator which feeds your pipeline, and then the pipeline will take it from there to feed into your network. So that is what I do. It's basically the same thing as before, but without the rotation stuff.
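A minimal sketch of such a feeding iterator, assuming a channels-last NumPy array `full_cube` (hypothetical name, channels last because the rotations downstream expect the feature dimension last); DALI's external source only needs something it can iterate to get the next batch of host arrays:

```python
import numpy as np

class CropSource:
    """Yields batches of random 256^3 crops (channels-last) to feed DALI's external source."""
    def __init__(self, full_cube, crop_size=256, batch_size=1):
        self.data = full_cube                    # shape (1024, 1024, 1024, C)
        self.crop = crop_size
        self.batch_size = batch_size

    def __iter__(self):
        return self

    def __next__(self):
        c = self.crop
        batch = []
        for _ in range(self.batch_size):
            x, y, z = (np.random.randint(0, s - c + 1) for s in self.data.shape[:3])
            batch.append(np.ascontiguousarray(self.data[x:x + c, y:y + c, z:z + c, :]))
        return batch                             # a list of samples = one batch
```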
A
In this case, we use a number of rotation operators. We want to sample the octahedral group, and we do this by rotating by a random angle about one of the three axes. Okay, so that is that, and then you need to tell DALI how these ops are tied together, so you need to define what is called a graph, and that is done here. We take this external source, so we take this input and target.
A
We upload it to the GPU, because we want to use it as a GPU pipeline, and then we pipe it through all three rotations step by step, and then we need to add an additional transposition. The reason for this is that DALI wants the feature dimension to be last for the rotations, whereas PyTorch wants it to be first. So this is a limitation of both frameworks in that sense, and they don't match.
A
So you need to add this additional transposition, but you can hope that this is just a memory movement of a hundred megabytes or so, so it should be fine, and then you return it. That's it. It looks pretty much like an iterator: you define this graph, DALI will take it, it will compile it, and from there it will basically create an efficient pipeline.
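A rough sketch of such a graph, using DALI's functional API (nvidia.dali.fn); the operators exist in DALI, but the exact arguments (for example the 3D axis form of fn.rotate) may differ between versions, so treat this as an illustration rather than the talk's code:

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def

@pipeline_def(batch_size=1, num_threads=4, device_id=0)
def cosmo_pipeline(source):
    # external_source feeds arbitrary host data (here: channels-last 3D crops)
    data = fn.external_source(source=source, layout="DHWC")
    data = data.gpu()                                  # move the batch to the GPU
    # sample the octahedral group: a random multiple of 90 degrees about each axis
    for axis in ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)):
        angle = fn.random.uniform(values=[0.0, 90.0, 180.0, 270.0])
        data = fn.rotate(data, angle=angle, axis=axis, keep_size=True)
    # DALI's rotate wants the feature dimension last, PyTorch wants it first
    data = fn.transpose(data, perm=[3, 0, 1, 2])
    return data
```

You would then build it with something like `pipe = cosmo_pipeline(source=CropSource(full_cube))` followed by `pipe.build()`.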
A
It will also apply some optimizations; it will figure out what can run in parallel and what can't. Once you have this pipeline, it is framework agnostic, so you can hook it into anything, and in this case you want to use it in PyTorch. So what do you do?
A
You import the generic iterator from the PyTorch plugin of DALI, which is shown here. On top of that (you don't need to do this, but you can) I wrote a wrapper around this iterator to make it a drop-in replacement for the PyTorch iterator I had, because the DALI generic iterator is accessed a bit differently.
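A minimal sketch of that hookup, assuming the pipeline and source sketched earlier; DALIGenericIterator lives in nvidia.dali.plugin.pytorch and yields a list of dictionaries per step, which is what the thin wrapper hides (constructor arguments vary a bit across DALI versions):

```python
from nvidia.dali.plugin.pytorch import DALIGenericIterator

pipe = cosmo_pipeline(source=CropSource(full_cube))   # pipeline and source sketched above
pipe.build()

# maps the pipeline outputs onto named PyTorch tensors
dali_iter = DALIGenericIterator([pipe], output_map=["data"])

class DaliTorchLoader:
    """Thin wrapper so the DALI iterator looks like the original PyTorch data loader."""
    def __init__(self, it):
        self.it = it

    def __iter__(self):
        for batch in self.it:
            # batch is a list with one dict per pipeline, keyed by the output_map entries
            yield batch[0]["data"]
```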
A
That being said, this is now a drop-in replacement for what I had, but it does a lot of the work on the GPU. This was the original performance of the pipeline, and with the new one it looked like that. There are still some scaling issues, potentially from when you read the data from memory, but the overall performance is much faster, because most of the expensive stuff is now on the GPU. So I was thinking, okay, great, I'm done here, right?
A
So let's just connect the whole problem to it, basically hook it into the framework, and see what happens. So I did this. This is the original workflow scaling: samples per second drops by a lot, but of course you now have this expensive training step. And with the DALI pipeline it looks the same. So I was a bit frustrated here. I was like, gosh, we did all this work to make this pipeline look much better, so why is that?
A
The first reason could just be that I/O was not the issue here. So I was like, okay, let's optimize the whole thing now, and the easy thing you can do is use mixed precision. The original network was run in FP32, so basically float or single precision, and PyTorch supports something called automatic mixed precision, or AMP, natively when you go to version 1.6 and higher.
A
So when you check out the recent master branch, you will get it, and it allows you to offload certain operations to the tensor cores. I'm on a Volta system here; if you use Ampere it's the same thing. It basically enables tensor core support for convolutions and dense operations, and since this network is mostly convolutional, this will help a lot.
A
It does some additional stuff; it doesn't just cast everything into FP16. It does some very careful things to guarantee numerical stability.
A
It will keep copies of the master weights in FP32, and usually, if you apply this kind of thing to deep learning problems, you will see a speedup with no degradation in accuracy, so that is kind of good. On the right-hand side you see how this is done: it's basically a matter of adding a few lines of Python code, and the most important parts are these.
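The slide itself is not reproduced here, but a minimal sketch of native PyTorch AMP (torch.cuda.amp, available from 1.6) looks roughly like this, with a stand-in model and random data in place of the real network and loader:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

device = torch.device("cuda")
model = torch.nn.Conv3d(4, 5, kernel_size=3, padding=1).to(device)   # stand-in for the U-Net
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.MSELoss()
scaler = GradScaler()            # scales the loss so FP16 gradients stay representable

for step in range(10):
    inputs = torch.randn(1, 4, 64, 64, 64, device=device)            # stand-in batch
    targets = torch.randn(1, 5, 64, 64, 64, device=device)
    optimizer.zero_grad()
    with autocast():             # eligible ops (convolutions, matmuls) run in FP16
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)       # unscales gradients, skips the step if they overflowed
    scaler.update()
```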
A
Let's profile it, and for that you can use nsys, or Nsight Systems. It has this nice tracing functionality where you can trace CUDA calls, which is standard, or cuBLAS, but also NVTX, and NVTX is a library which allows you to annotate code with your own ranges.
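As an illustration (not the talk's exact code), NVTX ranges can be emitted from PyTorch with torch.cuda.nvtx and picked up by Nsight Systems; the model and script name below are placeholders:

```python
import torch

device = torch.device("cuda")
model = torch.nn.Conv3d(4, 5, kernel_size=3, padding=1).to(device)   # stand-in model

for step in range(10):
    torch.cuda.nvtx.range_push("data_loading")    # named range on the Nsight timeline
    inputs = torch.randn(1, 4, 64, 64, 64, device=device)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("forward")
    outputs = model(inputs)
    torch.cuda.nvtx.range_pop()
```

A profile would then be collected with something like `nsys profile --trace=cuda,nvtx,cublas -o report python train.py`.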
A
On the right-hand side you can see how you put these ranges in and tell it to annotate them so that you see them on the timeline, and this is how it looks when you profile it. As you can see in the middle of the screen, this gray stuff, you see the forward pass, the backward pass, and all the convolution calls made by PyTorch. And what you also can see is this stuff here; this is bad. This is DALI, and it's actually not overlapping with the forward and backward pass. Okay.
A
So I was like, what is going on here? You see this big gap between the iteration steps, so that is also bad, and another thing you can see is that there's this memcpy async in the way, spanning the whole block, and that should trigger your alarm. It's nicely colored in red, because this is not great.
A
When you look into this, what happens is that at the end of this long cudaMemcpyAsync region there is a memcpy from device to host into pageable memory, and that is blocking. So there's basically an async memcpy into pageable memory; don't do that. Now the question is what is happening here, and since this only happens when you enable AMP, I was talking to the guy who implemented this AMP part, and it turns out that this one line is responsible for it. This is kind of funny.
A
We basically use pinned memory, we download it into pinned memory, and this turns out to hit a rather brutal PyTorch bug: if you ask for a non-blocking copy, it will not create a pinned CPU buffer to copy non-blocking into. So basically non-blocking copies in PyTorch can turn out to be blocking if you do not pin the buffer manually. So there is a big bug.
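A minimal sketch of the difference, with made-up tensor sizes: only a pinned (page-locked) host destination lets the device-to-host copy actually run asynchronously.

```python
import torch

device = torch.device("cuda")
src = torch.randn(256, 256, 256, device=device)

# Pageable destination: the "non_blocking" copy silently becomes a blocking memcpy.
pageable_dst = torch.empty_like(src, device="cpu")
pageable_dst.copy_(src, non_blocking=True)           # still synchronizes under the hood

# Pinned destination: the copy can genuinely run asynchronously on a stream.
pinned_dst = torch.empty_like(src, device="cpu").pin_memory()
pinned_dst.copy_(src, non_blocking=True)
torch.cuda.synchronize()                             # wait explicitly when the data is needed
```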
A
We fixed this manually in this case, and I think there is a discussion around how to fix it in general. Okay, so with these fixes we can profile again, and as you see now, great, DALI is overlapping. This red cudaMemcpyAsync is just gone, and there's this cudaEventSynchronize, but that is fine.
A
That is what is supposed to be there. Okay, but you still see there is this gap, and it turns out that most of the I/O is synchronous. This is because DALI is careful: because I feed external input, DALI makes no assumptions about thread safety and memory coherence and so on. It will assume that it has to copy the data in the main thread before it can do anything with it. So I was like, okay, that is stupid, right? I mean, DALI is very careful here, and it's the right thing to do.
A
Maybe, but it doesn't help a lot with performance, so what you need to do is hide part of that overhead. So what I added is double buffering: basically, I have two buffers I can copy the data into from memory, and I do it using concurrent futures in Python. So there's a Python module, concurrent.futures, with which you can say: copy the data now. So before you process the current batch, you prefetch the next one.
A
Then, the next time, before you process that batch, you wait for it to arrive, and that is very easily done with a few lines of code. I don't want to go into detail here; it's technically a slight modification to what I had before, and you need, of course, to have a second buffer around, because you need to keep the current and the next batch in separate buffers.
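A minimal sketch of that double buffering with concurrent.futures, where load_batch and process are hypothetical stand-ins for the host-side copy and the training step:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def load_batch(buf):
    """Hypothetical stand-in for the expensive host-side crop/copy into buf."""
    buf[...] = np.random.rand(*buf.shape).astype(np.float32)
    return buf

def process(batch):
    """Hypothetical stand-in for the training step consuming the batch."""
    return batch.mean()

buffers = [np.empty((4, 64, 64, 64), dtype=np.float32) for _ in range(2)]
executor = ThreadPoolExecutor(max_workers=1)

future = executor.submit(load_batch, buffers[0])   # prefetch the first batch
for step in range(10):
    current = future.result()                      # wait for the prefetched batch to arrive
    # kick off the copy of the next batch into the other buffer before processing
    future = executor.submit(load_batch, buffers[(step + 1) % 2])
    process(current)
```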
A
Okay, so that was the original scaling, and with these fixes you actually get a little bit better performance; this is the orange line. You can also see the yellow scaling here, which is just with the AMP fix, without overlapping the I/O, and this is because, even when you fix the AMP issue and DALI starts overlapping, the block it previously overlapped with is the still-synchronous part of the I/O.
A
So it overlapped with something, just not the stuff we wanted it to overlap with; it overlapped with the I/O, so we couldn't see a benefit there. But once you overlap part of the I/O too, you see a benefit. When you profile it, though, it's still not great: there is still this gap, and it turns out, as I said, that DALI still does a copy of the buffer I provide into internal storage, and that you cannot get around without modifying DALI accordingly.
A
So basically we are stuck here, and we need to wait for more DALI features to arrive to fix this.
A
So what I presented was hopefully somewhat educational: step by step, how I optimized a pre-processing pipeline for a PyTorch deep learning application, and how you can use Nsight Systems to analyze your timeline and see where you lose performance and which calls should be asynchronous but actually are not. We also found certain DALI limitations.
A
Some of those are going to be fixed soon: they will add a zero-copy option, for example, to this external source, where you, as a user, have to guarantee that this buffer won't go away for the next couple of steps. In my setup this is the case, but you have to be careful when you do this. What I also demonstrated is that, with just a few lines of Python code, you can actually get a lot of speedup.
B
Thank you. Thank you, Torsten. It's very interesting to hear about how you're using Nsight Systems for this optimization workflow. Do we have any questions in the chat?
A
No, I first looked at the I/O part, so the Python part looks pretty much optimized to me. I think everything that can overlap does overlap. I mean, we found this issue in the PyTorch AMP, so that is one thing you can see here.
A
It really helps you to understand which of your function calls are actually sequential or not; things are not always as they seem. Normally, you should look.