SC20 Deep Learning at Scale Tutorial
https://github.com/NERSC/sc20-dl-tutorial/
A
Hi everyone, and welcome to the tutorial. My name is Josh Romero, a devtech engineer from NVIDIA, and I will be co-presenting this talk with my colleague, Thorsten Kurth. In this talk, we will cover some techniques to optimize the performance of your deep learning codes on NVIDIA GPUs. As the demo material is in PyTorch, the talk will use PyTorch code examples, but the concepts and best practices we will cover are general across frameworks. To begin, here is a brief outline of the talk.
A
Finally, in the last section of the talk, I will cover methods to improve compute utilization, with a focus on accelerating training using mixed precision and enabling Tensor Cores. Using a profiler is an essential step in optimizing any code, and deep learning is no different. Nsight Systems is a profiling tool developed by NVIDIA for this exact purpose. Nsight Systems is a tracing tool that can generate a visual timeline of your code, giving you an easy-to-digest, high-level overview of your workload.
A
Loading the profile from a simple PyTorch training script in the Nsight Systems GUI looks like the image to the right. There are a number of rows here, split into a section with CPU work on the top and GPU work on the bottom. In the CPU section, you'll see rows for all active CPU threads, and within each you will find entries for CUDA API calls like cudaMalloc, cudaFree, cudaDeviceSynchronize, and kernel launches, as well as operating system calls. For simplicity, this profile image doesn't include operating system calls, but those are enabled by default.
A
While the raw profile provides loads of useful information, it can be difficult to decipher what this information means without adding additional context. To add that context, we can add NVTX ranges to annotate the timeline. In PyTorch, there are convenient functions available to push and pop named ranges, and also a context manager, torch.autograd.profiler.emit_nvtx, which will automatically add NVTX ranges to annotate model layers. NVTX ranges can be used to annotate large, general code sections like steps and epochs, but can also be used in a targeted fashion to identify sources of stalls and idle GPU time.
A
Here is an example of a simple training loop in PyTorch over some number of epochs, and here is that loop with NVTX ranges added to annotate the epoch boundaries, the training step boundaries, and the exposed time for the data loader to return a training sample. I say exposed time here to emphasize that, typically, the data loader will prefetch data; in an ideal circumstance, by the time next is called on the data loader, the sample is fully prefetched and the time it takes should be negligible.
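The slide's code isn't reproduced in this transcript, but a minimal sketch of such an annotated loop, assuming placeholder model, optimizer, criterion, and train_loader objects, might look like this:

```python
import torch

def train(model, optimizer, criterion, train_loader, device, num_epochs):
    for epoch in range(num_epochs):
        torch.cuda.nvtx.range_push(f"epoch {epoch}")
        torch.cuda.nvtx.range_push("data loading")           # exposed data-loader time
        for step, (inp, target) in enumerate(train_loader):
            torch.cuda.nvtx.range_pop()                       # close "data loading"
            torch.cuda.nvtx.range_push(f"step {step}")
            inp, target = inp.to(device), target.to(device)
            optimizer.zero_grad()
            loss = criterion(model(inp), target)
            loss.backward()
            optimizer.step()
            torch.cuda.nvtx.range_pop()                       # close "step"
            torch.cuda.nvtx.range_push("data loading")        # time until the next sample arrives
        torch.cuda.nvtx.range_pop()                           # close the trailing "data loading"
        torch.cuda.nvtx.range_pop()                           # close "epoch"
```

Wrapping the call to this function in `with torch.autograd.profiler.emit_nvtx():` additionally annotates the individual model layers.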
A
Here's what the profile looks like with NVTX ranges added. The section labeled NVTX in the profile, with the gray boxes, shows all the annotated ranges captured in the profile, with a label to identify them. Compared to the unannotated profile, adding NVTX ranges provides a lot more context and enables you to better focus your optimization efforts.
B
After this short profiling introduction and obtaining your first profile, you might ask yourself which parts to focus on first. In my experience, in many cases this is actually the data loading: how do you load the data from disk or from memory onto the GPU in the fastest and most efficient way?
B
So, on the right-hand side, we have an NVTX-annotated profile, and you see all this blue stuff, which are basically CUDA kernel calls. Those are fine, and there's not much you can do about them. What you can also see here is that between the backward pass and the start of the next forward pass there is a gap, and the NVTX markers show you some turquoise or green regions. These are actually memory copies, host-to-device memory copies of the input data being fed to the GPU.
B
The good news is, though, that most frameworks come with efficient tools for mitigating this problem, basically helping you to feed your data to the GPU in a very fast and efficient way, and most of them allow you to work with arbitrary Python code. For example, you can use your preferred file format, like h5py, netCDF, or Pillow for images, what have you. However, there are still some caveats and pitfalls.
B
So let's look at this example. It comes from cosmology: we have this huge volume from a cosmological simulation, and we cannot feed it in full to the network we're planning to train. However, physics helps you here: because of the translational and rotational invariance of space, you can basically crop out a sub-volume, rotate it, and feed it to the GPU. That is what we want to do here: crop it out, feed it to the GPU, rotate it, and then feed it to the neural network.
B
Let's see how you implement this. This is a random crop-and-rotate dataset based on the PyTorch Dataset class, and it first implements some scaffolding, keeping track of class parameters like the cropping size, the random number generator, and things like that.
B
The most important part of this is the __getitem__ function. In it you can use any Python function to open the file, load data from it, and process it, for example with NumPy. The only important thing is that you return the result as a torch tensor. In this case we return two torch tensors, because this is from an image segmentation problem where you have a target and an input.
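The slide's implementation isn't reproduced here, but a hedged sketch of such a dataset, assuming the whole simulation volume and its labels fit in host memory as NumPy arrays, could look like the following:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class RandomCropRotateDataset(Dataset):
    """Illustrative sketch: random crops plus rotations from one large volume."""

    def __init__(self, volume, labels, crop_size, length, seed=333):
        self.volume = volume          # large 3D NumPy array from the simulation
        self.labels = labels          # segmentation target of the same shape
        self.crop = crop_size
        self.length = length          # number of random crops per epoch
        self.rng = np.random.default_rng(seed)

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # pick a random corner, crop out a sub-volume, apply a random 90-degree rotation
        z, y, x = (self.rng.integers(0, s - self.crop) for s in self.volume.shape)
        data = self.volume[z:z + self.crop, y:y + self.crop, x:x + self.crop]
        target = self.labels[z:z + self.crop, y:y + self.crop, x:x + self.crop]
        k = int(self.rng.integers(0, 4))
        data = np.ascontiguousarray(np.rot90(data, k, axes=(1, 2)))
        target = np.ascontiguousarray(np.rot90(target, k, axes=(1, 2)))
        # the important part: emit torch tensors (input and target)
        return torch.from_numpy(data).unsqueeze(0).float(), torch.from_numpy(target).long()
```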
B
Once you have this class, you create an instance of it and wrap it in a DataLoader object, which we see here on the lower left. This DataLoader object takes some performance-relevant parameters. First, there is the batch size: this basically tells the data loader how many samples are in a batch, and it will do the batching for you.
B
num_workers tells the data loader to spawn that many subprocesses to load the data. The default is zero, which means everything is loaded in the main process. If you specify a number bigger than zero, it will spawn worker processes that load data concurrently with the main process.
B
The worker_init_fn function allows you to set up each specific worker, for example setting the random seed per worker. And the pin_memory option is very important: it pins the host memory for fast host-to-device transfers. You should always enable this if your dataset emits CPU-based tensors.
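Putting these together, a hedged sketch of the DataLoader setup, reusing the hypothetical dataset instance from above, might be:

```python
import numpy as np
from torch.utils.data import DataLoader

def worker_init_fn(worker_id):
    # give every worker process its own random seed
    np.random.seed(1234 + worker_id)

train_loader = DataLoader(
    dataset,                      # instance of the Dataset class above
    batch_size=16,                # the loader assembles batches of this size for you
    shuffle=True,
    num_workers=4,                # subprocesses loading data concurrently (0 = main process only)
    worker_init_fn=worker_init_fn,
    pin_memory=True,              # pinned host memory for fast host-to-device transfers
)
```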
B
The issue is that this can lead to some very complicated data-sharing issues. For example, consider the right-hand side: this is a perfectly valid single-process dataset reader for an HDF5 file. In the __init__ function we open the file and extract the dataset from it.
B
MPI doesn't really like that. MPI doesn't like it when you fork processes. Some MPI implementations are more resilient to it, but others will barf, maybe crash, and display a long segfault message on your screen related to the data loader, and you might wonder what on earth is going on. So, if you're using MPI with your code, you might need to work around this, and fortunately there are some tricks you can play. For that I have prepared this link; it's a long article.
B
You should maybe read it, but it basically tells PyTorch, or rather the multiprocessing module that PyTorch uses to fork these worker processes, to use a different fork or spawn method, and that can help to mitigate some of these issues; not all, but some.
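As a hedged sketch (see the linked article for the details and caveats; `dataset` is again the placeholder from above), switching the worker start method can look like this:

```python
import torch.multiprocessing as mp
from torch.utils.data import DataLoader

if __name__ == "__main__":
    # Option 1: change the global start method before any workers are created
    mp.set_start_method("spawn", force=True)

    # Option 2: set it only for this DataLoader via multiprocessing_context
    train_loader = DataLoader(dataset, batch_size=16, num_workers=4,
                              pin_memory=True, multiprocessing_context="spawn")
```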
B
So what does a good implementation of this HDF5 dataset for multiple workers look like? That is what we have here. It's a bit more involved, but not much. First of all, we need the length of the dataset, so we need to open the file at least once and check how long it is; that is what this is doing, and I'm using a context manager here. The good thing about the context manager is that it will clean up after itself.
B
Then, on the first invocation of __getitem__ after the class has been instantiated, I actually open the file I want to read from, extract the dataset, and set an initialized flag to true, so that the next invocation of __getitem__ will skip this step. This allows you to open the file once per worker, and by that you basically mitigate the limitations.
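A hedged sketch of this lazy-opening pattern, assuming a single HDF5 dataset named "data", might look like this:

```python
import h5py
import torch
from torch.utils.data import Dataset

class LazyH5Dataset(Dataset):
    """Illustrative multi-worker-friendly HDF5 reader."""

    def __init__(self, filename, dataset_name="data"):
        self.filename = filename
        self.dataset_name = dataset_name
        self.dset = None                       # opened lazily, once per worker process
        # open the file briefly just to record the length; the context manager closes it again
        with h5py.File(self.filename, "r") as f:
            self.length = f[self.dataset_name].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.dset is None:
            # first call inside this worker: open the file handle here,
            # after the worker process has already been forked or spawned
            self.file = h5py.File(self.filename, "r")
            self.dset = self.file[self.dataset_name]
        return torch.from_numpy(self.dset[idx])
```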
B
So this was the data loader, and it is very good; it helps a lot to create simple, performant input pipelines. But what if you need to do a lot of pre-processing, for example if you have an image pipeline and you want to do rotations, flips, and so on? This is where DALI, the NVIDIA Data Loading Library, comes in.
B
It has an easy-to-use Python API and can talk to any framework on the back end: MXNet, PyTorch, TensorFlow, what have you. And the good thing is, it can run on the GPU, on the CPU, or on both, so you can put part of the pipeline on the CPU and part on the GPU, and it will execute everything as concurrently as possible for you. You can also use custom Python operators, so you can hook it up to any input data format you might want to use.
B
For example, this is a very simple and short pipeline reading images from an image store in JPEG format, decoding them, and then rotating them. This is usually the best case for DALI and will give you the best performance, since images, for example, are natively supported by DALI, which has very efficient file readers for them. So you do not need to create a Python wrapper around some other reader class, as I will show you in the next few slides.
B
So what does it do? Here, in the first line, it grabs the encoded images from the file, so it loads the images and the labels in encoded form. Then it throws the decoder at them and decodes each image into an array type you can use later in PyTorch.
B
Before emitting it, we do a random rotation, for example for data augmentation, and then we emit it as a DALI tensor object. By itself that is not very useful for PyTorch or TensorFlow or any other framework, but DALI provides iterator wrappers that are tailored to each individual framework. So if you then want to use this pipeline and hook it into your favorite framework, for example PyTorch, you can instantiate the corresponding wrapper class and pass this pipeline object to it, and it will basically give you a data iterator.
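The slide's code isn't reproduced in this transcript; a hedged sketch of such a JPEG pipeline plus the PyTorch iterator wrapper, using the functional DALI API of more recent releases (names and the directory layout here are illustrative), might look like this:

```python
from nvidia.dali import pipeline_def, fn
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def jpeg_pipeline(image_dir):
    # read encoded JPEGs and their labels, decode partly on the GPU, then rotate
    jpegs, labels = fn.readers.file(file_root=image_dir, random_shuffle=True, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.rotate(images, angle=fn.random.uniform(range=(-30.0, 30.0)), fill_value=0)
    return images, labels

pipe = jpeg_pipeline("/path/to/images")      # hypothetical folder-per-class image store
pipe.build()
train_iter = DALIGenericIterator([pipe], ["data", "label"], reader_name="Reader")

for batch in train_iter:
    images, labels = batch[0]["data"], batch[0]["label"]
    # ... feed images and labels to the network ...
```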
B
You have to define some metadata for DALI to use, because this is not a native DALI object; it needs to know, for example, how big your dataset is and where to start, and you can provide that with this function. Here I implemented the random crop function as a class member, so not inline in the iterator's __next__.
B
You can do it however you like, but essentially it still uses NumPy syntax to extract a random crop out of this bigger array. And this is the most important part: the iterator's __next__, which is very similar to the __getitem__ of the dataset we saw earlier; it's just that here it's called __next__. Essentially, if you take the old dataset you wrote before, you can translate it almost one-to-one into this construct. You will also need to create a graph, because DALI works with graphs.
B
This one extracts the DALI tensors from the Python external-source object, and then you can decide whether you want to place them on the GPU or the CPU. Most DALI operators have both a CPU and a GPU backend, so you can execute them on either.
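A hedged sketch of this external-source pattern, feeding random crops from a big in-memory volume into a DALI graph (class and parameter names are illustrative), could be:

```python
import numpy as np
from nvidia.dali import pipeline_def, fn

class RandomCropSource:
    """Illustrative external-source iterator that emits whole batches of crops."""

    def __init__(self, volume, labels, crop, batch_size, iterations):
        self.volume, self.labels = volume, labels
        self.crop, self.batch_size = crop, batch_size
        self.iterations = iterations            # metadata DALI cannot infer on its own
        self.rng = np.random.default_rng(333)

    def __iter__(self):
        self.i = 0
        return self

    def __next__(self):
        if self.i >= self.iterations:
            raise StopIteration
        self.i += 1
        data, label = [], []
        for _ in range(self.batch_size):
            z, y, x = (self.rng.integers(0, s - self.crop) for s in self.volume.shape)
            data.append(self.volume[z:z + self.crop, y:y + self.crop, x:x + self.crop])
            label.append(self.labels[z:z + self.crop, y:y + self.crop, x:x + self.crop])
        return np.stack(data), np.stack(label)

@pipeline_def(batch_size=8, num_threads=2, device_id=0)
def crop_pipeline(source):
    # build the graph: pull both outputs from the external source and move them to the GPU
    data, label = fn.external_source(source=source, num_outputs=2)
    return data.gpu(), label.gpu()
```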
B
But of course there are also pitfalls here, namely around thread safety again and zero copy. Because this external-source data generator is something you implemented on your own, DALI cannot make any assumptions about thread safety and buffer persistence: it doesn't know whether the buffers will stay around long enough for DALI to simply do in-place operations on them, and that can be a significant performance overhead. For example, the profile on the right-hand side shows a pipeline accelerated with DALI.
B
For the example I showed, what you can see is that the rotation itself nicely overlaps with the forward pass and part of the backward pass of the network, so that's great. But between the backward and the next forward pass there is still a gap, and actually you see two things: an NVTX-annotated region and a gap. The NVTX-annotated region comes from inside my own external-source operator.
B
That part I implemented myself, and it has some overhead; it will be called in the main thread, because DALI doesn't know that this code is thread-safe. The other, white gap you see is an internal DALI copy: a DALI function being called to copy the data your external source provided into an internal buffer.
B
Fortunately, more recent DALI versions have a parameter you can specify, a zero-copy option, which basically tells DALI not to make a copy of the buffer. But you, on the end-user side, then have to make sure that this buffer stays around long enough. For example, in your class initialization you can create an empty NumPy array.
B
You read or copy your data into it, and just make sure that this array is always there and doesn't go out of scope and get destroyed. As long as you can guarantee this, you can specify the zero-copy option and you will get rid of this white gap. For the other gap there is still an issue: DALI cannot spawn threads to do prefetching for an external source.
B
In this case, what you can do is use Python's concurrent.futures and double buffering to implement some prefetching in your external-source implementation yourself. It's pretty simple to do; I don't want to show it here, but in general it is very straightforward, and by doing so you can remove this gap almost completely.
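For illustration only (the tutorial itself doesn't show this), a hedged sketch of such double buffering, wrapping any batch iterator with a one-element-deep background prefetch, might be:

```python
from concurrent.futures import ThreadPoolExecutor

class PrefetchingSource:
    """Illustrative double-buffered wrapper around a batch iterator for external_source."""

    def __init__(self, source):
        self.source = source                        # any iterator yielding (data, label) batches
        self.pool = ThreadPoolExecutor(max_workers=1)

    def __iter__(self):
        self.it = iter(self.source)
        # start preparing the first batch in the background
        self.future = self.pool.submit(next, self.it)
        return self

    def __next__(self):
        batch = self.future.result()                # wait for the batch prepared in the background
        # immediately schedule the next batch while this one is being consumed
        self.future = self.pool.submit(next, self.it)
        return batch
```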
B
Let me summarize the data-pipeline optimization part of this talk. I introduced two frameworks to you. The PyTorch DataLoader and Dataset are a very efficient way of feeding data to your neural network; they are built into PyTorch, so you don't need any additional software, and they are very flexible, because they can basically read whatever Python can read.
B
Basically, for every file format you can come up with, there is potentially a Python package to read it. It also allows for transparent prefetching when you employ multiple I/O workers, and it can do fast host-to-device transfers when you tell it to pin the CPU memory. But be aware of the multiprocessing pitfalls I talked about earlier. Also, if you have a long data pipeline with a lot of compute in it, a lot of operations, doing all of this in Python functions is maybe not the best idea.
B
DALI works best when you have a natively supported format, like most image formats, audio, or video, but you can also use non-natively supported file formats through the external-source operator, which, especially for scientific applications, is I think one of the most important parts of this framework. It has some performance caveats, as I discussed, but you can work around most of them.
B
There is one important thing I mentioned before: DALI does not do in-place operations most of the time. That means that if you have a long pipeline on the GPU, it might consume a lot of additional GPU memory on top of what your neural network already consumes, especially if you use a long prefetch queue, which you can also configure in DALI.
A
Now that we've covered the data pipeline, let's talk about how you can improve your compute utilization. One of the best ways to improve your compute performance on NVIDIA GPUs is to try out mixed precision training. Mixed precision training combines typical single precision compute with lower precision FP16 compute where applicable.
A
Fortunately, much of the handling for mixed precision has been automated via the automatic mixed precision, or AMP, feature that NVIDIA has contributed to the major frameworks. AMP makes applying mixed precision easy by automating the conversion of existing single precision training graphs to mixed precision, converting only safe operations to FP16.
A
Adding AMP to an existing training script is very straightforward and requires just a few lines of additional code. On the right is an example of adding AMP to a PyTorch training script, with the lines added for AMP highlighted in green. To break it down, the feature is split into two components. GradScaler manages the automatic loss scaling, dynamically adjusting a loss scale during training to keep gradients within the range of FP16 and skipping weight updates when necessary.
A
The autocast context manager applied to the model takes care of converting safe layers to lower precision. With these lines added, your script is now set up to run in mixed precision and take advantage of Tensor Cores; enabling AMP alone can result in large training speedups.
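The slide itself isn't reproduced in this transcript, but the pattern looks roughly like the following sketch (model, optimizer, criterion, and train_loader are placeholders):

```python
import torch

scaler = torch.cuda.amp.GradScaler()              # manages dynamic loss scaling

for inp, target in train_loader:
    inp, target = inp.cuda(), target.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # run safe ops in FP16, the rest in FP32
        output = model(inp)
        loss = criterion(output, target)
    scaler.scale(loss).backward()                 # scale the loss so gradients stay in FP16 range
    scaler.step(optimizer)                        # unscales gradients and skips the step on overflow
    scaler.update()                                # adjust the loss scale for the next iteration
```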
Even if you see a speedup, though, there can still be room on the table for improvement. Don't forget Amdahl's law, since AMP only speeds up operations on the GPU.
A
Your final speedup will be limited by the remaining work not impacted by AMP, like your data input pipeline or remaining work on the CPU. The speedup achieved for GPU work will depend on the relative proportion of compute-bound to memory-bound operations in your workload. As a first step, re-profile your training script and re-evaluate the bottlenecks.
A
Besides this, there are ways to make your network more Tensor Core friendly. Specifically, favor multiples of eight when sizing linear layers and convolutions. While cuDNN 8 and cuBLAS in the CUDA 11 toolkit can now use Tensor Cores on problem sizes outside these constraints, they are still more efficient if you follow these rules. Finally, avoid small GEMMs with dimensions less than 128, as these are memory bound.
A
To conclude this section, here are some miscellaneous PyTorch tuning tips. The first tip is to enable cuDNN autotuning. This allows cuDNN to benchmark and use the fastest implementations it has available for your workload. The next two tips are useful for reducing the impact of memory-bound operations.
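In PyTorch, the autotuning tip amounts to a single line, shown here as a minimal sketch:

```python
import torch

# Enable cuDNN autotuning: benchmark the available convolution algorithms on the
# first iterations and reuse the fastest one for each input configuration.
torch.backends.cudnn.benchmark = True
```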
A
While we couldn't cover all the details of these topics in this short talk, here is a collection of additional resources that we encourage you to check out for more information on profiling, I/O optimization, and mixed precision training. Thanks a lot for listening. Please let us know if you have any questions, and let's move on to the next part of the tutorial.