National Energy Research Scientific Computing Center (NERSC) Deep Learning for Science School 2020, 13 Sep 2020

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Week 8 - Distributed Large Batch Training - Swetha Mandava

Description

Swetha Mandava from NVIDIA talks about Distributed Large Batch Training at the Deep Learning for Science School 2020.

More about this lecture: https://dl4sci-school.lbl.gov/swetha-mandava
The Deep Learning for Science School: https://dl4sci-school.lbl.gov/

A

Okay, uh so welcome everyone to another, deep learning for science uh lecture. I'm very pleased to have sweta mandava with us today um to give us a lecture on distributed. Large batch training um in pytorch and sweta is a senior deep learning engineer at nvidia. She develops optimized, deep learning, algorithms for applications in nlp and computer vision.

A

Sweater received her masters in electrical and computer engineering, focusing on machine learning from carnegie mellon university sweater. Thank you so much for joining us and very excited to hear your lecture for everyone on the call. Please remember that you can ask questions in the q and a part of the zoom and the slides have been posted to slack I'll, also post them again to the chat on zoom here. So thanks sweater, please welcome.

B

Thank you so much for the introduction, mustafa um hi everybody good morning uh and welcome to my talk on distributed large patch training. uh Let me again start off by thanking the organizing committee for inviting me to give this talk and all of you for joining this morning, in spite of the crazy times uh in in the bay area. uh So I'm shwata, I work in the deep learning algorithms team at nvidia.

B

We work on scaling and optimizing multiple, state-of-the-art, deep learning, algorithms using both algorithmic as well as framework and hardware optimizations.

B

Let me start off this talk by setting up some motivation as to why it's useful, by looking at some of the landmark deep learning models that we have today.

B

So I graduated about two years back and one of the first deep learning, algorithms that I had to code up in school was to predict digits in the mnist data set. It's a simple cnn network to predict numbers given the image and to do a simple, hyperbolic meter search took me about three days of compute on my computer in comparison, let's look at the scale of deep learning models that we have today. Alexnet and resnet that came out a while back have 60 million parameters to predict the class of an image from imagenet.

B

This is already a few orders of magnitude, bigger than the model that I used and since 2018 we have seen an exponential increase with gpt with 110 million, but with 340 million megatron with 8.3 billion, and it's only september, and we already have gpt3 with 175 billion parameters, almost 3 000 times the size of alex net and looking back, if I didn't take scaling seriously, it would have taken me the entire, uh a considerable amount of time in my education to just run one of these algorithms.

B

So my talk today is about scaling your deep learning model both to make your application effective, but also to improve your own efficiency and enable an experimentation culture so that um people can try out different models and different ideas.

B

So today we will go through um a bunch of simple tricks that you can all apply in your workflow. First, we will start off with a very simple deep learning network called ncf and optimize it within a gpu with a hands-on task.

B

And second, we will talk about a much bigger language model called bert and discuss tricks that we at nvidia use to optimize it, and some of these tricks are pretty simple they're as easy as using an api, and some of them take a little more time and effort. So the goal of today's talk is to give you at least a couple of tricks that you can take away and add to your own models.

B

um So the first part of the uh so the first part of the talk we will work with an ncf recommender system and we'll go through some of these optimization tricks, um and the reason I chose ncf is because we are using recommender systems every day today and they're quite popular. We see and use them everywhere and another reason to use this is because it's small enough to fit into our allotted time today. So neural collaborative filtering is a very simple uh dnn recommender system.

B

That's um that used the complexity of a deep learning neural network with matrix factorization to be state-of-the-art. So it's um so. As you can see here, we have users and we have items and on one side they are uh sent into a matrix, factorization layer and on the other hand, they are sent into a bunch of multi-layer perceptron layers and at the end they are concatenated and we receive an output score of whether the user will click on this item or whether the user will not click on this item.

B

So let me go ahead and go to um our ipython notebook to go through some of these tricks and but for today uh we will treat both ncf and the training algorithm as black boxes, in the sense that we won't code it up we'll just run through them to see our output. But I will share the link to um this repository so that all of you can play with it at home.

B

So let's go ahead and take a quick look uh in order to train this model. Today, I'm using stochastic gradient descent, it's a very common optimizer and in uh and in cell 5 over here, I'm just processing the data in the sense that I'm loading all the users, I'm loading all the items um and putting it in a required format. So if we look at this, we can see that we have about 140 000 users that is divided into test and train, and we have about 30 000 items in cell 6.

B

I will go ahead and initialize the model. So if you look at the output from the model, you can see that we have about four embedding layers and four linear layers, and the model itself consists of about 30 million parameters.

B

So I went ahead and ran in cell 4 the training with an arbitrary batch size of 4k.

B

So one of the things I uh I want you all to notice from the output is that after each epoch, I am returning the hit rate, which is the accuracy of how correct we are with our predictions, I'm returning the train throughput as well as the train time.

B

I trained this model for about 10 epochs, I'm gonna scroll to the end, and the thing I want you all to notice is the best accuracy that we reached after denipox. So we can see that we are at about 90 accuracy and the time we take to train the whole model is about 1300 seconds.

B

So the goal of this uh notebook is to retain the accuracy that we have over here without without reading the accuracy that we have over here by improving the time to target. That is, we want to reduce this 1384 as much as possible um yeah. So one of the simplest ways, as you know, to decrease wall clock time is uh increase the batch size, and this is because of multiple reasons. So, let's take an example of comparing batch size 1 to patch size, 10 and, let's say we're processing around 10 images.

B

So in the second case of batch size 10, we only need to access the weights once to process all 10 images.

B

So if your perf is limited by reading the weights, you can process all these 10 images with accessing the weights once and in other words it reduces the communication overhead and you can also increase the parallelism by the gpus if you use a big batch size by using computationally intensive routines like matrix multiplication, so one of the easiest ways to increase your throughput is by increasing your batch size, but in this graph over here, I've shown some relationship between the batch size and validation error for imagenet and, as you can see, with increasing batch size after a point, your validation error starts to go up as well, and this is because, uh as you increase the batch size, you are using generalization properties of your model.

B

um So let's go ahead and see if that applies to our network. Here, I'm simply scaling the batch size by 16 arbitrarily and initializing. The model and optimizer as we did before and I will train the whole network again.

B

So one of the things again, I trained it for just 10 epochs and you can see that in the end our accuracy fell down to 83.

B

But we can also notice that our time to target went from 1300 seconds to 170 sorry 170 seconds. So we can see that we um get about an 8 to 9x speed up, but we also lose uh seven percent of our accuracy.

B

So this um kind of sets the stage for our first uh trick that we want to use, which is the linear scaling rule. uh I try to explain the intuition behind linear scaling rule with three simple and, I hope very clear images. So in the first graph here let's say you have you're using the learning rate of one and that size of one and let's say you want to look at ten images.

B

So, as you can see after every image, you take a step of size, one and eventually you'll reach the global value of them and in the second graph over here, I'm using the same learning rate. But I have a batch size of two and in order to look at the ten images uh I will, I will only need to take five steps because our batch size is two and you can see that at the end of one epoch we only reach a global value of five.

B

So the linear scaling rule is quite simple. It basically says that um if you are scaling your batch size by k, you should also scale your learning rate by k. So that's exactly what we did in the third image here. So if your lr is 2 and path, size is 2. You end up at the same global value.

B

Of course we are taking a lot of assumptions here. For example, one of the assumptions we're taking is that the first two steps that batch size is equal to one takes is equivalent to one step that batch size is equal to two takes, which is not always correct, and we will see how to fix that issue in in the following tricks.

B

um So I went ahead and uh did what we just learned.

B

So I started off with that size scaled by 16 and learning weight scale by 16 as well, and um I initialized the model I initialized the optimizer and I trained uh the whole thing for 10 epochs again and as you can see, we retain our accuracy. We come back to 90 um and our time to target is at 172, so which is great. We already have a 9x speed up.

B

Let me go ahead and scale the batch size by 192., so in the previous experiment that we did, I arbitrarily scaled the batch size by 16, but a good rule of thumb to scale your batch size is basically to scale it till your hardware allows it, and in this particular example, I'm using a v100 and v100 allows me to have a batch size of 4k into 192.

B

So that's exactly what I did and as we've learned before I scaled the learning rate, also by 192, and I initialized the model and optimizer, and I started the training once again and, um as you can see after 10 epochs, my time to target went down from 170 something seconds to 130 seconds. So something I want you to notice is that we did not get the same speed up going from 4k to 16x, as we did going from 16x to 192..

B

So the um the lesson from this is that we get diminishing returns after a point by increasing the patch size, and um another thing I want you to notice is that our accuracy has fallen to 75 percent, uh even though we scaled the linear scaled, the learning rate using the linear scaling rule, our accuracy has still suffered from scaling out to 192x, and the reason is because of one of the assumptions that we made earlier that I spoke about.

B

So the assumption that we make with linear scaling rule is that the um the steps that batch size, 1 and batch size 2 take are equivalent, but it is not the case, especially in the beginning of the network, in the beginning of training.

B

So in the beginning of the training, um the model starts off from uh random initializations and uh changes quite rapidly, so the warm-up rule is uh really empiric empirically proven, and um the intuition behind this rule is that in the beginning of your training, you want to take really small baby steps, just because uh the gradients that we're getting is very, very noisy um and using that intuition they tried to scale the learning rate.

B

So, for example, if you have in our linear scaling rule, we scaled the learning rate by 192x, but the warm-up rule says that we shouldn't start off with one id to x. We should start off from zero and slowly scale up to uh 190 to x in the warm up iterations.

B

So that's exactly what I did uh in cell 12.. So if you look at the function here, I'm basically saying if your iteration is greater than the warm-up iterations, I'm just going to have the learning rate that we decided on. But if it's less than warm-up iterations, I will slowly scale up my learning rate.

B

So I went ahead and did that where my patch size is 182x, my learning rate is also 192x, but in my utilstar train function I also pass the warm-up function and when I do this, I can um scroll down to see that my learning rate has recovered a little bit uh as in it went from around 74 percent to 83, which shows us that there's still uh another trick that we can try to use to um to get back the accuracy that we've lost.

B

um So that brings us to one of the last uh rules that we'll talk about in this notebook called lars. um Actually, maybe I can stop for a couple of minutes to take any questions that you have before I go into lars.

B

um So I see one question here which says: how many epochs should you do warm up any rules of thumb? um Ideally, uh you can start off with one third of the uh entire training to be warm up, but um you can sometimes get away with much lesser warm-up as well.

B

um Okay, uh maybe I can go ahead with lars, but feel free to ask any questions that you have um okay,.

A

Maybe I can ask a follow-up question to that. Is there? um um Do you see in practice that you, you need to tune this warm-up period that like, if you is the essentially the final uh accuracy sensitive to how long you do the warm-up or is it or the warm-up is only about the stability of the training in the beginning of the.

B

Up we do treat warm up as a hyper parameter, um but uh the the good thing about warm up or like if we increase the warm up it allows us to essentially use a higher learning rate. So that's something we use to um converge faster, but I think the accuracy should be um should not vary all that much.

A

Let's see, okay, thank you.

B

Okay, cool um so moving on to lars, um so lars is lars or um layer. Wise adaptive, uh great scaling, optimizer is, um uh is a wrapper around the standard sgd that we've been using up until now, so the standard sgd, as we know, uses the same learning rate for every layer and every pattern, every parameter and that's an issue. So let's look at this update equation that we have over here, so your um x, k, plus 1, is basically x, k minus your learning rate into your gradients.

B

So let's take, for example, when your gradient is really really high for an outlier. um The your xk plus one is completely uh changed because of your uh really high gradient.

B

If your learning rate is not uh scaled accordingly, and especially in the beginning of the network, when you uh are prone to noisy gradients, this becomes an issue because even one stray update can completely change the meaning of your uh parameter and in the lars paper, they've tried to observe the magnitudes of these weights in each of the layer, and they realized that, for example, when you're training, alexnet or cnn model, the first cnn layers l2 norm of weights, is around six and the last one is around 1400, which is uh which brings about the point that we cannot have the same learning rate for these completely different magnitudes of weights.

B

So in the lars paper they try to have a thrust ratio lambda where essentially, they are um dividing the l2 norm of the weights by l2 norm of the gradients so take, for example, when your weight is really really small and your gradient is really really large. Your lambda l will kind of adjust itself so that um your learning rate becomes smaller and see uh same with the case for when your um weight is really really big. But your uh gradient is really really small.

B

Your lambda readjusts itself so that uh it matches the magnitude- and this allows us to uh this- allows us to scale higher with batch sizes. As you can see in the images here um with batch size, 8192 and lars, we can see that alexnet retains its top one test. Accuracy and- uh and the good thing about lars- is that uh the magnitude of the update right now doesn't only depend on the gradients anymore. So it um allows us to not uh diverge when we scale higher.

B

So let's go ahead and uh initialize our lars optimizer um and I'm again using the library to implement lars, and once I uh initialize the model and the optimizer with lars. I trained the model with the same data and same parameters as we've used before for 10 epochs, and we can see that we have achieved a 90 percent accuracy that they've been chasing and we went from uh 1300 something uh seconds to 135 seconds without losing any accuracy.

B

So we already have about 10x speed up here, um which is great, and it's so right now now that we have scaled it up to as high as we can and um retain the accuracy.

B

I will move on to the next section, which is computational tricks, uh and this is one of our favorite checks called mix, precision, training, it's really simple to use, and the idea behind that is that all the training that we've done up until now has multiple tensors in the form of inputs, activations, gradients and weights, um and they have all been represented in fp32 representations, um 32 rep. Basically, that means 32 bits to represent each floating point number that we have, uh and in this section we check if that's really required.

B

So we look at how to use fp16 floats to train our network today, instead of fp32 and as always, the goal is to maintain our accuracy, but also speed it up and using fp16 allows us to do just that, because not only does it give us increased throughput right off the bat because we're using 16 bits instead of 32 bits, but it also reduces our memory footprint so that we can uh use even bigger models and batch sizes, and here are some examples of how amp speeds up training in some popular networks.

B

So, as you can see here, resnet gets a speed up of more than 3x and bert also gets a speed up of more than 3x by simply employing amp into your training routine. um So what is the catch right? Why haven't we always been using fp16 instead of fp32 um and look, let's look at the problems with fp16 training uh depicted in this particular graph.

B

uh This is a histogram of all the gradient values for a model called ssd, and everything on the left of the red line is not representable in the fp16 range and everything on the right of the red line is representable in the fp16 range, and all of these gradients are representable in fp32, but it becomes a problem when we use fb16, because 31 of the gradients or all the gradients that are left of the red line becomes zero, and when we zero out 31 of the gradient values, we um make it uh diverge.

B

We make the model diverge, um but the interesting fact is that we see a massive uh area of the representable range that we have not been using at all. So everything on the right of the blue line is actually representable in the fp16 range. It's just that we have not been using it because of the um properties of our gradient values.

B

So one of the new tricks that that was discovered is that a really easy way to represent all of these gradient values is just to move this mountain a little bit to the right, and we can do that very simply by multiplying the loss with a loss scalar. So, for example, if you multiply the loss with x when you back propagate this loss, all of your gradients are also multiplied by x, and essentially you will be moving all of these gradient values to the representable range of your fp16.

B

So um we can still converge with fp16 precision. um So now the question becomes: how do you choose this law scalar value right um and for some models? It can just be a hyperparameter. It can be a static law scale value that you can multiply your loss always with, but an easier way to do. It is dynamic loss scaling. So, for example, you can pick a a value of law scalar and let's say if your mounting overflows. That means it moves too much to the right that uh it is overflowing.

B

You can just reduce your loss, scalar value by 2x, and if your model has not overflowed in, say 1000 iterations, you can try increasing your law scale value iteratively, um so that kind of sums up the idea of mixed precision training. um So in order to enable mixed position you you need to basically put your model into your mixed position, type and uh scale. The loss scalar perform last scaling before you do the back propagation. um One quick uh note to see is that sometimes porting the model to fp60 in position is not safe.

B

Even if you do the law, scaling say, for example, in batch norm layers. So it's important to put only layers that are safe for fp16 um and the good thing is that there's an api for this, you don't have to actually implement what's safe and whatnot. What's not uh in most of the frameworks today like pytorch, mxnet and tensorflow, there's a simple api that you can use uh and in this particular uh code, snippet uh I'll show you exactly how so in in here.

B

You can see that I'm wrapping the model and optimizer with uh amp, so I just say: amp dot, initialize the model and the optimizer, and it simply it simply puts the safe portions of the model into fp16. And then uh I implement law scaling by just saying: um amp, dot, scale, loss of the loss and the optimizer. So I pass both the current loss value as well as the optimizer with its gradient values.

B

So it picks um a loss scale accordingly, and then I just um call backward on the stage loss and um again I try to just train the model with the current model and the optimizer along with our warm up and scaling function, and you can see that at the end of epoch, 10, our accuracy is around the same.

B

It's 90, but our time to target has gone down from 130 to 70, which is a really simple, 2x speed up without um by just using an api call um and that kind of sums up the um the notebook section of our talk today. uh Maybe I can just pause for a couple of minutes to take questions, um so I see a question that uh all of these strategies limit the excursion size. If I understand correctly, isn't there a danger of finding a local minimum rather than global minimum?

B

That is, in effect to be dependent on the initial conditions.

B

um So that is true in the sense that um so, if your problem is non-convex and if you increase the batch size by too much, then um you do run the risk of going into local minima, but um and and we do have limits. For example, when we tried to scale birth, we were not able to. uh The original publication uh came out with a global batch size of 256 and we were able to increase that to about 96 64 to 96k.

B

But beyond that we do see a loss in generalization and we do see, um loss inaccuracies. So I think all of these tricks do have uh um a saturation point beyond which it's it's still hard to uh scale.

B

um Is there advantage to explore mixed position where weights and activations are fp16, but accumulators are still kept as fp32? Yes, that's one of the tricks um that amp actually uses, for example, uh it ports all the um layers that are safe into fp16, but um puts all the um all the layers that are not safe. Still in fp32, like you mentioned, for example, accumulators, it still tries to keep them in fp32.

B

Cool um okay, awesome, so um let me just summarize the learnings that we have so far. um The idea is that um it is very easy to get.

B

uh It is very easy to get a throughput increase by improving increasing your batch size, but increasing your batch size comes with its own issues, of losing a generalization and accuracy like we've discussed and um some of the tricks that you can apply to alleviate.

B

This issue is lr scaling warm up and layer, wise adaptive optimizers to allow you to go to a higher path size and another trick that we learned about today is amp, which is automatic mixed precision, training which allows you to retain the accuracy, while also getting about at least a 2x speed up.

B

Cool um okay! So uh now that we've completed that uh the um ncf training on a single gpu, maybe we can move on to multi gpu and multi-node uh experiments with um word.

B

Okay, so in this next section we will be talking about the bird model that has been a landmark model in nlp when it first came out. um The original publication took about four days to pre-train the model with um a global patch size of 256, and our team and nvidia tried to showcase optimal design techniques by scaling it up to take only about 47 minutes, which is a huge accomplishment.

B

So let's go ahead and discuss some of the techniques that we used, um even though they are being discussed in the context of birth. The techniques I talk about today are generic enough and can be applied to any deep learning model.

B

um So, on a high level to have a successful, highly performant multi-node system, you need three things. The first is the optimized software stack, um so optimize system, design and data center management. So let's take a minute to understand each of these techniques. So, first and foremost, we have algorithmic optimizations.

B

um This is everything we can do within a single gpu to have a highly performing model and we've already discussed some of them using ncf as an example, um and then we have the system design so consider the case of using more than one gpu. We then have to think about the communication between the gpus gpu to cpu ratio, etc, and then, when we take it a step higher to multi-node systems where we need the whole software stack to run on a cluster.

B

So there are multiple aspects in achieving a multi-node success and we'll mostly delve into the algorithmic optimizations today and I'll briefly touch upon the others as well um yeah. So first we have convergence and stability.

B

We have already discussed some of the optimizer text in mix and mix precision already, but now uh let me introduce another adaptive, optimizer called lam that we used in bird we've seen in practice that, while sgd works well for computer vision, tasks, adam is the go-to optimizer for nlp and land can be seen as an extension of lars applied to adam instead of sgd.

B

So here, for example, they compared lars and lamb side by side. uh On the left hand side. You see that, as we've discussed before um on the final step, we basically scale the learning date with l2 norm of your base by l2 norm of your updates and on the right hand, side. We do something similar, but we do it with the first order, momentum and second order, momentum, mt and vt values.

B

um So this is something we had to use in work to scale up the model from using 256 to 64k patch size, um but we also had to make some changes to this optimizer to actually get it to work. On the left hand, side we added gradient pre-normalization.

B

So, for example, uh before we do anything with the gradients, we normalized the entire gradients of the model by the l2 norm of all the gradients variants, and we saw that is actually quite important to do this. Otherwise our model would diverge pretty quickly and and the reason we think this is necessary is because in large batch settings where the direction of your gradient is largely preserved, we don't want um the the gradient values to be too high, um and- uh and this also alleviates the uh exploding gradient problems and on the right hand, side.

B

We show the results with bias correction, even though the lamp paper does use bias correction. They mentioned that without bias correction, they were able to converge, okay and but that's not something. We noticed. We see that the implicit bias of beta, 1 and beta2 is actually pretty strong uh without bias correction. We see that it diverges pretty quickly.

B

B

So that's a kind of that kind of wraps up our work with optimizer tricks today, um so we can move on to the uh the software stack section of the optimizations, um so in a regular uh uh back propagation for model training. We see something like this, where you basically have your forward prop and your update of the weights and your um uh backward prop, but, um as you can see in the as you can see, with the green portion of the timeline, uh we're wasting a lot of gpu time.

B

By simply waiting for this, uh I operations to complete and we've noticed that, just by pipelining, these uh not pipelining, sorry overlapping. These uh I operations with computation. We see a pretty good speed up and a high utilization of your gpu. So this is something you can try out as well um and the next thing that we've noticed that really helps with performance is fusing kernels. um So the the thing about a lot of the frameworks that we use today, like pytorch and tensorflow, use pretty low level uh operations.

B

So, for example, if you want to use the gel, u activation, it results in seven kernel calls.

B

But if you can reduce all these seven kernels into one kernel, it reduces the overhead of launching all of these kernels but also improves the memory locality. um So this is a more uh um complicated trick to implement than the ones that we've discussed so far, but we've seen that it actually does help. So this these are the results we got from students at the vector institute that kind of match the results that we got as well for burt.

B

So if we start off with the baseline model and uh apply fp16, we see about a 3x speed up for bert, but if we also fuse some of the kernels like yellow, we see that we increase the speed up by 3.7 3.75 x, which is awesome.

B

So the next part of the talk is scaling to multiple gpus um and the simplest way to scale to multiple gpus is to use data parallel training. So, for example, if you have x gpus, we provide a batch of data to each of these x gpus.

B

We perform forward prop uh locally on a particular gpu, and then we do an nccl all reduce to uh collect all the gradients from all of these gpus and nvidia implements an nccl communication library that does this already use efficiently, but we can see that when you look at the timeline of this already use operation, you usually have a forward prop a backward prop and then an all reduce between all of the gpus. Before you can do.

B

A bait update and like we have discussed in the communication timeline before this is inefficient, because all of the gpus are waiting for this already used to complete before they can move on to the next compute parts of the training algorithm.

B

So something that you can do to alleviate this is use is overlap the already used with backward propagation. So, for example, if you, if you're done back propagating loss through the nth layer, you can start all reducing it, as you continue doing, the backward prop through n minus one layer, and the good thing about this is that you don't have to actually implement this yourself. You can simply use uh apex or distributed data parallel wrapper to your model.

B

So all you have to do is say: model is equal to ddp of model, and it's taken care for you. So what ddp in the in the background does? Is it? Does the um it overlaps the reductions with your backward propagation? So it um improves the utilization of your gpus and it also does fp16 reductions if you've activated amp, so instead of um porting all of these gradients to fp32 and then or reducing it and putting it back to fp16, it directly does the reductions in fp16, which is pretty cool too.

B

So the next uh issue we see is that, even though you overlap your um backward with nccl already use, it still results in a significant uh time lapse between which you can uh before which you can do weight update, and this is usually the case when you have really slow interconnects or if your c gpus are connected with or if your multi nodes are connected with a low ethernet connection and one of the ways in which you can fix this issue is by using gradient accumulation.

B

So gradient accumulation is a simple trick by which you can do multiple forwards and backwards before you actually have to all reduce. um So in this particular example, we are say doing two forwards and two backwards before we do an all reduce, and what this essentially does is, let's say if your batch size for each forward prop is x uh and by doing two forwards and two backwards before you all reduce you are emulating a batch size of 2x.

B

But again, all of these uh this particular trick. Now that we are emulating a higher batch size. We also have to take care that our convergence is not affected by this trick, but we've seen that this really really helped us with birth. For example, like I mentioned before, the batch size that originally google was using was 256 and we scaled it up to 64k or 96k, but the maximum batch size that you can fit within a gpu is only about 64 for bird, because it's a big model.

B

um In that case, we used gradient accumulation quite extensively to increase the throughput of our model.

B

um So here we've tried to compare the results from parent accumulation so on the left over here, we ah so um on the left over here, we've seen the scaling um with gradient without gradient accumulation for one machine and one gpu, all the way up, till one machine and four gpus and on the right we are seeing gradient before gradient accumulation for one machine and one gpu, all the way up to four machines and one gpu.

B

So the difference between the left side and the right side is that on the right side, we have slower interconnect speeds because we're using we're using gpus from different machines and, as you can see, we really suffer because of the low interconnect speed.

B

um And in this slide we see with gradient accumulation and again on the left side. We have one machine, four gpus and on the right side we have four machines and one gpu. The right side is with a much lower interconnect speed and you can see that it scales quite well even with low interconnect, speeds.

B

um So, coming to one of the last multi gpu tricks that we've seen so usually for training these massive models, we have uh massive data sets. So, for example, if you have one input file with all the data, it is highly inefficient because each of the gpu loads, this massive input file, which is not efficient. So one of the ways in which you can optimize this is by splitting the input files into shards so that each gpu only has to load uh what it absolutely requires.

B

um So, lastly, to take it one notch higher and scale to multiple nodes. Like we discussed, we need proper input, node communication, and we should also consider moving data close to compute so that we don't suffer with low interconnect speeds, um for example, moving the charts that a machine needs closer to avoid data movement over the network or ethernet.

B

So you might also want to build the whole application and system software stack to deploy. Algorithms on multiple nodes, um manage job allocations and queues.

B

So to conclude, uh dl models will continue to grow in size and they require massive scale out. That requires careful consideration on multiple aspects. um Some of them are really low effort as simple as using an api, so we should really consider incorporating them into your stack to improve the perf, but also your own productivity um and nvidia does provide multi-node and deep learning solutions, and most of our work is open sourced in this particular github repository.

B

Please feel free to check them out and leave us your feedback um thanks for your time- and I will continue taking questions from here on.

B

uh So the first question I see is is ddp capable of launching multiple processes and multiple nodes simultaneously. In the example, I've seen, it seems like you need to spawn multiple processes in a single loop process. uh Yes, that's right! So gdp allows you to communicate between uh all of these processes, but you still need to spawn multiple processes from a single root process. So, for example, if you're launching a training algorithm on four nodes, you need to launch your so for pytorch.

B

For example, you can use distributed, launch to launch all of these processes and communicate with them using.

B

Ddp, um the second question is: uh how are results with gradient accumulation of two differ from increasing batch size by a factor of two. uh I think they should be similar, so um the gradient accumulation of two is essentially increasing the batch size by a factor of two, but they are using used in different uh scenarios.

B

So, for example, in your hardware, if you can increase your batch size by a factor of two, you should totally do that, because that results in only one forward and one backward prop, but gradient accumulation is used when you can't increase your batch size by anymore, but still want to um optimize it by reducing the um the uh the lag that we saw. So in that case, you can try gradient.

B

B

um So the the next question is that algorithmic capacity can limit.

B

Hyperparameter optimization, um I'm not quite sure what that means.

B

B

So what exactly do you mean by hyper parameter.

B

B

Adjusting layers.

B

um I'm not quite sure.

A

B

A

I can bring gyan to panels to ask his question, maybe with audio.

B

A

Yeah ian, you cannot talk.

C

uh Hello, uh sorry, for being so cryptic, you have discussed today many ways to improve conversion. So, and I was curious about this time, you have used the algorithmic capacity, so this would. How would this bring what what what different aspect does this entitled? uh Apart from what you discussed today,.

B

uh So, do you mean uh what are the other things you need to consider with scaling your batch size.

C

Not necessarily batch size in general, you have brought many tricks to improve the convergence, not just batch size right, so where the algorithmic limitations are, you think.

B

um Right so I think uh algorithmic limitations could just be uh dependent on the model itself like, for example, we're seeing uh multiple uh optimizations for the model, let's say, for example, going from um bird to uh gpt3. We see massive improvements in the accuracy, uh so that's another limitation of the model or algorithm itself that can be improved as we go forward. um Thank you. Yeah.

C

Okay, I was just curious. Thank you.

B

Thanks um so I see another question we'll be using bert in my company for topic: extraction, classification and customer sentiment analysis using pytorch. um What are the advantages disadvantages of python versus tensorflow in case of birth implementation? um So we have at nvidia open source both by torch and tensorflow. For that and they have comparable performance.

B

um I guess the only considerations would be of what fits well into your ecosystem, say. For example, pytorch is easier to experiment with, whereas um tensorflow I uh it seems like fits into. A lot of companies is original ecosystem, so it ah yeah there's no inherent limitation of either of these.

B

D

If I could follow up on that, um it seems like nvidia uh really likes to work with pytorch like in the mlperf results. It's mostly pytorch implementations, with the exception of mxnet for resnet. Can you comment on why this is it's just that nvidia developers have a preference for working with pytorch because it's maybe nicer to work with or are there actual like? Is it? um Do you think it's easier to get let's say like compute performance gains out of pi torch versus tensorflow nowadays,.

B

Right, um I guess I can talk for myself here. um I definitely prefer coding in python because it's much easier to work with, but that being said, I think uh nvidia has been coding in uh pytorch and tensorflow, and now it's um also developing in tensorflow too.

B

I guess yeah, it's just a matter of personal preference, because we don't really. We see that a lot of our customers have a preference for tensorflow or pytorch, depending on what they've been using up until now, but because uh for ml perf, specifically, we don't really um have to stick to one particular framework. It kind of just depends on the developer, I guess- or the team.

A

A

um If I ask, if I may ask about the hyper parameters for um the optimizers and for the models that you're looking at from your experience, do you if we are, if I'm, for example, going to to try to scale a completely different problem, something that is not standard, not using resnet, not using any of uh bert or any of those, but like a custom architecture?

A

And we want to scale it from a few gpus to let's say hundreds of thousands or thousands of gpus there's a lot of optimization that has gone into making actually resnet or or bert work at that scale.

A

um From your experience like have you actually seen cases where people are trying to apply the now golden rules for how to scale things and do a warm-up and use certain optimizers and all of those things to a completely different um domain on architecture? And is there something that you can say about that.

B

Yeah um so, for example, even changing uh the data sets for bird, we will have to redo the hyper parameter optimization.

B

So I think a good rule of thumb is to just start with a single gpu to see that you are using the maximum batch size that you can and then um scale it up to how many other gpus that you want to use it with, and uh I guess we usually start with um some uh known hyperparameters like, for example, when we were trying to do biopert with biomedical data.

B

We started off with hyperparameters that were used in bert, but, um as you might have expected, they don't work off the shelf for different data sets or even different models. So I think again it comes back to just um doing the hyperparameter search, um but starting from a point that we know work for similar models or similar data sets.

A

If you, if on this example, if you get it to conversion like to get to a reasonable accuracy on a single gpu, of course, you might not even pass through all the data and all that, but and then you want to scale it to multiple gpus which of the hyper parameters. Would the the the model be monsters or conversions be more sensitive to that? You think that one needs to optimize those at scale right.

B

Right um so I think, uh like we've discussed today, the hyper parameters that I would first search for are learning rate and warm-up steps. um Momentum and betas usually don't affect all that much. Maybe those are hyperparameters that you want to tune in the end for very small games but yeah. I would start off with uh learning great and uh warm-up steps too, with.

A

B

A

One follow-up question on this: I I've noticed early that lars or lark before lamb that they work well with a large batch size, but they don't tend to work well with a very small batch size. Is that something that you've experienced?

A

So we ended up having to do adam on small by size and then using lars on a very large batch size.

B

um So I have not tried lars on small batch sizes, but I have tried lamb with a small batch size. So, for example, I did uh lamb with a really high batch size like 96k, but I also tried lamb with a global bad size of 256 and they seemed to work uh um just almost similar.

A

I see so it could be um a reasonable strategy. At least you would think too. If I'm, if I I've, designed my model design everything, then I'm using adam, then the first thing I should do is at the scale of a single gpu. I can switch to lam uh optimize the parameters for lab and then try to scale. You think that that's a sound strategy.

B

Yeah, I think that should work.

A

C

A

A

Okay, I mean: are there uh any more questions.

D

uh I guess I could ask another one, so um you know it's great. That nvidia is, uh you know working on so many different aspects of deep learning and really um kind of pushing on you know the software, the hardware and also the methods um nvidia, has you've shown like great recommendations for things like optimizers like lars and and lamb and nvlan and stuff, like this um nvidia kind of puts things into you know apex or deep learning examples repositories to make them available for for folks to use. I'm just uh sorry.

D

This is overly windy to ask a simple question: what's nvidia's strategy for, like um is nvidia pushing to have things like large lark optimizer or the new nv lam like centrally available in the frameworks like pytorch and tensorflow? I know like lark is right now in apex and lamb is in apex, but I don't think heat land is an apex, and it's only in that repository right.

B

uh Right so I think um the lamp that is in uh apex is actually the the lamp uh version that I mentioned with the tweaks that we made, um I'm not sure. What's the process like to go from apex to uh pie, torch or tensorflow, but I know it with mixed precision: training, for example. It first went into apex because I think that's the um easiest part and then eventually it goes into the framework. So I'm assuming um lamb will as well.

D

D

B

um So a lot of these tricks are a sort of boilerplate. Do you have any experience with libraries that handle these? For you example by dodge lighting? Amp is a good subset, of course um yeah. So I think apex handles some of these tricks. For you, uh um a lot of the optimizers that I spoke about today, as well as amp, is on apex, as well as the distributed data. Parallel that we've discussed um yeah. I think uh uh apex should be good for some of these tricks.

A

Maybe I can so while we have you, it's really good to talk to someone who has done a lot of this in practice. So thank you for answering all the questions, and so, but maybe one more question is about um batch norm or normalization layers um are there I mean there are multiple proposals for how to do this in a distributed, setup right and um are.

B

A

Like certain recommendations that you would make for like how to how to actually do that, um the first thing that you would try, for example, for a vision system for a computer, is like a computer version task.

B

um uh I'm not actually aware of the different types of patch, nor maybe I can look it up in direct.

A

Confidence yeah, I think it's just like this communication of stats. Right, like you, need to all reduce the stats across the batch because there's a different examples: dependence, okay, um uh a different question is like do you? Have you seen any of this? Are you aware of any of these tricks up being applied to graph neural networks for scaling yeah like distributed training, large batch training of graph neural networks.

B

um I'm not aware actually, but I yeah, but that's super interesting. I should look it up.

A

Yeah, okay, so it sounds good yeah. I think. Like a lot of there, we have a lot of applications that are doing that and now we're seeing. Many of these applications have uh extremely large amounts of of data that you know. Training on on 8 gpus on a single node would take days to to do one pass through the data set, so those we definitely want to explore how to scale them, but conversions at scale is still the main question that we have yeah. Okay, thank you.

B

Yeah awesome: okay, thank you so much for having me and all of you for joining the talk.

A

Yeah, thank you so much sweater and uh hopefully we'll meet in person after all of this uh ends yeah. But thank you for for agreeing to do this and for the great lecture and great material.

B

Yeah, thank you. So much have a good day. Everybody.

A