From YouTube: Scaling Deep Learning Training - Mustafa Mustafa
Description
SC20 Deep Learning at Scale Tutorial
https://github.com/NERSC/sc20-dl-tutorial/
Hi, I'm Mustafa. I'm a deep learning engineer at NERSC. In this part of the tutorial, I'll cover some concepts in distributed training of deep learning models. After this, we will go to the demo, where we show you how to scale a particular model for a particular problem.
In this talk, I'll cover some training parallelization strategies, then delve deeper into large batch training, the challenges that come with it, and some ways to avoid those problems.
So first, why do we need to parallelize deep learning training at all? The figure on the left shows responses from our users on how long it takes them to train their deep learning models. As you can see, it sometimes takes hours, days, or weeks to train a model, and that is definitely challenging when you're prototyping a new model to solve a new problem. Imagine that every time you need to compile your code, it takes days or weeks to compile and then test it.
That's definitely not an efficient way to develop any code or model, so we need to reduce this time to a reasonable span. Minutes would be ideal; hours are tolerable, sometimes even days, but the usual case definitely should not be days or weeks. The other challenge is that to train the models we're training right now, these large deep learning models, you typically need larger and larger datasets, especially in science.
The other thing with deep learning is that solving complex problems requires bigger models: the more complex the problem, the bigger the model you need to solve it with deep learning. And we have seen an increase in model sizes, especially in depth, for the models deployed to solve vision tasks, for example; the same is true for NLP tasks, and it's the same for scientific problems.
On the right, you can see the computational needs for training some of the major models now on the market, plotted versus year; this was compiled by OpenAI. The increase in the curve is exponential and very steep, so we need to be able to use computational resources efficiently, in parallel, so that we can reduce the time it takes to train one of these models to solve each problem.
So how do we actually parallelize training of a deep learning model? There are different modes. The first one is data parallelism: imagine that you have a model that trains well on a single GPU, for example, or more generally a single worker, which can be a CPU or a GPU.
Other modes of parallelism come in when, for example, your model is so large that it no longer fits on a single GPU or a single worker. In that case, you need to distribute the model itself amongst multiple GPUs, and there are several ways of doing that. One way is layer-wise parallelism, where you take one layer and split it.
A
Let's
say
like
this
is
the
first
layer
that
was
here
like
a
big
square,
a
big
cube
layer
and
you
split
that
layer
itself
amongst
multiple
workers
yeah.
The
cube
here
is
just
the
activations,
but
yeah,
so
essentially
that
the
layer
is
is
split
amongst
multiple
workers,
and
this
way
you
can
split,
you
know
parallel,
you
distribute
the
entire
model
amongst
multiple
gpus.
You
can
do
this
the
same
for
for
the
other
layers
as
well.
Then there is pipeline parallelism, where every layer, or every few layers, sits on a particular GPU or worker, and this creates a pipeline, which is why we call it pipeline parallelism. The most common mode is data parallelism, but model parallelism in general, whether layer-wise or pipelining, is becoming increasingly important as our models become larger and larger, so you will see more of it in practice.
So now, how do you do data parallelism? As I mentioned, we replicate the model itself and then split the data amongst multiple workers. But what do you do beyond that? In deep learning we train using SGD, that is, gradient descent with backpropagation.
A
You
don't
do
anything
differently
from
what
you
do.
Usually
there's,
no
communication
whatsoever.
You
just
take
your
model
here,
it's
just
schematically
as
a
single
matrix.
You
replicate
it
on
the
multiple
workers,
p01
zero
one
two
and
then
you
take
the
data
itself
and
you
split
you
slice
the
data
and
then
each
worker
takes
a
slice
of
the
data
and
produces
its
own
output,
and
that's
it
that
finishes
the
forward
pass.
Each worker does this locally, on its own GPU or CPU. Then, once the gradients are calculated locally, you need to do an all-reduce over those gradients before you update the local weights. That's actually the only place where you do communication during data-parallel training. This is how it works.
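To make that step concrete, here is a minimal sketch of the gradient all-reduce, assuming PyTorch with torch.distributed already initialized (the helper name is mine, not from the talk):

```python
# Average gradients across all data-parallel workers, called after loss.backward().
import torch.distributed as dist

def average_gradients(model):
    """All-reduce each parameter's gradient, then divide by the world size."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum across workers
            p.grad /= world_size                           # sum -> mean
```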
There are pros and cons to this. One of the pros is that the forward pass is completely local and communication only happens during the backward pass. And since backprop proceeds layer by layer, there are a lot of opportunities to overlap the communication with the computation. For example, you calculate the gradients for the last layer, then you start calculating the gradients for the penultimate layer locally, and while you're doing that, you can already be doing the all-reduce over the gradients of the last layer. So you overlap the communication with the computation. This is very important; it's essentially what enables us to scale data-parallel training to a very large number of GPUs.
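For reference, this overlap is what frameworks give you for free. Here is a minimal sketch with PyTorch's DistributedDataParallel, which buckets gradients and overlaps their all-reduce with the rest of the backward pass (the toy model, random data, and torchrun-style launch are assumptions for the example, not part of the talk):

```python
# Data-parallel training sketch with PyTorch DDP: one process per GPU,
# launched e.g. with torchrun. The linear model and random data are toys.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")               # NCCL backend for GPUs
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(128, 10).cuda())          # replicate weights per worker
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(64, 128).cuda()                       # each rank holds its own slice
y = torch.randint(0, 10, (64,)).cuda()

loss = torch.nn.functional.cross_entropy(model(x), y) # purely local forward pass
loss.backward()      # gradient all-reduce overlaps with backprop here
optimizer.step()     # identical averaged update on every worker
```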
Now, one of the cons of data-parallel training is that, essentially, you need to increase the batch size. You can do strong scaling, where you take one batch and split it amongst multiple workers so the global batch size doesn't change, and you can certainly do that, but it has diminishing returns: you can't increase the number of workers beyond certain limits.
First, there's the hard limit of a local batch size of one, but also, at a certain point the local computation becomes so small that the communication becomes the dominant overhead, and then you're essentially not reaping any benefits from parallelization. So instead you do weak scaling, where you increase the batch size.
A
I'll
show
up
some
schematics
a
little
bit
in
a
little
bit,
but
essentially,
if
you're,
initially,
you
have
a
say,
you're
running
on
a
single
worker
with
a
batch
size
of
64
and
you
decide
to
do
10
workers,
then
your
batch
size
becomes
640
and
this
is
weak
scaling.
And
this
achieves
you
pass
through
the
data
much
faster.
However,
there
are
challenges
with
training
with
large
batch
with
large
batches
and
we'll
talk
about
those
in
a
little
bit.
Another thing to note: I implied that when we do the all-reduce, say amongst eight workers, it is a synchronous all-reduce. That means you need to wait for all the workers to finish their local gradient calculations before you can make an update, which in turn means that when you go to very large clusters of workers, you may have more and more stragglers, and those can block the training.
Okay, so let's talk a little bit about large batch training. Just to recap: we are increasing the number of workers, and if we're doing weak scaling, each worker gets its own fresh batch of the same size. So if we go to N workers, the effective batch size becomes N times B.
This, of course, lets you process the data much faster, as we said, but there are challenges in how you tune the SGD parameters to account for this increase in batch size. So let's first remember how SGD works. This is the plain version of SGD: you have your weights, a set of parameters, and you're trying to minimize the loss over your data, or rather over your batch.
You take one step in the direction opposite to the gradient: the gradient points in the direction that increases the loss, and you want to walk in the direction that decreases it. You average the gradients over the batch and multiply by a step size, or what we call the learning rate.
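Written out (my reconstruction of the update rule described here, in standard notation):

\[
w_{t+1} \;=\; w_t \;-\; \frac{\eta}{|B_t|} \sum_{x \in B_t} \nabla_w L(x, w_t)
\]

where \(w_t\) are the weights at step \(t\), \(B_t\) is the mini-batch, and \(\eta\) is the learning rate.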
Now, if you decide to increase the batch size, what do you do to the learning rate? That's the question we need to answer. One way to think about it: if I take three steps with batch size B, and I instead increase the batch size by a factor of three, then maybe I should also increase the learning rate by a factor of three. That's one way to do it, which is linear scaling, and this is how it looks in equations.
You take the update and increase the batch size by a factor of two. Comparing these two equations, B was multiplied by two, so you linearly scale the learning rate so that the total step has the same scale. That's linear scaling.
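In symbols (a sketch of the argument, following the usual derivation of the linear scaling rule): two consecutive steps on small batches are approximately one step on the combined batch with twice the learning rate,

\[
w_{t+2} \;=\; w_t - \eta \left[ g(B_1; w_t) + g(B_2; w_{t+1}) \right] \;\approx\; w_t - 2\eta \, g(B_1 \cup B_2; w_t),
\]

where \(g(B; w)\) denotes the batch-averaged gradient.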
One assumption here is that the gradient at w_t and the gradient at w_{t+1} are close enough to each other that you can make this comparison and this reduction. That assumption sometimes breaks, as we will see in a little bit, but generally this is the intuition behind linear scaling: you scale the learning rate in such a way that this factor stays constant.
Actually, it's not necessarily important to keep that factor constant; what is important is to keep the noise in the gradient about the same. If you look at the noise of the gradients, that is, the covariance matrix of the gradient estimate, you'd see that the diagonal, for example, is proportional to eta squared divided by B, the learning rate squared over the batch size. If you want to keep this gradient noise scale constant, or approximately the same, then when you go to N times B you need to scale the learning rate by the square root of N: squaring sqrt(N) times eta gives N times eta squared, and that factor of N cancels against the N in the batch size. So that's another way of doing it, square-root scaling.
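In symbols (my reconstruction of that argument): if the noise scale goes as \(\eta^2 / B\), then

\[
B \to NB, \qquad \eta \to \sqrt{N}\,\eta \quad\Longrightarrow\quad \frac{(\sqrt{N}\,\eta)^2}{NB} \;=\; \frac{\eta^2}{B},
\]

so the gradient noise stays the same under square-root scaling.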
Now, in practice, we actually see anywhere from sub-square-root scaling of the learning rate to linear scaling. For example, these are two works where people scale ResNet training: in one of them it's square-root scaling, while in Goyal et al., one of the seminal papers here, they do linear scaling.
We'll talk a little bit more about this in a second. There's also a study by OpenAI, actually no longer recent, where they look at this purely from an optimization perspective, not a generalization perspective: the picture of doing local optimization, asking what the best learning rate is as you increase the batch size, in order to minimize the loss function. They see that the scaling actually depends on the batch size: the optimal learning rate depends on the batch size, and when your batch size is small, linear scaling may make more sense.
But when the batch size is very large, the scaling of the learning rate might be closer to a square-root sort of regime. That's at least some theoretical analysis; it comes with a lot of caveats, but it motivates these different scalings of the learning rate.
So, coming back to the challenges with scaling the learning rate: say we decide to train with multiple workers and we have a much larger batch size. One of the challenges is this: suppose you scale the learning rate linearly, and you were training on a single GPU and all of a sudden
you want to train with 100 GPUs. If you multiply the learning rate by 100, then the assumption that the gradient at w_t is very close to the gradient at w_{t+1} breaks; that's no longer the case, especially at the very beginning of training, when the loss surface is still not very smooth and you're starting from random weights.
The loss surface is not very smooth, and if you scale the learning rate by a factor of 100, you're taking very large steps on a surface that is very unsmooth. This essentially makes the training completely unstable, and everything goes haywire.
In a second I'll talk about how to get around this. But first, another issue with training with a large batch size: this is an example of training with a batch size of 512 compared to training with a batch size of eight thousand, and it seems that the large-batch models don't generalize well. So you have something called the generalization gap.
This is different from the generalization gap we usually talk about, the difference between the training loss and the validation loss; this is a generalization gap between training at different batch sizes. It seems that training with a larger batch doesn't achieve the same generalization that you would get from a smaller batch size, and there are motivations for why that is the case.
One of them is that, essentially, the minima you find when you're training with a large batch size tend to be sharp minima, like the one sketched here. With these very large batch sizes the noise in the gradients is much smaller, and that lets you just drop into the nearest sharp minimum, without enough noise to kick you back out of it.
So this is the intuition behind it. People have done a lot of studies of this; you can see, for example, the paper by Yao et al. showing this effect.
So how do you get around these issues: first the instability at the beginning of training, and then this generalization gap? One of the first works to show that ResNet training can be scaled to a very large batch size, in this case eight thousand, used the idea of learning rate warm-up; I'm not sure whether they introduced it, but they used it.
With learning rate warm-up, instead of immediately starting with your target learning rate, say 10 times your original learning rate if you're scaling by a factor of 10 with linear scaling, you ramp the learning rate up over a few epochs first. If you started directly at the target rate, then, as we said, the loss surface is not yet smooth and you get a lot of instabilities; warming up avoids that, and that's what they do.
So they warm up the learning rate from the original learning rate all the way to their target learning rate over five epochs. The other thing they did was show that linear scaling seems to work for this particular problem. The paper also goes through a few other subtleties that are common in implementations of distributed training.
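As an illustration (a minimal sketch, not the paper's exact recipe; the model, base rate, and scaling factor are placeholder values): a linear warm-up from the base learning rate to the linearly scaled target over the first five epochs, using PyTorch's LambdaLR.

```python
# Linear learning-rate warm-up sketch (PyTorch). Hypothetical values:
# base_lr is the single-GPU rate, n_workers the data-parallel scaling factor.
import torch

base_lr, n_workers, warmup_epochs = 0.1, 10, 5
target_lr = base_lr * n_workers  # linear scaling rule

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=target_lr)

def warmup_factor(epoch: int) -> float:
    """Ramp the LR multiplier from base_lr/target_lr up to 1.0 over warmup_epochs."""
    if epoch >= warmup_epochs:
        return 1.0
    start = base_lr / target_lr
    return start + (1.0 - start) * epoch / warmup_epochs

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_factor)

for epoch in range(90):
    # ... run one training epoch here ...
    scheduler.step()  # sets lr = target_lr * warmup_factor(epoch + 1)
```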
After they fixed this, they showed that they essentially close the generalization gap between, for example, a batch size of 256 and a batch size of 8,000. This seems to work for different problems, but it doesn't necessarily work at all scales: as you can see in the original paper, it works up to a batch size of 8,000, but if you want to go to a batch size of, say, 32,000 or larger, it no longer necessarily holds.
Another idea is, instead of increasing the learning rate, to gradually increase the batch size itself.
The basic idea is that at the beginning of training you're still in a region with a lot of sharp minima, and in that region you use a smaller batch size; then, as training progresses, you start increasing the batch size while keeping the learning rate fixed, and that should get you to areas where you get similar performance to training with a single small batch size.
This idea is related to learning rate decay: we usually decay the learning rate gradually as we train, and the proposal here is that instead of decaying the learning rate, you can increase the batch size. In this work, they take that idea and combine it with another one, based on empirical studies showing that the loss surface is
less flat when your batch size is larger. This means the curvature of the loss surface can be a good indicator that now is a good time to increase the batch size during training. So they combine these two ideas: they introduce a measure of the loss surface curvature and use it to adaptively, automatically increase the batch size while training, and they show that this works.
They show that this works really well: instead of predetermining the points where you increase the batch size, like the points marked here, you can let your calculation of the loss surface curvature determine when it's a good point to increase the batch size.
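Here is a minimal sketch of the simpler, predetermined variant of this idea (the milestone epochs, growth factor, and toy dataset are hypothetical choices of mine, not from the talk): grow the DataLoader batch size at fixed epochs while the learning rate stays fixed.

```python
# Batch-size schedule sketch: grow the batch instead of decaying the LR.
# Milestone epochs and growth factor are hypothetical illustration values.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10000, 128), torch.randint(0, 10, (10000,)))
milestones, factor, batch_size = {30, 60, 80}, 2, 64

for epoch in range(90):
    if epoch in milestones:
        batch_size *= factor  # plays the role a LR-decay step normally would
    # Rebuild the loader so this epoch iterates with the new batch size.
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for x, y in loader:
        pass  # forward/backward/update with a fixed learning rate here
```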
There are multiple other innovations for handling these large-batch training problems; here's one paper, and here's another. This one, for example, trains a ResNet-50 in 74 seconds, compared to, I think, the paper where this started, where training with a batch size of 256 took,
if I remember correctly, about 10 days; now we're talking about 74 seconds to train the same network on the same dataset, ImageNet. So I didn't cover everything that could be said about large batch training; what I covered are the basic concepts: what the challenges are, and what you need to do and think about.
In the demo, we're just going to show you how to scale the training from a single GPU to four or eight GPUs, and at this level of scale you don't necessarily need techniques like LARS or LARC; we also skip them for shortness of time. Before I close this talk, I want to mention some works by OpenAI and Google: investigations essentially
trying to understand the relationship between the batch size and performance, or between the batch size and the other parameters in play, like the learning rate and the gradient noise. I'm not going to get into the details now, but I just want to point out that they essentially find a relationship between the gradient noise and a critical batch size.
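For reference (my paraphrase of the OpenAI noise-scale result; the notation comes from their paper, not the talk): they estimate the critical batch size with a simple noise scale, roughly

\[
B_{\text{crit}} \;\approx\; B_{\text{simple}} \;=\; \frac{\operatorname{tr}(\Sigma)}{|G|^2},
\]

the trace of the per-example gradient covariance over the squared norm of the true gradient, so noisier gradients imply a larger useful batch.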
Going beyond that batch size hits a point of diminishing returns, so you shouldn't be training with batches larger than that. The other thing they notice is that the more challenging and complex the problem, the larger this intrinsic property, the gradient noise, that you get from your data;
the more complex the dataset itself, the larger this gradient noise, and therefore the larger the critical batch size that you can actually use. The reason I'm mentioning this is that it's an important idea: we are hoping that deep learning will be able to solve more and more complex problems, especially in science, and from these studies
we understand that for those more complex problems the outlook is actually promising, because we can use larger batch sizes to train those models, which means we can train them faster. So I see this as great news. I encourage you to look at these papers and also at their blog post. So, to wrap up before we move to the demo:
we talked about distributed training and the different strategies for it, we focused on data parallelism, and then we talked about large batch training and how it can be unstable and also fail to generalize well. We talked about scaling to modest regimes, say by a factor of 10, from a single worker or a single GPU to 10 GPUs.
In that setting, the first thing I would try is learning rate warm-up with linear or sub-linear learning rate scaling. That is the regime where you would want to try this; for these modest scales, these techniques seem to work.