From YouTube: 21 - Scaling Neural Networks Training - Thorsten Kurth
Description
Deep Learning for Science School 2019 - Lawrence Berkeley National Lab
Agenda and talk slides are available at: https://dl4sci-school.lbl.gov/agenda
A: Hi, good morning, and welcome to the last day of the Deep Learning for Science school. I hope you still have the energy to go through another half day. Earlier in the week we talked about training with gradient descent, and about how we do gradient descent stochastically, using small batches. The small batches introduce some noise, and when we try to go beyond that, we run into optimization issues, because the noise is important. A lot of you asked about large batch training, because you want to train faster, since we deal with very large data sets. So most of today's program will be on this. Thorsten will talk about how we do this training on multiple nodes or multiple workers, going all the way to the HPC scale, and he will also talk about large batch training in general: what kinds of problems we run into and how to get around them. Then later today we'll have our hands-on session, actually doing that in practice on the Cori machine. Thorsten is an application readiness engineer at NERSC; his main work is on optimizing codes to run faster on bigger machines.
B: Okay, let's try that. Thanks for the introduction, Mustafa, and I hope you still have energy for the rest of the day. I will talk about scaling deep learning training, and I will also bother you a bit with some HPC background on communication, the complexity of communication algorithms, and on why we care about matrix multiplications in deep learning. You might know this already, but especially when you distribute things, you need to think about it more carefully.

So I will talk about this a bit. My first part is the motivation; it's more like an introduction to why we want to do deep learning at scale. I think most of you are motivated to do that anyway, but in general I think it's good to point out certain things which might help. So I will cover the communication basics, and I will briefly talk about parallelization strategies as well, especially in light of deep learning. Then I will talk about what most of you mentioned: large batch training, and what you can do to get better convergence. We will also talk about things like accuracy improvements, especially when you do distributed learning and have a small local batch size.
Okay, so why do we need to scale deep learning? This is a survey Mustafa conducted in 2018, where we looked at how long typical models need to train. A lot of people, around 60 percent, train for a couple of hours, but we also have scientists who want to train for days or even weeks. You don't want to train your model for two weeks just to learn that it doesn't work very well. The problem scale can also be quite big: the data sets especially can be quite large. About 20% of the folks have a terabyte or bigger, which is quite big, but even beyond that, around 25 percent have data set sizes of about 100 gigabytes.

It was a survey which specifically targeted machine learning, so we just asked: what do you run, what kinds of models, how big is your data set, how long do you train? For the data set size, we asked what the typical training data set size is that you want to train on. Of course, this can include anything; the HPC experiments can be petabytes. They might not want to train on everything, but this bin captures all the rest. And the data sets themselves can be very, very complex even if the number of samples looks modest, because they can contain very high dimensional data.
That means a single sample can easily be of order 50 megabytes or bigger. So you might not have many samples, but the number of bytes you need to load and process is quite large. Also, as you know, models get bigger and more compute-intensive. This slide shows a somewhat outdated model list, but when you look at BERT and these transformer models, for example, they have billions of parameters, and in the end you want to tackle much more complex tasks with them. So you don't want to necessarily restrict yourself to a single node or a single GPU; you want to think about splitting these models up in the future. And this is a plot by OpenAI, a study of how many petaflop/s-days you need to invest to train a model. When you look, for example, at the DeepMind ones, the reinforcement learning systems like AlphaZero, they need a lot, a lot of flops. So you need to parallelize; you cannot do that on a workstation anymore.
Okay, so let me talk very briefly about matrix multiplications and why they are important; it's very easy. When you look at all the deep learning primitives you have, it's basically exactly that. Fully connected layers are the most obvious example: you connect all the input features with all the output features, so this is just a matrix multiplication. You know that when you code it up in TensorFlow, for example, you literally write down that product. But convolutions can also be cast into matrix multiplications. You might not know that, because it's not that obvious; if you do not work with the underlying kernels, you might never see it or be aware of it. What you can do, and this is one method, I don't say it's the most efficient one, is called convolution lowering, or image-to-column (im2col), or the Toeplitz matrix approach. This was for a long time the most common one. There are more modern algorithms now, but under the hood they still basically rely on matrix multiplications, even if those are small ones instead of these big ones here. One drawback is that this approach is not very memory-efficient at first.
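As a hedged illustration of the im2col idea just described, here is a minimal NumPy sketch for a single-channel image with stride 1 and no padding (the function names are mine, not from the slides):

```python
import numpy as np

def im2col(x, k):
    """Lower a single-channel image x of shape (H, W) into a matrix whose
    columns are the flattened k-by-k patches (stride 1, no padding)."""
    H, W = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
    return cols

def conv2d_via_matmul(x, w):
    """2D 'convolution' as used in deep learning (cross-correlation),
    expressed as a single matrix multiplication over the lowered image."""
    k = w.shape[0]
    cols = im2col(x, k)              # (k*k, out_h*out_w)
    out = w.ravel() @ cols           # one GEMM covers all the patches
    return out.reshape(x.shape[0] - k + 1, -1)
```

The memory cost the speaker mentions is visible here: `cols` stores every pixel roughly k*k times, so im2col trades memory for one big, efficient GEMM.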
Lastly, when you look at LSTMs, it's the same thing. You have an input sequence x at time t and you want to produce an output sequence h of t, and you have all these gates here. These are activation functions, and these are element-wise multiplications, but what you essentially have is all these matrix multiplications with the input vector. This x is a feature vector; you act on it with some weights and compute outputs, and you do this a couple of times for a typical LSTM. So that means it's all about matrix multiplications, and in the end you want to make those fast in a distributed setting.
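To make the "gates are matrix multiplications" point concrete, here is a minimal sketch of one LSTM step in NumPy (my own illustrative naming; all four gate pre-activations come from just two GEMMs):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step with input dim d and hidden dim n.
    W: (4n, d), U: (4n, n), b: (4n,). Two matrix multiplications
    produce all four gate pre-activations at once."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:n])          # input gate
    f = sigmoid(z[n:2 * n])      # forget gate
    o = sigmoid(z[2 * n:3 * n])  # output gate
    g = np.tanh(z[3 * n:4 * n])  # candidate cell state
    c_t = f * c_prev + i * g     # element-wise, cheap
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```

Everything except the two GEMMs is element-wise and cheap, which is why distributing an LSTM comes down to distributing matrix multiplications.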
One caveat, though: usually the feature vectors here are very small, so it might not make sense to distribute them. It depends on the size: the way you distribute the training depends on the size of the objects you're dealing with, and I will come to that next.

At first, you don't have to care about this yourself, but it's nevertheless good to know in case you become more interested at some point in looking at the layers underneath, at what is actually going on on the system. This is the very HPC part of the talk now, and if you're not really interested in analyzing the communication behavior of an application, you might not need to pay that much attention, but it's interesting to know about certain things. Okay: communication complexity.
Let's talk about this briefly. Usually in a training setting you have a number of workers, or ranks, or processes, which we call P, and they need to communicate data. Communicating data costs you bandwidth: every packet you send takes a chunk of the bandwidth of the network. You have a latency: the message needs some time to arrive at its destination. And you have some overhead. For example, when you want to communicate something from a GPU, you might have to download the data and then send it off to the interconnect, or the interconnect can grab it directly from the GPU, but you still have some overhead to pack it onto the interconnect and then ship it off from there. And you have a message size s. You can care about three different things.

You can care about runtime. I think this is what most people here care about, because in a practitioner setting you just want to make your communication go fast. You can also think about memory: some of these communication primitives need a lot of additional memory to work.
Memory is usually not a big deal for deep learning. For example, if you have GPUs, you can do the training on the GPUs and all the communication on the CPUs concurrently; the CPU has much more memory, and the objects you want to communicate are much smaller than the whole model with its weights and activations. So this is usually not an issue, but of course, if you are in a setting like distributed edge computing, it might actually be one. The last concern is energy: how much energy does my algorithm consume? There is a difference between static and dynamic energy. Static is basically the baseline of the algorithm, but some algorithms have communication patterns which are more intense than others. That means you get spikes in your energy consumption, and that can be important when you join a company like, I don't know, Facebook, and you want to operate a data center at full throughput but under the power envelope; then you might want to look at that stuff.

Okay, so just to explain how this works: you have a process P0 which wants to send a message to process P2. It packs up the message, which is this overhead o.
Then it shoots off one packet, another packet, another packet. If we do multi-threaded communication, you can split the message up, and every small fraction basically costs one unit of inverse bandwidth. The latency is then the time from when a packet enters the interconnect to when it arrives at the NIC at the destination, and then you need some overhead on the receiving side to take the message, unpack it, and use it. This means the communication complexity, the time it takes to send a single message, is the latency plus two times the overhead (because you need to pack and unpack it) plus the size times the inverse bandwidth: T(s) = l + 2o + s*b. Okay, so that's the communication model. This is for sending a single message to a single worker.
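That cost model is simple enough to write down directly (a sketch; the symbol names follow the formula above, and the numbers in the note below are made up for illustration):

```python
def message_time(s, l, o, b):
    """Time to send one message of s bytes under the simple model from
    the talk: T(s) = l + 2*o + s*b, with latency l, per-end pack/unpack
    overhead o, and inverse bandwidth b (seconds per byte)."""
    return l + 2 * o + s * b
```

For a large message, say `message_time(1e6, 1e-6, 5e-7, 1e-9)`, the bandwidth term s*b dominates; for tiny messages the fixed cost l + 2o dominates, which is exactly why the best collective algorithm depends on the message size.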
Then you can think about what kinds of communication primitives you need, and there are rooted ones and rootless ones. What's the difference? Rooted means something like a broadcast or a gather, where one process sends data to every other process, or one process gathers data from all the other processes. That's rooted because you have one node, one worker, which sticks out. That happens, for example, when you use a parameter server in deep learning: then you have these kinds of rooted communication patterns.

And there are things you can do. The very simple one, which I think is what most deep learning frameworks implemented in the beginning, is a flat tree. It's very simple: you send a message to, or receive a message from, every worker individually, and you can think about how that scales with the number of workers. That is the simplest thing you can do, but it's still important, because it's a building block of a lot of other communication algorithms. Then you can also do trees. The next better thing is a tree with one root node, and if you assume you send messages to the left branch first, you can compute the whole communication complexity by just looking at how long it takes to reach the last node.
The root can basically send a message directly toward node six. What you see here is that the latency scales with the depth of the tree, which is bad, but in terms of message sizes it also scales with the depth of the tree and not with the number of workers, so it's definitely better when you have bigger messages. The idea is that you shoot the messages toward nodes one, two, three, four, five, six: you have to wait until node zero has sent its messages to nodes one, two, three, but you don't need to wait for node six to have received one before sending the next. So one term is the overhead of packing up the P minus one messages plus the bandwidth cost of each message, and the other is the overhead of unpacking on the receiving side at node six. Of course, in a real setting there can be congestion, or some of these paths can be longer than others.

So you basically take the longest path, the maximum latency. For example, a supercomputing system like Cori has an Aries interconnect, which is a dragonfly topology with diameter five. That means you can send a message from one node to any other node in five hops at most, but there are closer connections: within the same chassis you might have only one or two hops.
Which means, of course, that when you have a tree spanning the whole machine, the relevant latency is the latency of that five-hop connection. The key point is that you send a message down this tree, and this subtree doesn't need to wait for that subtree to finish: once the message is sent and a child node receives it, that node can start broadcasting further. This branch runs in parallel to that branch, and every subtree runs in parallel to every other subtree. This is a binary tree, but you can do it with K children, a K-ary tree. One more thing: this one is non-personalized, which means it's like a reduction or a broadcast, so every node gets the same message. You can also think about the personalized one, where every node gets its own message; in that case you need to send more packets down the first branch, because every node needs a different message. Then the communication complexity is a bit bigger, but it's possibly still better than the direct send.
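Under the talk's cost model, the flat tree and the binary tree can be compared directly (a sketch under the stated assumptions: equal links, non-personalized broadcast, and a critical path of one message per tree level):

```python
import math

def message_time(s, l, o, b):
    """T(s) = l + 2*o + s*b, the single-message model from the talk."""
    return l + 2 * o + s * b

def flat_tree_broadcast(p, s, l, o, b):
    """The root sends the same s-byte message to each of the other
    p - 1 workers in turn, so the cost grows linearly in p."""
    return (p - 1) * message_time(s, l, o, b)

def binary_tree_broadcast(p, s, l, o, b):
    """Subtrees forward in parallel, so the critical path is roughly
    one message per level, i.e. about log2(p) messages deep."""
    return math.ceil(math.log2(p)) * message_time(s, l, o, b)
```

With, say, p = 256, the flat tree pays 255 message times while the binary tree pays about 8, which is the "scales with the depth, not the number of workers" point from the slide.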
This is what you use when you do a collective communication: all the nodes communicate collectively, and this assumes you need to send the same data to all the nodes. For example, at the beginning of your training you do a model broadcast: you need to copy the weights to all the nodes, and for that you would use something like this. You basically send your model along this tree so that every node gets the same model, and you need to wait until all the nodes are done. Assuming you send through the left branches first (the right side is mirrored), this is the time it takes for the data to arrive at node six, assuming all the communication links are equally fast.

This is technically just an upper bound; in a real setting you would need to run simulations to get the actual number. But it lets you see which kinds of algorithms are better in different settings. The good thing here is that you don't have this many-to-one pattern, because if you have a lot and lot of nodes doing that, you basically flood the buffers of the interconnect.
What happens then is immediate network congestion: every node is sending to one node, so all the links of that node become totally saturated with data. The message queues in the interconnect receive the messages and put them in a queue, and if the queues run full, it will not accept further messages and will tell the sending side to wait before sending more, and then you get back pressure in the network. So if you have a lot and lot of nodes, this thing will actually break down; that's why these trees are better. This is the personalized variant: personalized means every node gets a different message, which is why you need to send three messages here; one is consumed by this root node, and then it sends one down the tree.

There is one more rooted pattern, the pipeline, which is actually important, and I will say why on the next slide. Recently, I would say two years ago, there was a "breakthrough" in distributed deep learning where they implemented an algorithm for distributed training which is actually based on a pipeline, but in the HPC world this is quite old stuff. The idea, non-personalized or personalized, it doesn't matter, is that you broadcast.
You want to broadcast these messages to all the nodes, and assume every node should get the same thing. What you do is inject a message into node one, and this node passes it on down the chain. For the personalized case, you first send the message destined for the last node, then in the next step the one for the second to last, and so on, and the nodes just pass them down the pipeline. In the end you can compute that the cost is basically just two times the length of the pipeline. And when you close it, so node seven feeds back to node zero, you have a ring. There was this paper, by Baidu I think, where they implemented a ring reduction algorithm, and in the deep learning community the reaction was: "Oh wow, that's awesome." But actually that's a pretty old concept, and it's not very efficient at scale anymore, because it scales with the number of nodes you have in the pipe, with twice that number. Still, if you have a few nodes, you can nicely hide a lot of the communication. So in general this is not a bad algorithm, but you should not use it at scale.
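The Baidu-style ring all-reduce the speaker refers to can be simulated in a few lines (a toy sketch, not an MPI/NCCL implementation: sends are modeled as array assignments, and the 2*(p-1) steps are a reduce-scatter phase followed by an all-gather phase):

```python
import numpy as np

def ring_allreduce(chunks):
    """Toy simulation of a ring all-reduce over p workers. Each worker
    starts with its own array and ends with the elementwise sum after
    2*(p - 1) neighbor-to-neighbor steps."""
    p = len(chunks)
    segs = [np.array_split(np.asarray(c, dtype=float), p) for c in chunks]
    # Reduce-scatter: in step t, worker r sends segment (r - t) mod p to
    # its right neighbor, which adds it to its own copy.
    for t in range(p - 1):
        for r in range(p):
            dst, seg = (r + 1) % p, (r - t) % p
            segs[dst][seg] = segs[dst][seg] + segs[r][seg]
    # Now worker r holds the fully reduced segment (r + 1) mod p.
    # All-gather: pass the completed segments once around the ring.
    for t in range(p - 1):
        for r in range(p):
            dst, seg = (r + 1) % p, (r + 1 - t) % p
            segs[dst][seg] = segs[r][seg].copy()
    return [np.concatenate(s) for s in segs]
```

Each step moves only a 1/p fraction of the data per worker, which is why the ring hides communication well at small p, while its 2*(p-1) step count makes it a poor choice at large scale, as the speaker says.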
For example, assume you have a system where your GPUs are connected in a linear fashion or in a ring: you can easily use that to do the reduction within the box. But once you go out to multiple boxes, it might be inefficient. As I said, it's not a bad algorithm, but for scale it's not very good.

Then there are the rootless patterns. This is a rootless example: everybody gets everything. When you have the ring, technically everybody gets every message if you do it right, and this is the direct send. Assume you have an all-to-all connection, for example a DGX box, which is basically a box with two CPUs and eight NVIDIA GPUs, or, in the more modern DGX-2 version, sixteen GPUs in a box connected in an all-to-all fashion. You can do the direct send there because you have a certain amount of bandwidth between all the GPUs, bidirectional, something like 25 gigabytes per second, so you can shoot off messages to everybody at the same time: everybody sends to everybody and everybody receives from everybody. If you have enough bandwidth, that is awesome, and it just scales with the number of processes. So within a box, that's totally fine.
A more clever algorithm is the butterfly. This is actually how a lot of all-style collectives are implemented if you have a power of two in the number of nodes; if not, it becomes very complicated and I don't want to talk about that. The idea is that you start at the beginning and send to your nearest neighbor: node 0 sends to node 1 and node 1 to node 0, pairwise, and that's the first round. In the next round you send to the next-nearest neighbors, and then to the next group, and so on. The cool thing about this is that it scales with the binary log of P. In this case it's non-personalized, so everybody ends up with the same message, which means it performs an all-reduce in log2(P) time. So this is quite efficient.
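A minimal simulation of the butterfly (recursive-doubling) all-reduce just described, for a power-of-two worker count (a sketch; the XOR partner rule is the standard formulation of the pairwise exchange pattern on the slide):

```python
import numpy as np

def butterfly_allreduce(values):
    """Toy butterfly all-reduce for p = 2**k workers. In each round,
    worker r exchanges its partial sum with partner r XOR step; after
    log2(p) rounds every worker holds the global sum."""
    p = len(values)
    assert p > 0 and p & (p - 1) == 0, "power-of-two worker count only"
    vals = [np.asarray(v, dtype=float) for v in values]
    step = 1
    while step < p:
        # All exchanges in a round happen concurrently, so build the new
        # values from the old ones before replacing them.
        vals = [vals[r] + vals[r ^ step] for r in range(p)]
        step *= 2
    return vals
```

This takes log2(p) rounds instead of the ring's 2*(p-1) steps, at the price of needing good bisection bandwidth, which is the interconnectivity caveat the speaker raises.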
But of course, as you see, you need a lot of interconnectivity here; otherwise you will congest the network. This is a very important algorithm to think about; I think most distributed Fourier transforms, which are like big all-reduces, are implemented with this. There is also a personalized version of it, where the message size grows as you go along the tree. Okay, so one thing to remember: the optimal collective communication depends on the use case. If you are, for example, aiming at the best runtime, you should look at the runtime complexity. All these algorithms also have memory complexities, so you need to take that into account too. They have a memory complexity and an energy complexity; I have not put those on the slide, because I don't care that much about them here. But you should really think about these things when you design a cluster which has to operate under power constraints or memory constraints.

The most important thing is: look in the HPC literature, because a lot of stuff is getting reinvented in the deep learning world that has actually been well known in HPC for decades. There are a lot of fancy algorithms, even fancier ones than I presented, where you combine trees with pipelines, or butterflies with pipelines, things like that.
You can do a lot of crazy things, and if you have a certain communication pattern, or somebody implemented a library you use and you get stuck on something, you should look in the HPC literature; there is a lot of material. At the end of this talk you will get a list of suggested reading about these kinds of things. As I said, the deep learning literature sometimes tries to reinvent the wheel; don't fall into these traps. Most of this stuff is old news.

And one thing I want to say, it may sound dated, but MPI is actually very optimized. When your library makes use of MPI, for example, you can be fairly sure that it implements the most efficient algorithms available, because the MPI that ships with a cluster usually already makes the right choices for you. For example, on an HPC system we have a tuned MPI. It makes use of a lot of hardware features, for example hardware atomics, where you can do accumulations of small messages very efficiently in the network hardware, and it respects topology: it can switch between, say, a butterfly and a tree algorithm when you cross long-distance links with lower connectivity in the network topology, and things like that. These libraries are usually very well optimized for the number of processes, the message size, and the topology, so they make the right choices for you, and I recommend using libraries like that. It doesn't necessarily have to be MPI itself; there are things like NVIDIA's NCCL, for example. When you use GPUs, just use that and let the NVIDIA folks make sure it does a good job, or the corresponding Intel library for a commodity cluster. Okay, so now more specific to deep learning.
We have certain parallelization strategies. There is data parallelism, where every process is running the same model and then you reduce the gradients; I will talk about that later. There is model parallelism, where you have a single model which is split across the ranks. And then there is layer pipelining, where you partition by layer: for example, process one does one chunk of the model, process two the next chunk, so the layers are distributed across the ranks or workers. That is basically what was implemented in Google TensorFlow by default: if you use the with-device scope and put different parts of the model on different GPUs, it will essentially use this kind of parallelism. This is the extreme version of it: you have, let's say, a weight matrix and an input vector, and you put them on rank 0. Then you compute the output, pass it on to node 1, where it's multiplied with another weight matrix, and pass it on to node 2. Of course you won't do this most extreme version; you would have a couple of layers on rank 0, a couple of layers on rank 1, and so on. This is just for illustration. And the good thing about it...
For the backward pass, you need to compute the gradient with respect to the activations, because you need the gradient with respect to the output, and the gradient with respect to the weights. The point is that the weight gradient is what you need to update the weights, and the activation gradient is what you need for your backpropagation. So when you go back, you get the activation gradient from the next node, node 2, you multiply it with the transposed weight matrix, and you send that back toward node 0, where it's dotted with the transposed weight of that node. Basically you just run the whole pipeline backwards; that's all you do. That is for computing the gradients of the activations. For the gradients of the weights, in order to update the weights, you do essentially the same: you take this vector, but this time you don't multiply it with the weights but instead with the activation that came in from the other node. So it's the same communication pattern; you just multiply with a different vector, and then you have the gradient and can incorporate it. There is still no collective communication necessary; you just need to pass these activation gradients along the pipeline. There is one thing, though. First, this is a very simple implementation.
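The data flow just described can be sketched in NumPy for a hypothetical two-rank chain of linear layers (illustrative only: the "sends" are plain assignments; in a real pipeline only the activations and activation gradients would cross rank boundaries, while weight gradients stay local):

```python
import numpy as np

rng = np.random.default_rng(0)
W0 = rng.standard_normal((4, 3))   # owned by rank 0
W1 = rng.standard_normal((2, 4))   # owned by rank 1
x = rng.standard_normal((3, 5))    # input batch, lives on rank 0

# Forward: each rank does a local matmul and "sends" the activation on.
a0 = W0 @ x          # rank 0, then send a0 to rank 1
y = W1 @ a0          # rank 1

# Backward, for the toy loss L = sum(y), so dL/dy is all ones.
dy = np.ones_like(y)
dW1 = dy @ a0.T      # weight gradient: local to rank 1
da0 = W1.T @ dy      # activation gradient, "sent" back to rank 0
dW0 = da0 @ x.T      # weight gradient: local to rank 0
```

Note how the backward pass traverses the same chain in reverse with the transposed weights, exactly as on the slide.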
You can just do it. On the other hand, while you pass a batch down this pipeline, you do not want to wait for it to come all the way back through the backprop before integrating the gradient; that would be fully synchronous training. No, you really want a pipeline: you feed in batch zero here and pass the result on to the next node, and while you do that, you feed in batch one, and so on, so this node works on batch zero while that one works on batch one. And when you look at it, by the time batch zero has gone all the way down and all the way back, you have its gradient, but you have already fed batch five into the system. That means that once you incorporate that gradient into the model, all the gradients still in the pipeline are already outdated.

So you have some kind of asynchronous training, and the deeper your pipeline is, the more problematic it becomes, because these gradients get more and more outdated the deeper you make it. Yes, this is an issue if you go to the extreme: a pipeline of depth one or two is usually fine, but if you make it a thousand nodes long, it will not learn anything, because then you incorporate a gradient that is a thousand steps outdated, and that just doesn't make any sense.
That means you need to chunk up your model, the layers, in such a way that the computation time is almost constant between the chunks, and that is quite tricky; the load balancing here is very hard. This is why I do not recommend it: with two or three GPUs it's fine, but if you go to the extreme, it won't work.

Now for model parallelism: you can, for example, split in the feature dimension of a fully connected layer. What does this look like? You have this weight matrix W, and process 0 owns the upper half while process 1 owns the lower half; the input feature vector x is owned by everybody. One dimension is the number of features, the other is the batch size. You multiply them and get an intermediate result, and then, in order to produce an output feature vector which is shared across all the nodes, you need to gather it, because node 0 needs the results of node 1 and node 1 needs the result of node 0; with more nodes you need to gather the results from all of them. In the language from before, this is a rootless, personalized communication.
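This feature-dimension split can be sketched as follows (a hedged illustration with made-up helper names; each "rank" owns a horizontal slice of W, the matmuls are purely local, and the concatenation plays the role of the all-gather):

```python
import numpy as np

def split_rows(W, p):
    """Give each of p ranks a horizontal slice of the weight matrix."""
    return np.array_split(W, p, axis=0)

def model_parallel_forward(W_shards, x):
    """Each rank computes its slice of the output locally; the
    concatenation stands in for the all-gather across ranks."""
    partials = [Ws @ x for Ws in W_shards]   # local, no communication
    return np.concatenate(partials, axis=0)  # the all-gather step
```

The local matmuls are embarrassingly parallel; the cost is entirely in the gather that rebuilds the shared feature vector, which is the point the speaker makes next.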
B
Okay,
so
this
means
is,
all
gather
is
necessary
for
the
forward
pass,
so
the
forward
pass
is
not
local.
So
for
every
step
in
the
forward
pass,
you
need
communication,
ok,
which
is
bad
because
technically
you
need
to
wait
till
all
the
nodes
are
done
with
this
to
cast
or
gather
to
basically
grab
the
results.
B
Backward
pass
is
similar
so
for
computing,
the
gradients
of
the
weights
that
can
be
done,
total
locally.
So
you
just
have
to
you.
Have
this
the
transposed
inputs?
You
have
to
output
right
because
you
get
at
it
and
then
you
just
do
local
metal
and
get
a
gradient
of
the
weights,
and
you
just
take
your
chunk
to
the
chunk.
You
need
fine
you're
done
so
that's
totally
local,
but
the
gradient
for
the
activations.
That
is
order
for
the
input.
B
— that's actually more tricky, because here, when you transpose the weights and multiply them with the incoming gradient, you get an intermediate result of very low rank — a matrix multiplication over a very small fraction of the data — and you need to all-reduce that across a big group. You need that in order to do the backprop, because this quantity is needed by the previous layer to continue the backpropagation.
B
So that means for the backprop through your network you need to communicate: in the forward pass you need an all-gather, and in the backward pass you need an all-reduce. Only the gradient updates can be done locally, so you cannot overlap these things very nicely. That is quite bad.
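To make the feature-split scheme concrete, here is a toy Python sketch (my illustration, not code from the talk): two simulated workers each own half the rows of W, the forward matvec is purely local, and the concatenation at the end plays the role of the all-gather.

```python
# Toy sketch of model parallelism over the feature dimension: y = W @ x
# split across two "workers" along the rows of W. Each worker multiplies
# its chunk locally; the all-gather then gives everyone the full output.

def matvec(W, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]   # 4 output features, 2 input features
x = [1, 1]                              # input vector, replicated on every worker

W_parts = [W[:2], W[2:]]                # worker 0 owns rows 0-1, worker 1 rows 2-3
partials = [matvec(Wp, x) for Wp in W_parts]   # purely local matmuls

# "All-gather": concatenate the partial outputs so every worker holds y.
y = [v for part in partials for v in part]
assert y == matvec(W, x)                # identical to the unsplit layer
```

The assertion checks that the distributed result matches the single-node layer exactly; in a real run the concatenation would be an `MPI_Allgather` or equivalent collective.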
B
The one thing is: if you have a very large model, you can split it up. But you don't want to split it up too aggressively, because then you run out of parallelism on the node. If you have a matrix which is, say, 1024 by 1024, and you chunk this dimension down to 1024 by, say, 128 —
B
— that means you don't have a lot of parallelism left on the node, so you cannot use your multi-core CPU or your GPU very efficiently, because the matrix multiplications are small and you get a lot of overhead. So you cannot take this too far: you're limited by the model size in how far you can scale out. The good thing is you don't have this growth of the batch size, because the batch size is still the local batch size, right?
B
So you don't need to tweak your hyperparameters: if it works on a single node and you spread the model across nodes like this, you can still use the same parameters and it will just work. But the forward and backward passes become expensive because of, as I said, rootless collective communication, which is quite bad. And especially for the backprop: when you are backpropagating at layer k, you cannot go on to layer k-1 without waiting for this collective to finish, so it's hard to overlap —
B
— communication with computation in this setting. The other thing is that, because you need the full feature vectors on all the nodes, the input is also shared across the nodes. Everybody gets the same input vector, so either everybody reads the same data from the file system, which is kind of bad for the file system, or just one node reads the data and distributes it, which again is another communication step you might want to avoid.
B
Yes, and you need big models for this to pay off. Actually, you also need to store the full activations per rank, which can be quite expensive, because the activations are usually much bigger than the weights. You save on the number of weights you store per node, but you want to keep the activations around, and these are usually much, much bigger: if you have a sparse network like convolutions, the weights are kilobytes, while the activations can easily be a megabyte or more.
B
You basically have the stencil — the filter — moving over the image, and you compute a weighted sum. When you think about it from an HPC perspective, it's a stencil operation, basically like a differential-equation kernel operating on chunks of a data set. So what you could do, if you have a big image: chunk up the input image into domains and then compute the output per domain.
B
The issue there — well, the good thing is you save on storing the whole input vector on every node, because you chunk it up; but you need nearest-neighbor communication. The problem is that the filter has an extent, and when you hit the boundary of your domain, you technically need data points from your neighbor. So you need to do a nearest-neighbor exchange. This is quite common in HPC, where you basically have partial differential equations.
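As an illustration of that nearest-neighbor (halo) exchange — a toy 1-D sketch of mine, not from the slides — two workers each own a chunk of the signal and swap one boundary element before convolving locally:

```python
# Toy sketch of domain decomposition for a 1-D convolution: the input is
# chunked across two workers, and because the 3-wide filter reaches past
# each chunk's edge, the workers first exchange one boundary element with
# their neighbor (a "halo exchange") before convolving locally.

def conv_valid(signal, filt):
    k = len(filt)
    return [sum(filt[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
filt = [1.0, 0.0, -1.0]

left, right = signal[:3], signal[3:]     # each worker owns one chunk

# halo exchange: each side receives one element from its neighbor
left_halo = left + [right[0]]            # worker 0 gets right's first element
right_halo = [left[-1]] + right          # worker 1 gets left's last element

out = conv_valid(left_halo, filt) + conv_valid(right_halo, filt)
assert out == conv_valid(signal, filt)   # same as the undistributed result
```

For a filter of width k the halo is k // 2 elements per side; on a real machine this exchange would be a pair of point-to-point sends between neighboring ranks.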
B
So that can be quite costly, but you can do it if you want. I think this only makes sense if you have a huge input image, like a gigapixel panorama or something; otherwise this won't help you a lot — just as a note. The other thing is you can split up the filters: the number of output filters — sorry, the number of input filters.
B
You can basically split up the number of input filters, and technically also the number of output filters. In general, this kernel G has the output-filter dimension I mentioned, the input-filter dimension, and the height and width of the kernel. You just split this thing up, and different nodes compute different chunks of it, but here too you need an all-reduce in the end — and an all-gather — because every node wants the whole output.
B
So this is not very efficient either, and you don't save much, because these kernels cost almost nothing to store. So, just in case this comes up: some people have thought about it, but I think it's not really feasible. Data parallelism — that's the most important one, because this is what basically all the frameworks do. Model parallelism is very hard to implement framework-wise, I would say. So this is the way it's done today, and it also causes these issues of large-batch training —
B
— which I will talk about after that. So how does this look? What you do is you just distribute the batches among the workers. So when you have an input vector X — this is the global input vector — processor 0 holds all the features, but only a chunk of the whole batch. Then you multiply it with W, and what you see is a local matrix multiplication. That means you don't need any communication for the forward pass — none — which is quite nice.
B
The thing is that all the weight matrices have to be replicated across the workers. That's fine; usually these are not very big. So there's no communication for the forward pass. The backward pass is a bit more tricky, but it has nice features too. First look at this: when you do the backprop, in order to compute the gradient for the previous layer, you basically need to do a local matrix multiplication, this time with the transposed weights.
B
But since the weights are local, that's fine, and the derivative of the activation is also still process-local here, so you can do a local matrix multiplication. This has the following impact: when you backprop through your network and you are at layer k, you do not need to wait for any communication. You can just do a local backprop of this gradient and then go on to layer k-1.
B
You do not need to communicate anything for that. While this is happening, you can compute the weight updates, which do require communication. As you see here, you have the gradient of the activation and the input features, you dot them together like that, and then you need to all-reduce so that everybody has the whole weight gradient. So the only communication required here is for the weight update, which is basically the reduction of the gradients across all the nodes. That's all it is.
B
So this is a pretty nice scheme, actually: the forward pass is completely local, and the backward pass can proceed locally without any communication, except when you want to update the weights — only those need to be communicated. So you have a lot of possibilities to overlap communication with computation here, and the activations are split across ranks, so it reduces the memory footprint as well.
B
You don't need to store all the activations for the full batch, since you split up your batch. The weights get duplicated, so you need them in full on every rank; depending on the size of your model that might be bad, but usually it's not too bad. The batch size grows, though. You can of course say: I have a batch of 256, and then I go to 256 nodes and have a local batch size of 1.
B
You can do that, but then you run out of parallelism on your node, and what you'll see is that your communication overhead grows dramatically while your local parallelism is very low, so this is not efficient. What you usually do is keep a meaningful local batch size — 8, 16, whatever is reasonable performance-wise — when you scale out, and then the global batch size is just the number of workers times the local batch size.
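A minimal sketch of this synchronous data-parallel loop (illustrative Python, simulating the workers in one process; the 1-D least-squares model and the data are made up):

```python
# Toy sketch of synchronous data parallelism: every worker holds a full
# copy of the weights, computes gradients on its own local batch, and the
# gradients are averaged (an all-reduce) before the identical update.

def grad(w, batch):
    # gradient of mean squared error for the 1-D model y = w * x
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
lr = 0.05
local_batches = [                     # global batch of 4 split across 2 workers
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0)],
]

for _ in range(50):
    local_grads = [grad(w, b) for b in local_batches]   # local backprop, no comm
    g = sum(local_grads) / len(local_grads)             # "all-reduce": average
    w -= lr * g                                         # same update on every rank

print(round(w, 3))  # -> 2.0, the slope of the synthetic data
```

Because every rank applies the same averaged gradient, the replicated weights stay bit-identical, which is exactly why only this one reduction per step is needed.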
B
When you do the backprop, you can hide the backprop through the model behind the computation and reduction of the gradients, but you still have to wait for the last gradients to be reduced; in the synchronous setup you really wait until the reduction is done, so the scaling can be problematic.
B
This one, I think, came up a bit at the beginning: you send your gradients to a parameter server. The parameter server keeps track of the model, so it has the latest weights; the workers send their gradients, it incorporates them into the model and sends the model back. Nobody waits for anybody here: when you're ready, you ship it off, and it's very resilient — if a node dies, it dies.
B
Nobody was waiting for that guy; it will just keep working. However, when you think about it, since this is very asynchronous, the gradients the server receives are from different versions of the model all the time. Assume one worker is always fast and all the others are slow: the fast one will contribute a lot of fresh gradients, but all the others might contribute very old ones.
B
You can mitigate that by spreading it out — say, a different parameter server for every layer — but it's still a bottleneck, and you might waste compute resources, because you might want to use that node for training instead. Then, more recently, there's the stale-synchronous update, also called pipelining, and it works like this. Assume you have two independent systems — for example, an accelerator and a host processor — or you have a very powerful interconnect.
B
Then you can do the following. Say you have a lot of additional free compute on your CPU: on a multi-core or many-core CPU you can run the computation on, I don't know, 64 threads, and with the four remaining threads you can of course do other stuff.
B
So what you can do here is this. The idea is that while the workers compute the fresh gradients for the model independently — you can do this locally — they push them into a local queue. While they do that, you pop gradients from a previous step off the queue, reduce them, and incorporate them back into the model. The good thing is that you can basically overlap the forward with the backward pass that way.
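A toy sketch of that pipelined update (my simplification to a single worker and a one-step-stale queue; the quadratic objective is made up):

```python
# Toy sketch of the stale-synchronous ("pipelined") update: fresh gradients
# are pushed into a queue, while the update applied to the model pops the
# gradient from one step earlier. The staleness is a fixed one step, unlike
# the fully asynchronous parameter-server scheme.

from collections import deque

def grad(w):
    return 2 * (w - 3.0)          # gradient of the toy loss (w - 3)^2

w = 0.0
lr = 0.2
queue = deque([grad(w)])          # prime the pipeline with one gradient

for _ in range(40):
    queue.append(grad(w))         # "forward/backward": push a fresh gradient
    g = queue.popleft()           # pop the gradient from the previous step
    w -= lr * g                   # update with a one-step-stale gradient

print(round(w, 4))                # still converges to 3.0, just delayed
```

In a real system the push would happen on the accelerator while the pop-reduce-update runs concurrently on the host threads, which is where the overlap comes from.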
B
The downside is that you all-reduce gradients which are outdated by a couple of steps — it can be one step, it can be more, and it can also be made dynamic — so you technically don't incorporate the freshest gradient. On the other hand, it's not as random as the asynchronous approach, where you contribute gradients of different ages all the time: here they are outdated by one step, by two steps, by three steps, but always by a fixed amount.
B
Okay, so that is actually much better, but it's not very resilient: if one node drops out here, at least this step will fail — it will basically stall on that. So resiliency-wise it doesn't help, but it can smooth out runtime variability. When you have a lot of fluctuations in your network, you can think about making this dynamic: okay, I cannot communicate right now, my network is totally jammed with communication —
B
— let's store a couple more gradients before we continue. [In response to a question:] Every gradient, once it makes it into the queue, gets incorporated. No — you just push them into the queue; you basically have just two gradient buffers, so you have one old buffer —
B
You use the gradients from that, and then you have a new gradient buffer, and once the old ones are incorporated, you copy the new ones over into the old buffer. Equivalently, you can have a queue where you just line them up, and once it's their turn they get incorporated no matter what. But it's not resilient in the sense that if one of your workers dies or is very slow, you have to wait for that guy —
B
— all the time. But as I said, if you have an HPC system — let's say a mature HPC system, in the sense that it has been operated for a while and you understand it — you can count on almost all the nodes being equally fast, so you don't have these big problems. On a commodity cluster, of course, the performance variation can be much bigger, and that can be a problem there.
B
So, as I said, we just considered the synchronous all-reduce as the easiest case. The other cases — basically the pipelined one with the outdated gradients — are similar in spirit, but I will just discuss this one, because it's the one most people will use. So you have a local batch size of B, and the global batch size is just the number of workers times B. Now, when you think about stochastic gradient descent — I think you heard about that in the last week — we have the weights.
B
You compute the derivative of the loss with respect to the weights, average over the batch, and incorporate it back. Stochastic gradient descent is not like conjugate gradient, which is deterministic, but you basically still go in the steepest-descent direction, right?
B
The idea is now: because you have a larger batch, you average over more samples, so you have a more precise gradient. Since this average on the slide here goes over more samples, you can think: okay, I know much better what my actual gradient is. So instead of doing, for example, three steps — step 1, step 2, step 3 — I might think:
B
okay, since I know my direction better, I might just do one big step with three times the size. So I basically increase the learning rate linearly, and hopefully I end up in a position very similar to where I would have ended up with the three smaller steps. And that works out when you look at it: if you do two consecutive steps with batch size B, you can approximately summarize this as doing one step of batch size 2B at once.
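You can check the argument numerically with a toy example (mine, not from the talk): two small-batch SGD steps versus one combined step with a linearly scaled learning rate, on a simple quadratic loss.

```python
# Toy sketch of the linear learning-rate scaling argument: two SGD steps
# with batch size B and rate lr land close to one step with batch size 2B
# and rate 2*lr, as long as the gradient changes little between the steps.

def grad(w, batch):
    return sum(2 * (w - x) for x in batch) / len(batch)  # d/dw of mean (w-x)^2

b1, b2 = [1.0, 2.0], [3.0, 4.0]   # two consecutive small batches
lr, w0 = 0.01, 0.0

# two small steps, batch size B each
w = w0
w -= lr * grad(w, b1)
w -= lr * grad(w, b2)

# one big step, batch size 2B, learning rate scaled linearly
w_big = w0 - 2 * lr * grad(w0, b1 + b2)

print(w, w_big)                   # nearly identical for a small learning rate
assert abs(w - w_big) < 1e-2
```

The gap between the two results is exactly the second-order term the speaker mentions next: it stays small only while the gradient barely changes between the two small steps.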
B
So you scale the learning rate to do an equivalent step. That of course assumes — and that's the approximation here — that the gradients are basically similar, that they are not varying heavily from batch to batch. Of course, if your batch size becomes huge, this assumption breaks down. So that is the problem.
B
So the other idea here is: the gradient covariance scales basically like the learning rate squared over the batch size, which means that when you scale the batch size by a factor of n, you see that if you scale your learning rate by the square root of n, you might get the same noise level as before. So that's the other idea: you think not of the gradient itself but of the noise — the covariance, the correlation between the gradients.
B
So you can try both — these are different approaches, and it really depends on your model and your training; you have to try it out. I think there's this paper from OpenAI where they basically look at the noise and try to determine an optimal learning rate depending on the batch size, and here you see it scales very nicely and then flattens off. That means you hit a point where the noise in your gradient is technically too low for larger batches to help —
B
— beyond that it basically won't work well. But up to a batch size of, say, a hundred here, you can still get a good speedup by just tweaking the learning rate accordingly. On the other hand, you'd technically have to reproduce this plot for every model you run, which is costly — you don't want to make this plot, because by the time you can make it, you have already trained your model.
B
Alright, so that's kind of a chicken-and-egg problem, but maybe there's a way of deriving more general rules for it. So this is why I say: try linear scaling and square-root scaling — that's about it. There's also the observation that in the initial stages the gradients are very random. So even if you average over a large batch, you might think: oh yeah, I know my gradient very well — but that's not true, because since it's very random, you are averaging over a lot of very random quantities.
B
Basically, the larger the batch is, the sharper your minima are. So — the black curve, for example, is the landscape of the training loss function. You see you have a flat minimum here and a very sharp minimum there.
B
Okay, so assume you train your model and you end up in this flat minimum — say you trained with small batches, so your minimizer lands in the flat basin. Then you have the generalization loss — say, the loss on the test set — which is the red curve, and it's a bit shifted: it's not exactly the same, because it's a different set, just another part of the data. So, for example, the optimal minimum will be there, but you are here. So what happens?
B
You will end up evaluating here, which is not that bad. But when you are in a sharp minimum — you trained your model with a large batch size, so you are in this very, very sharp minimum, and the actual minimum you want to be in is shifted somewhere over there — then you end up evaluating it here, and you get a very bad loss, which means the generalization is completely off. So this is basically —
B
— just a conceptual sketch, but this is actually what happens. If you look at the Hessian — the second derivative — you will see that it's very flat around the minimum when you use small batches, and if you average over more gradients per step, i.e. use bigger batches, you'll see that it's very, very sharp around the minimum. I think that's the intuition.
B
No, no — it stays very complicated. It's technically, of course, with respect to the whole dataset; I mean, if your batch size is the whole dataset, you do deterministic gradient descent — you don't do anything stochastic anymore. And then there's also the question of how much of the dataset is actually useful to you, which depends on the complexity of the underlying features in the data you want to learn. You can have a huge dataset, but technically —
B
— the extra samples don't add more information about what you want to learn, so then it doesn't help you either. But determining that is much trickier, because you don't know what the complexity of your dataset is. So that's the trade-off. Yeah, I think the best is just to try — this is really the problem with it; it would be nice to have a more guided way of doing that.
B
They looked at the Hessians — they selected the dominant eigenvalues and eigenvectors of the Hessian — for batch sizes from 64 up to 2048, and here you see that the minima become sharper and sharper. Then you can imagine that when you go to 30,000, this will be very, very sharp, and then you don't generalize anymore: this one generalizes nicely, maybe even that one, but this might be too bad. It also depends on what accuracy you are aiming for.
B
If you want to beat ImageNet records, you have to beat the accuracy of the other folks. But if you have something like a scientific application, you can go and say: okay, we still beat the existing approach by far — maybe a handcrafted decision tree or whatever — and you're still much better than that, but you can train the model very, very quickly, and then you might be happy with that. Of course, if you need high precision, that's tricky.
B
So if you are a company that wants to make money from, for example, natural language processing, the recognition rate has to be extremely high, because people get annoyed: if you only have 98% accuracy, that's really bad — they want 99-point-something. It sounds like a small difference, but for them it really makes the difference between consumers being happy or not. So it depends.
B
These are basically just the top 20 eigenvalues, and as you see, when you go to larger batches, you converge to a much higher spectrum — that basically illustrates the point I want to make. So there are things you can try to do to fix this a bit. At the beginning, they do a linear warm-up, for example, up to the target learning rate, and then decay the learning rate — that's this Facebook paper, training ImageNet in an hour. So — what is the current record?
B
Seventy-seven seconds or something, for the ImageNet training — yeah, something like that. But okay, this was in the past. They show very nicely that when you, for example, do a warm-up and then a learning-rate decay schedule, you can get basically very nice accuracy. So that's something that works. But they also showed — this is the validation error — that it will increase rapidly when you go really beyond.
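A sketch of such a warm-up-then-decay schedule (the epoch counts, milestones, and worker count below are illustrative placeholders, not the paper's exact values):

```python
# Toy sketch of a linearly scaled learning rate with linear warm-up and
# step decay: ramp from the base rate to base_lr * workers over the first
# few epochs, then divide by 10 at fixed milestone epochs.

def learning_rate(epoch, base_lr=0.1, workers=8, warmup_epochs=5,
                  milestones=(30, 60, 80)):
    target = base_lr * workers                 # linear scaling rule
    if epoch < warmup_epochs:                  # linear ramp from base_lr
        frac = epoch / warmup_epochs
        return base_lr + frac * (target - base_lr)
    decay = sum(1 for m in milestones if epoch >= m)
    return target * (0.1 ** decay)             # step decay afterwards

print([round(learning_rate(e), 4) for e in (0, 2, 5, 30, 60, 80)])
```

The warm-up matters because, as discussed above, the early gradients are too noisy to take the full scaled step right away.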
B
So there's another idea: instead of decaying the learning rate, just increase the batch size. So they started with a batch size of 8,000, for example, and then, while they train, instead of decaying the learning rate — since the two have basically a reciprocal relationship — you can say: okay, I just increase my batch size. The idea is: if I'm closer to the right minimum, I might increase the batch size, making the minimum sharper, to get faster convergence.
B
The thing with this is that it's very hard to implement, in the sense that most frameworks don't support it very well, especially in a distributed setting: when you want to change the batch size, you mostly need to dump your model and reload it from a checkpoint. But this is a shortcoming of the frameworks, not a principal issue. And there's also this adaptive batch size scaling developed at Berkeley, which does this more dynamically.
B
So the idea is to use second-order information — basically the curvature of the loss surface at the point where you are — in order to increase or decrease the batch size. I don't want to talk about this very much, but they showed that when you do this adaptive batch size scaling — I think they do some more — it also protects them from some adversarial examples.
B
So this goes up to a 16K batch size, and the same was used for this Sony paper, where they actually trained ImageNet in 224 seconds. But like I said, this is already outdated — this is from last year, I think, and now the record is something like 77 seconds. They try to beat each other on training time while maintaining accuracy — cranking the training time down by a lot.
B
So this is a paper by OpenAI, I think, and they show what I said before: a relationship between the gradient noise and the critical batch size. It basically tells you how the critical batch size is correlated to the gradient noise scale, and it looks pretty linear along that line. And you see that, for example, the reinforcement-learning ones, like Space Invaders or Dota, are pretty far up here, so you can use huge batches for training these.
B
Yeah, so I'm almost through my time here, because I want to get to this other part as well. So this is the training time in hours versus the compute cost: if you want to cut down on training time, you have to invest more compute, and this is basically the frontier where you maintain your accuracy.
B
You have to pay a little bit more in compute, but you get a huge reduction in training time — by an order of magnitude. But once you hit this point, you are basically at the point of diminishing returns: you have to increase your compute budget by a lot just to get a small reduction in training time.
B
So you have to look at these curves — but first you have to map them out, right? Maybe, though, you have a model which is similar to a model for which these curves already exist, and then you can think about what kind of parallelism is reasonable. Okay, so there are other things I wanted to talk about briefly, like batch normalization. As you know, this is where you take an input batch, subtract the mean, divide out the variance, and scale it —
B
— so you basically do an affine transformation on it. It has been shown to help, for whatever reason: I think the initial paper said that it reduces the internal covariate shift, whatever that is; another paper has since shown that this is actually not the case, and that it instead improves the mathematical conditioning of the loss function, which makes it much smoother and easier to train. Still, the point stands: batch normalization decreases the training time and improves your robustness.
B
For example, you can initialize your model with different initialization schemes and still get a good accuracy at the end, and it also improves generalization somewhat — I think this is undisputed. The issue is that in the distributed setting, you technically have to reduce all these tensors: you have to compute these quantities, in theory, over the whole global batch, and that is quite bad — you need a lot of communication overhead for that, especially in the forward pass, and these tensors can be of the size of X.
B
When you look into the paper I reference here, they have some weird update algorithm for how to update these parameters on the fly, but I think it's not correct and there are some typos. So I would just take the global averages of these quantities, and that should be fine. That is one thing you can try, and it seems to work well if your local batch size is big enough — eight or sixteen or something; of course, a local batch size of one doesn't help. The other technique is weight normalization.
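Taking the global averages amounts to this (a toy sketch of mine: local sums plus one simulated all-reduce reproduce the statistics of the full global batch):

```python
# Toy sketch of synchronized batch norm statistics: each worker computes
# local sums and sums of squares over its chunk of the batch, one
# all-reduce adds them up, and every worker derives the same global mean
# and variance -- equivalent to normalizing over the whole global batch.

chunks = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # global batch split over 2 workers

# local, communication-free partial statistics
partial = [(sum(c), sum(x * x for x in c), len(c)) for c in chunks]

# "all-reduce": element-wise sum of the partial (sum, sum_sq, count) triples
s, sq, n = (sum(t[i] for t in partial) for i in range(3))

mean = s / n
var = sq / n - mean ** 2                       # E[x^2] - E[x]^2

flat = [x for c in chunks for x in c]
assert mean == sum(flat) / len(flat)
print(mean, var)                               # global batch mean and variance
```

One all-reduce of three scalars per normalized tensor is much cheaper than gathering the activations themselves, which is why this is the usual trick.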
B
In that case, you work on the weights directly, in the sense that — I mean, it looks like a weird trick, but it works — you split the weights into a direction and a scale. So this is a multi-dimensional direction vector and a scalar scale; it's just a reparametrization, but the trick is that you update the gradients with respect to the scale and the direction separately.
B
So you compute these gradients, and the idea — when you look at it in a different way; I don't want to do the math here — is that the weight-direction updates are approximately orthogonal to the dominant eigenvectors of the gradient covariance matrix. So basically you don't fall into the trap of stepping along that vector; you go perpendicular to it, so you get a much smoother convergence in that sense.
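A toy sketch of the reparametrization and its two gradients (my illustration of the standard weight-normalization formulas, applied to a simple dot-product layer); the final assertion shows the direction gradient is orthogonal to v itself:

```python
# Toy sketch of the weight-normalization reparametrization: the weight
# vector is written as w = g * v / ||v||, a scalar scale g times a
# direction v, and the two get separate gradients via the chain rule
# (shown for a simple dot-product "layer" y = w . x).

import math

def weightnorm_forward(g, v, x):
    norm = math.sqrt(sum(vi * vi for vi in v))
    w = [g * vi / norm for vi in v]
    return sum(wi * xi for wi, xi in zip(w, x)), norm

g, v, x = 2.0, [3.0, 4.0], [1.0, 1.0]
y, norm = weightnorm_forward(g, v, x)

dy = 1.0                                   # upstream gradient on y
dw = [dy * xi for xi in x]                 # gradient w.r.t. the implicit w
dg = sum(dwi * vi / norm for dwi, vi in zip(dw, v))                  # scale grad
dv = [g / norm * (dwi - dg * vi / norm) for dwi, vi in zip(dw, v)]   # direction grad

# the direction gradient is orthogonal to the direction vector v
assert abs(sum(dvi * vi for dvi, vi in zip(dv, v))) < 1e-9
print(y, dg, dv)
```

That built-in orthogonality of the direction update is the geometric property the speaker refers to.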
B
To summarize: this was a bit more behind-the-scenes of what's going on. I talked about communication complexity, how you can parallelize networks — model parallelism and data parallelism, layer pipelining — what you can do when you do not converge well with large batches and, unfortunately, need to choose hyperparameters for that, and how you can use batch norm or similar accuracy-enhancement techniques even at large scale without impacting your communication a lot. So, are there any questions? Then I have some suggested reading you can find on the slides.