Cloud Native Computing Foundation KubeCon + CloudNativeCon Europe 2019 (Barcelona), 23 May 2019

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Large Scale Distributed Deep Learning with Kubernetes Operators - Yuan Tang & Yong Tang

Description

Join us for Kubernetes Forums Seoul, Sydney, Bengaluru and Delhi - learn more at kubecon.io

Don't miss KubeCon + CloudNativeCon 2020 events in Amsterdam March 30 - April 2, Shanghai July 28-30 and Boston November 17-20! Learn more at kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects

Large Scale Distributed Deep Learning with Kubernetes Operators - Yuan Tang, Ant Financial & Yong Tang, MobileIron

The focus of this talk is the usage of Kubernetes operators to manage and automate training process for machine learning tasks. Two open source Kubernetes operators, tf-operator and mpi-operator, will be discussed. Both operators manage training jobs for TensorFlow but they have different distribution strategies. The tf-operator fits the parameter server distribution strategy which has a centralized parameter server for coordination. The mpi-operator, on the other hand, utilize MPI allreduce primitive implementation. While the parameter server strategy requires a right ratio of CPU (for parameter servers) and GPU (for workers) to reach network-optimal, the all reduce distribution might be easier to optimize network cost. We will share our performance numbers in out talk for comparison of those two operators.

https://sched.co/MPaT

A

Okay good afternoon, thanks for coming to this talk, my name is Jung Hyung I'm, an Seagal lead and the maintainer of tensorflow project I work for more behind, which is a company in saline valley. He sees again tongue, he is member of cookie flow and the involving several several kubernetes operators. Intensive in cookie flow arc is also maintainer of several machine learning frameworks such as tensorflow MX, net and XE boost he worked for and financial at the moment. In today's talk, our focus is to discuss about large-scale distributive learning with kubernetes operators.

A

We will briefly cover about cancer. Intensive law have some concept, incriminates operators and also come some concept in cookie flow.

A

So before we get started, I would like to briefly discuss about motivation to have a tradition for deep learning. Many people may ask the question of why the planning is so special in our tradition. There are several reasons for that. In traditional, continual tradition, your typical ad, always state least continuous. Those dailies continuous, are typically web applications for web applications. You can easily scale up and down container without any restrictions for deep learning. However, there is one big factor: the big factor is the tripi of involvement.

A

As we all know, machine learning relies on cheaply or nowadays without gvo. Your machine learning will will not be running efficiently, but gpo is expensive. That means you have to worry about the ratio of your GPU, CPU and I/o.

A

If you look at this graph, you can see if you, if your GPU is not a properly pipeline, you can waste a lot of time waiting for the data to be transferred from aisle to cpiotto GPO and that's great waste of resources, and the problem is the GPO is expensive, so you want to make sure your investment has been fully recovered when you do the deep learning, if you see the tag at the bottom of diagram, as you can see, with a proper pipeline, you could efficiently utilize your resources specially for expensive G feels.

A

So that's why the deep learning has a special needs of our tradition and it's very different from the dailies, continuous, like web applications.

A

Another reason for deep learning to have a proper uh tradition is so called distribution strategy, for were the years people had been proposing different ways of doing distributive learning and the most successful one in sofa is a parameter server model in parameter server model in a cluster. You typically have two types of nodes: the parameter server knows and the worker node. You typically have one with several parameters: server. You also have a fleet of worker node to the other gpo intensive work for parameter server.

A

It's a central place for you to update the parameters like in the diagram. If you could look at diagram, the workflow for distributive learning with parameter server model is done in this way. Worker node will first do the calculation based on a subset of the data.

A

The partial gradient you calculated will be pushed to parameter server and the parameter server will grab all the gradient from different worker node. Do the calculation to get a new weight? The new weight will be fully distributed to working out again to wait for the next round of update. So that's a whole lifecycle of a permanent server model.

A

As we mentioned, a parameter server has been a successful model in the past, but recently people proposed and the discuss about if there are any other ways of doing distributive learning the most promising one is so-called our reduce tragedy done in the past by MPI community.

A

The our reduce treachery essentially is not a new concept, but it's actually only have only been used in deep learning community recently before we discuss about our retail strategy. I want to discuss about reduce strategy for the council. Reduce is very forward. You'll have a several note. You let's say you want to do a summation of different numbers stored on each node. So how do you? How do you do that? So? Essentially, you can just aggregate data on different node and the researchers tab.

A

You'll do a partial submission until you reach the root node, which will give you the final answer. So that's the so-called reduce strategy.

A

For MPI the concept of our reduced strategy, it's actually similar. The only difference is that when you, after you get the summation from from the root node, you don't stop here. You do additional broadcast to to make sure the number is available on all nodes. So it's like this way. It was a summation, each node, you reach the root node to catch a number and the. Finally, you do the broadcast to reach to different nodes so that every node will get a number.

A

So to summarize, our reduce is a reduce plus broadcast model.

A

So one question people costing us is if a permit seller has been using in the past of. Why do we need another distribution strategy? There are several reasons for that. So, first of all, parameter server require a centralized location to save the data, and this data has to be CPU. Intensive.

A

A lot of people read the question say her paralyzed, our machine might be different from paralyzing a cluster and the parameter server model might be best fitted to have a module process that running on the same machine, but not necessarily to to perm to the distributor job in a cluster. So it's kind of controversy to say if parameter server is a best approach, but for over the years, especially for the past several years, people has invested lots of effort. There are lots of investment over parameter server.

A

There are different South Wales different packages and frameworks supporting this model, so it's really hard to throw away.

A

So that's why we come here to discuss about the difference, distribution strategy for distributive planning. If we look back, we discuss about our tradition for deep learning. We notice there are several characteristics. First, it has data for metadata. The state of omertà data need to be properly managed in your whole life cycle. We also noticed that we need a traditional framework. Of course we are here in kuvakin, so we also discuss about ways to do it with kubernetes. If we combine album together, the best approach as well, we could see is a kubernetes operators.

A

So, in this talk we're going to introduce several community operators for managing deep learning or tradition. The the operators we are going to discuss are under the cookie flow project. We will discuss several operators, the TF operator, which is operator for managing tension, flows, a parameter server model, the pipe water operator which is used for manage another popular, pretty popular machine learning, framework of pipe watch. And finally, we are going to discuss about ampere operator.

A

The ampere operator here might be misleading because we mentioned by the MPI, but really it's just- it's actually just mapi our radius tragedy operator. So it's actually could be applied to different machine learning frameworks. Now, just for MPI next I'm going to hand over my microphone to my caller to you and he is going to discuss about different operators and how to do that with kubernetes and the cookie flow.

B

So today we will focus on these. Three operators under could be flow, so Kiev operator. Obviously it works with tencel flow, but not other frameworks. It supports distribution strategies or different backends, whether its TF ductus tribute module, for example, MPI, which is the all. We do strategy that we that John talked about and NCR, which is a collective communications library, open source file, Nvidia and then parameter server strategy that you also talked about, and then the TPU strategy, which will make tensorflow work.

B

We are with more efficiently with Google stencil processing units and then Pat Hodge operator. It works with high touch users, can access different distribution strategies via touch that distributed module, which includes backends such as glue, which is also another collective communications library by Facebook and then NPR and NCC? Our last but not least, MPI operator supports all types of MPI based jobs. So, for example, you can run all kinds of jobs that can be executed on openmpi.

B

So, for example, ha road is a framework by it, supports different machine and frameworks such as tensorflow high Taj, IMAX net and Kira's, which can be either the original Kira's or TF that Kira's it you tonight, users can utilize the Harwood distributed optimizer to run MPI based distribution strategy. Note that it's only MPI.

B

Let's take a quick look at the different operators. Here we have TF job, which is, on the left hand, side. Tf job is accustomed resource definition provided by TF operator.

B

So all you need to do is to define your tensile replicas back here we are defining the worker spec, which includes how many replicas are we running for the worker nodes and we define the resources, that's required for the replicas replicated containers here and, for example, here we are using our four GPUs and we execute this tensorflow script here.

B

On the right hand, side this is MPI job custom, with the custom resource definition provided by MPI operator, it's very similar to the TF job definition. We will just defined MPI or replicate replicas back here for the worker replicas. The rest of them are similar, except for this last command to ask to execute inside the container. We run the MPI run command here, and then these Python scripts include is a mixed, tensor flow class horror would program which we'll talk about later.

B

First of all, let's take a look at a quick, simple example using tensorflow to run a basic training locally, so we import tensorflow here and then import tensorflow io, which is the easiest way to import. Different data sets extensions and file system extensions for tensorflow. So here we are importing Emily's dataset using the tensor for IO module, and then we map each image to image these features and labels to this particular float32 type. And then we can take batches of the data set of saba of batch size 1000.

B

Next, we can define the tensile flow model model using T of the Cure's module, and here we are defining a sequential model which is linear, a stack of linear, a linear stack of different layers, deep learning measures.

B

First, we flatten the input into this particular shape, and then we apply a fully connected, dense layer of width, 512 neurons with riilu as a activation function and the next we apply a drop out here, which means we drop 20% of the input or we randomly drop 20% of the input to avoid or reduce overfitting problem, and then we apply another fully connected layer with own interneurons just to get the different classes.

B

So once we finish defining the model, we configure the model for training here using SPICE specifying specifying this particular function, and this quick stochastic gradient descent, optimizer and then we fit the model. We train the model using the data set that we imported earlier and with this many appropriate approaches, and then we last we evaluate that it has. The model using here is just a very simple example, which you use. The data set that we imported earlier to evaluate. How well is the model performing on this particular data set.

B

This is a rod, tensile flow that can be executed locally, but next, let's take a look at. How do we turn this tensor flow program into into a distributed training in distributists settings so same as you as before? We import the data set and the model, and then we initialize a mirrored strategy using TF that distribute module. What this does is it nourished variables to distribute across multiple devices and machines? We then use all reduce to combine the gradients compiled convent ingredients across the devices before applying them to the variables to keep them in sync.

B

So simple, as that, we just define that we we just configure that model with the last function and optimizer as user, but within this mirrored strategy scope.

B

Next, let's take a look at how do we use harwood to run tensor Florine? We all reduce fashion.

B

So we here we import one additional library here, which is under a rule that Kira's robbery, so this harwood program would work with TF Dockers or kill the Akira's so same as usual. We define the model data set, and then we define the optimizer using Rodentia for API.

B

But what's special here is that we then apply a Hollywood distributed optimizer, we'll wrap this previous tensorflow optimizer that we need initialized, and then this will configure the distributed training for us using how rules define distributed strategy.

B

We apply the with an compiled model to configure the training, and then we define this additional callback for this pious model, which is the broadcast global variable callback. This callback basically broadcasts in the initial variable States from length 0 to or other processes. So this will ensure the initialization of variables.

B

The Atkins is consistent across all workers, for example, if the training is studied with random weights or the training is restored from a check of previous saved check point, this will ensure the initialization is consistent. So let's compare these two options. We have on the left hand, side is the raw tensorflow program using TF that distribute module and, on the right hand, side is the harwood plans, tensorflow program that we just saw so this this these boxes highlights the differences.

B

Basically, so if we add those additional eyes, if we switch this mute strategy, scope to this Hollywood distributed optimizer wrapper and then add our callback, then we to switch to use highroad and then we can so, on the left hand, side we can use TF job to run that particular program and on the right hand, side. The Harwood program can be executed using MPI operator.

B

B

And next, let's briefly talk about how do we use Harwood with Python? So so we just do a user, normal height watch program, we import patrasche module and we use Pathology's own data loader too late to load the data set, just as very similar. To previously we use tensorflow alloy to import the data set and we apply.

B

We initialize an optimizer using paid watch optimizer and then similarly, we apply a Hollywood distributed optimizer on top of the head post optimizer, so that we can use the disability strategy defined in Harwood and then, similarly, with the we broadcast all the variables using this horrid. That broadcast parameters note that this function is a valuable under hallowed that touch, but that's the only change you need to make in order to run hide watch using Hollywood, and here we are very similar to the pencil flow class hold program.

B

We use the rank 0 process as a route process, so we can so so we initialize the rest of the processes using the same variable States and then here we do under this nested for-loops.

B

We iterate data and run the hydraulics training here, step by step.

B

Okay, let's recall GF job versus MPI job. These are the only differences we need to check, but we need to make in order to run test flow program under T, repeater or MPI operator, but note that here the API version we want offer to you for MPI job is still a work in progress. This is something we work with the community to to ensure that two different, two different operators, or even more operators such as Patridge, can be the API can be standardized and similar to each other.

B

So all these are the only changes we need to make in order to switch one Operator to another, something we are working on. So these all these commonly all these code. All these reusable components from different operators for is a material operator. High-Touch operator MPI operators are happy to over refactored in this copy flow common repo. This also includes common and standardized API spec and there's also a base job controller in interface, so that new operators can inherit and implement using the utilities.

B

We have industry code and there's also testing utilities in could be follow common. So this is something we are working hard on to make sure the api's are standardized and the components are reusable. So, for example, we are working on a Tantra financial. We are working on an ex to boost operator which we are inherit. Are you actively from this common operator.

A

Sink, that's it for today, I'm just wondering anyone has any questions.

C

Not sure it is supportive use already in parameter service or TPO. So two questions. These are poverty peers, if you support abuse already tense of GPUs are already supported.

A

I think the question is about TP. Oh that's! So, first of all TP, oh it's only. We talk about ten shuffle of you talk about TP, oh and you probably, who could hear TP o aloud from Google guys. Tp o is essentially designed and released by Google, and that's only available no go crowd. So it's it's very much a vendor specific.

A

So if you use I would say if you use tensorflow Uncle cloud, you'll probably can experiment and use TP o, because that's as far as I know, there is a very good support from tensorflow team for to make sure our algorithm available for TP o. But if you'll try to for different reasons, cause the reason or for your company policy reason, you decided to use a different cloud vendor like AWS or if you decide to deploy your own GPS fleet, then TP o is now going to be a no not going to be.

A

The best choice, and also I would like to point out is that at least for tensorflow now, the other algorithm we always run on TP o, but tensorflow team has spent a lot of effort to try to make sure TP o is vertical. Also a first-class citizen and support and the other questions.

D

I, thank you very much for a talk, very technical, really really liked. I was wondering I among the net, the frameworks you mentioned IMAX net and as far as understood, you would support MX, not using gaurav odd for sure, but as far as I can remember and mcsnack has a built in MPI support for distributed. Computation I was wondering if you could integrate natively imax net into cue flow with any specific MX net operator. Maybe.

A

So I think the question: is we discuss about different distribution strategy and discussed about different frameworks? But we probably didn't talk a lot about XML MX net and your question was MX net actually have a support, our distribution strategy building so you're just wondering if this one is something we can see that.

B

It's very similar to 10, suppor and potage, which have their own built-in support for different distribute strategies, but so the answer is totally up to you to whether to use the beauty in this UV support or not.

B

There's also a horror would support horror, would integration with a match net so that you can run Amex not using her own as well under the standardized hallowed MPI stories very easy to change to switch one fluid for food if we move to another, so it's easy so sometimes if you wanna do different pieces on different frameworks, that might be helpful. Maybe.

A

I can add a little bit more comment. You feel having me involved in different machine learning framework. You probably notice that a lot of ways to do the same thing in tensorflow one dot X.

A

If you want to build a model, you can use TF, Harris TF, assuming that you have layers and that you have asleep eventually I would like to say, is something you have the freedom to see like any framework or any way to build your model or beauty or distributed deep learning framework available that, basically, you want to stick with the majority of the users, because that's as far as I know, many frame workers are not truly in production years or now, truly widely use, which actually you may encounter some issues, you feel use them less know less known frameworks or less less yield frameworks.

A

So that's my suggestion. Any any other questions, I guess: I! Guess that's it for today, thanks everyone for coming to this talk have a great day and enjoy your stay in Barcelona, beautiful, Barcelona,.