Cloud Native Computing Foundation KubeCon + CloudNativeCon China 2019 (Shanghai), 5 Jul 2019

From YouTube: Large Scale Distributed Deep Learning on Kubernetes Clusters - Yuan Tang, Ant Financial & Yong Tang

Description

The focus of this talk is the deployment of large-scale distributed deep learning with Kubernetes. The use of operators to manage and automate training processes for machine learning is discussed. We share our experiences and compare two open source Kubernetes operators, tf-operator and mpi-operator. Both operators manage training jobs for TensorFlow, but they have different distribution strategies, which lead to different performance results with respect to the utilization ratio among CPU, GPU, and network. Deep learning tasks are both network and GPU intensive, so proper optimization of orchestration is very important. An imbalance can easily lead to idle compute capacity, which is especially expensive for GPU nodes (compared with CPUs). We will share our experiences in the hope of providing helpful insight into better economics for machine learning tasks.

https://sched.co/Nrnn

A

Thanks for coming to this talk. My name is Yong Tang. I'm a maintainer and the SIG IO lead for the TensorFlow project. My co-speaker Yuan is a member of Kubeflow and an approver and reviewer for Kubeflow's MPI operator. He is also a maintainer on several open source projects. Unfortunately, he is not able to attend this talk because of a schedule conflict, so I am going to cover his part as well.

A

In today's talk, I'm going to discuss large-scale distributed deep learning on Kubernetes clusters. I'm going to briefly discuss TensorFlow, discuss Kubernetes operators, and also discuss Kubeflow, which is the main focus of this talk, because the operators we will talk about, the TF operator, the PyTorch operator, and the MPI operator, all belong to Kubeflow.

A

So, let's get started. Before we jump into deep learning orchestration, let me briefly discuss the background of TensorFlow. As many of you already know, TensorFlow is one of the most popular open source frameworks for machine learning. It was originally developed by Google, and it has been open source since 2015.

A

Since then, TensorFlow has seen explosive growth in the open source and machine learning communities. At the moment, the latest stable version of TensorFlow is 1.14, which was released just several days ago, but TensorFlow 2.0 is up and coming and will be released soon. So I want to discuss some of the changes in TensorFlow 2.0, because a lot of things have changed, and those changes have quite an impact on many of the areas we will discuss in this talk.

A

The biggest change is obviously eager execution, compared with how things worked back in TensorFlow 1.x.

A

In 1.x, a static graph was executed, which means that instead of running a program in an imperative way, you have to construct the graph, wait for the graph to be deployed, and only once it has finished do you get the results. This is obviously very good for production deployment, but it also has a disadvantage: it is very hard to debug, because you do not get results immediately.
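The difference between the two execution models can be illustrated with a toy sketch. The `Node` class and the `constant`, `add`, `mul`, and `run` helpers below are hypothetical names written only for this illustration; they are not TensorFlow APIs. The point is simply that in graph style nothing is computed until the whole graph is run, while in eager style every line yields a value that can be inspected right away.

```python
# Toy illustration only: these names are hypothetical, not TensorFlow APIs.

class Node:
    """A deferred operation: it records what to compute but computes nothing yet."""

    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs


def constant(value):
    return Node(lambda: value)


def add(a, b):
    return Node(lambda x, y: x + y, a, b)


def mul(a, b):
    return Node(lambda x, y: x * y, a, b)


def run(node):
    """Evaluate a node by first evaluating everything it depends on."""
    args = [run(dep) for dep in node.inputs]
    return node.op(*args)


# Graph style: build the whole graph first, then run it to get a result.
graph = mul(add(constant(2), constant(3)), constant(4))
graph_result = run(graph)

# Eager style: each line produces a concrete value immediately,
# so intermediate results such as eager_sum can be printed while debugging.
eager_sum = 2 + 3
eager_result = eager_sum * 4

print(graph_result, eager_result)
```

Both styles arrive at the same number; what differs is when the intermediate values become visible.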

A

So in TensorFlow 2.0, the biggest change to the execution model is that eager execution is enabled by default. You can run TensorFlow just like a Python program. Another big change in TensorFlow 2.0 is that Keras is the recommended high-level API, and the other model-building APIs are pretty much dropped. TensorFlow 1.x has been criticized because, for high-level model building, there were several different ways to do the same thing.

A

You can use tf.estimator, you can use tf.keras, you can use tf.layers, and you also have TF-Slim. That is four different ways to build models. In TensorFlow 2.0, tf.keras is the recommended API moving forward, and the other methods will be deprecated pretty soon. If you look at the TensorFlow 2.0 workflow, you will notice that everything starts with data input and pre-processing, which is handled by tf.data.

A

tf.data is a pipeline that takes data from other sources and imports it into the TensorFlow graph. I would say it is the most important step, and it will be one of the focuses of this talk, because data processing is very much tied to the orchestration of deep learning. The second step, after the data has been imported into TensorFlow's graph, is model building. The APIs we talked about here are tf.keras and tf.estimator, and tf.keras is the recommended way. For training, you now have eager execution.

A

You can also use tf.function to define a small predefined graph so that execution can be sped up. Finally, for saving, serving, and inference, you save the model in the SavedModel format and can load it back. So that is the TensorFlow 2.0 workflow. If you look into it, you will notice that data input, pre-processing, and model building are the steps that have to happen before you do training. So people always ask: OK, why?

A

What is so special about data input and model building, and how do you apply them when you have a big cluster? So let's come to the topic of orchestration for deep learning. In deep learning, orchestration has a very special meaning. Most importantly, orchestration for deep learning is always tied to the GPU. As we know, if you want to run deep learning tasks, you first have to deal with I/O, that is, input/output.

A

The data has to be loaded from disk into memory. The second piece is the CPU. The CPU is still needed for deep learning, because not all operations are feasible on a GPU. Some operations are just naturally not a good fit for a GPU, so you still need a CPU for certain operations, and the CPU is typically also the way to interact with I/O.

A

And finally, you have the GPU, where the heavy lifting is done. But if you look into it, the biggest problem is that the GPU is the most expensive piece.

A

When you run your orchestration, you will notice that if the deep learning job is not scheduled well, you are going to waste a lot of GPU, and that is a big waste of your investment. If you look at the diagram at the top of the slide, you can see that you do batch processing. You take the data from the input and pass it through the CPU for pre-processing. The next step is to pass it to the GPU for training and inference.

A

As you can see, until the data has reached the GPU, your GPU is going to be idle, and you waste a lot of resources. At the bottom of the slide is the optimized version of the orchestration and scheduling. In this case, you build the I/O pipeline such that data is delivered to the GPU as soon as possible, and once it is on the GPU, most of the time you can do the training and the I/O in parallel, so that you fully utilize your GPU. That is the efficiency we talk about for orchestration.
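A rough back-of-the-envelope model shows why overlapping I/O with training matters. The sketch below is an assumption-laden simplification, not a measurement: it assumes every batch takes the same fixed time to load and pre-process (`load_s`) and the same fixed time to train (`train_s`), and that the loader can work one batch ahead while the GPU is training. The function names and the numbers are made up for illustration.

```python
def sequential_time(num_batches, load_s, train_s):
    """Load a batch, then train on it, with no overlap between the two stages."""
    return num_batches * (load_s + train_s)


def pipelined_time(num_batches, load_s, train_s):
    """Load the next batch while the GPU trains on the current one."""
    if num_batches == 0:
        return 0.0
    # The first load and the last training step cannot be hidden;
    # in between, each step costs as much as the slower of the two stages.
    return load_s + (num_batches - 1) * max(load_s, train_s) + train_s


def gpu_utilization(total_s, num_batches, train_s):
    """Fraction of wall-clock time the GPU spends training."""
    return num_batches * train_s / total_s


# Hypothetical numbers: 100 batches, 0.25 s to load, 0.75 s to train.
n, load_s, train_s = 100, 0.25, 0.75
t_seq = sequential_time(n, load_s, train_s)
t_pipe = pipelined_time(n, load_s, train_s)

print("sequential:", t_seq, "s, GPU busy", gpu_utilization(t_seq, n, train_s))
print("pipelined: ", t_pipe, "s, GPU busy", gpu_utilization(t_pipe, n, train_s))
```

Under these assumptions the GPU is busy 75% of the time without overlap and nearly all of the time with it. When loading is slower than training, the pipeline is bounded by the loader instead, and the GPU still idles.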

A

There are several different ways to do distributed deep learning orchestration. The most common way is the so-called parameter server model, and in TensorFlow it has been in place since almost the beginning. The parameter server model essentially defines parameter servers, which are a subset of servers dedicated to storing parameters. You also have worker nodes, which are GPU heavy. The workflow for the parameter server model is that each worker node takes a subset of the data and calculates the gradient.

A

The partial gradients are passed to the parameter server. When the parameter server has collected all the gradients, it does the calculation to get the weights, and the weights are fed back to the worker nodes to do the update. This process continues.
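The workflow just described can be sketched in a few lines of plain Python. This is a single-process simulation under simplifying assumptions, not how TensorFlow implements it: there is one parameter server holding a single weight, two workers each holding a shard of a tiny made-up dataset for y = 3x, and the server averages the gradients it receives. The class and function names are hypothetical.

```python
class ParameterServer:
    """Holds the model weight and applies averaged gradients from the workers."""

    def __init__(self, weight, learning_rate):
        self.weight = weight
        self.learning_rate = learning_rate

    def pull(self):
        # Workers fetch the current weight before each step.
        return self.weight

    def push(self, gradients):
        # Once all partial gradients have arrived, average them and update.
        average = sum(gradients) / len(gradients)
        self.weight -= self.learning_rate * average


def worker_gradient(weight, shard):
    """Gradient of mean squared error for the model y = weight * x on one shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


# Made-up data for y = 3x, split into one shard per worker.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]

server = ParameterServer(weight=0.0, learning_rate=0.01)
for step in range(200):
    current = server.pull()
    partial_gradients = [worker_gradient(current, shard) for shard in shards]
    server.push(partial_gradients)

print(server.weight)
```

In a real cluster, every `pull` and `push` is a network transfer between a worker and the parameter server, which is where the communication cost discussed next comes from.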

A

The parameter server implementation has been in place for several years, and people have invested a lot of money and a lot of resources to improve it. But there is some debate about whether the parameter server is the best way to handle deep learning orchestration. Some people have raised a major concern.

A

The argument is that the parameter server model is probably best suited to scenarios where your compute power all sits on the same machine, so that memory is shared by the different GPUs. But in distributed deep learning, your GPUs are scattered around. They could be on different hosts, and they have to talk to a parameter server. When they do, they have to go through network transfer, and a huge amount of data is transferred back and forth between the parameter servers and the worker nodes.

A

That could be a really big bottleneck for your training. But anyway, in the old days in TensorFlow, the parameter server was pretty much the only model you could use for distributed learning.

A

Recently, people have been asking whether we can use some other way to do distributed deep learning. Most people resort to the MPI model.

A

Distributed parallel processing with MPI has been around for a long time, and there are several ways in MPI to do the calculation. MPI has several models for the reduce operation. One is reduce, and another is allreduce. Let me show an example of reduce. So what is reduce?

A

If you look at the graph, it shows you exactly how reduce works. Essentially, each node does part of the calculation. In this graph, you are trying to do a summation. You start with the bottom nodes and do the summation, and at each node you sum up the values from the previous step, moving toward the root until you reach the root node. Shown step by step, it looks like this.

A

So the final step is that you reach the root node, which holds the summation of all the nodes. That is 28. That is reduce.
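The slide itself is not part of this transcript, so the following is a reconstruction under an assumption: a seven-node binary tree whose nodes hold the values 1 through 7, which is consistent with the partial sums and the total of 28 mentioned in the talk. It is a plain Python sketch of a tree-shaped sum reduction, not an MPI program; a real job would call an MPI reduce operation instead.

```python
# Assumed tree layout: leaves 1, 2, 3, 4; inner nodes 5 and 6; root 7.
# Each node's value is the same as its id, purely for illustration.
children = {
    7: [5, 6],
    5: [1, 2],
    6: [3, 4],
    1: [], 2: [], 3: [], 4: [],
}
values = {node: node for node in children}


def tree_reduce(node):
    """Sum a node's own value with the reduced values of its subtrees."""
    return values[node] + sum(tree_reduce(child) for child in children[node])


left_partial = tree_reduce(5)    # 1 + 2 + 5
right_partial = tree_reduce(6)   # 3 + 4 + 6
root_total = tree_reduce(7)      # both partial sums plus the root's own 7

print(left_partial, right_partial, root_total)
```

After the reduction only the root knows the total; the other nodes hold at most a partial sum.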

A

MPI also has another operation called allreduce. So what is allreduce? It is essentially the same as reduce, except that when the summation is finished and has reached the root node, instead of just stopping there, you go back and broadcast the result to every node, so every node knows the result.

A

So, looking at it this way, you do the summation. You start with the bottom nodes: 1 and 2 make 3, and plus 5, that is 8. You do another summation, which updates another node, and with one more summation you reach the root node. Once you reach the root node, the result, 28, is broadcast to every node, so it is going to look like this. That is the so-called allreduce. To sum up, allreduce is reduce plus broadcast.
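"Reduce plus broadcast" can be sketched as an upward pass followed by a downward pass over the same tree. As before, the seven-node layout with values 1 through 7 is an assumption that matches the numbers in the talk, and the function names are hypothetical. Production MPI and NCCL libraries typically use more bandwidth-efficient algorithms (ring-based ones, for example), so this only illustrates the semantics.

```python
# Assumed tree layout: leaves 1, 2, 3, 4; inner nodes 5 and 6; root 7.
children = {
    7: [5, 6],
    5: [1, 2],
    6: [3, 4],
    1: [], 2: [], 3: [], 4: [],
}


def reduce_up(node, local_values):
    """Upward pass: combine a node's value with the sums of its subtrees."""
    return local_values[node] + sum(
        reduce_up(child, local_values) for child in children[node]
    )


def broadcast_down(node, value, received):
    """Downward pass: hand the final value to a node and all of its descendants."""
    received[node] = value
    for child in children[node]:
        broadcast_down(child, value, received)


def allreduce(root, local_values):
    """Reduce to the root, then broadcast the result back to every node."""
    total = reduce_up(root, local_values)
    received = {}
    broadcast_down(root, total, received)
    return received


local_values = {node: node for node in children}
result = allreduce(7, local_values)
print(result)
```

Every node ends up holding 28, which is what a training worker needs: each one can apply the same aggregated gradient without asking a central server for it.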

A

Many people familiar with MPI believe that it is a more efficient way of doing distributed calculations.

A

So let's go back and revisit the parameter server because, like I said, in TensorFlow the parameter server model was for a long time the standard way of doing distributed learning. The debate is that parallelizing on one machine and parallelizing in a cluster may be different things. Many people believe that the parameter server model is only suitable for parallelizing on a single machine and is not suitable for parallelizing tasks in a cluster, and that is somewhat controversial.

A

The biggest burden for the parameter server is that the cross-device communication cost is very high. Unfortunately, a huge effort has been invested in it over the years, so you probably cannot simply dismiss it. I spent a lot of time discussing the parameter server mostly because, if you look at the existing machine learning orchestration frameworks, they assume that you use a parameter server, and that is not necessarily the best approach.

A

That makes choosing the best orchestration framework for deep learning kind of difficult, because whatever framework you choose may not exactly fit the scenario you want to achieve, since a lot of things in this field change very frequently. With many of the frameworks, you will probably notice that they cannot even run with the newest version of TensorFlow. So that is going to be a big issue.

A

If we look into orchestration for deep learning, we notice several things. If we summarize the past diagrams, where we mentioned the parameter server model and the reduce and allreduce models, they all require managing stateful data, and they also require lifecycle management. That leads to the idea that we can probably use Kubernetes for orchestration.

A

When we talk about Kubernetes, we also think it makes sense to have Kubernetes operators for machine learning, because they are a natural fit for managing stateful data and managing a lifecycle.

A

So let's go to Kubernetes operators. In this talk I am going to go through several operators in the Kubeflow organization: the TF operator, the PyTorch operator, and the MPI operator.

A

The PyTorch operator obviously focuses on PyTorch, which is something of a competitor to TensorFlow, but I want to cover that part as well. The TF operator is obviously tied to TensorFlow. It supports TensorFlow, and it also supports TensorFlow's tf.distribute strategies, so this one is slightly better in that respect.

A

tf.distribute is relatively new, and it supports different strategies and backends, such as MPI, NCCL, the parameter server, and the TPU model. The TPU is the chip designed by Google, the Tensor Processing Unit. The PyTorch operator focuses on supporting PyTorch's torch.distributed module.

A

It also has support for MPI and NCCL. Horovod, which was developed by Uber, actually has the widest support for different frameworks, but I want to point out that some of its distribution strategies are probably not exactly a fit for the newer versions of TensorFlow.

A

Here we just show how the TF operator and the MPI operator are deployed in Kubeflow. As you can see, they are very similar. The difference is, first of all, that you have different kinds of jobs, a TFJob and an MPIJob, so the spec is either tfReplicaSpecs or mpiReplicaSpecs. The command line is slightly different, but other than that, they are very much the same. The container image and the other parts are the same.
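The slide with the two manifests is not reproduced in this transcript, so the sketch below is a reconstruction, written as Python dictionaries that mirror the shape of the YAML. Everything specific in it is an assumption: the `apiVersion` shown varies between Kubeflow releases, the image name, job names, script name, and replica counts are made up, and the `pod_template` helper is hypothetical. Check the field names against the CRDs installed in your cluster before relying on them.

```python
def pod_template(container_name, image, command=None):
    """Build a minimal pod template; command is omitted when not given."""
    container = {"name": container_name, "image": image}
    if command is not None:
        container["command"] = command
    return {"spec": {"containers": [container]}}


IMAGE = "example.com/mnist-train:latest"  # hypothetical image, shared by both jobs

# Assumed shape of a TFJob: workers run the training script directly.
tf_job = {
    "apiVersion": "kubeflow.org/v1",  # assumption: differs between releases
    "kind": "TFJob",
    "metadata": {"name": "mnist-tf"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "template": pod_template(
                    "tensorflow", IMAGE, ["python", "train.py"]
                ),
            },
        },
    },
}

# Assumed shape of an MPIJob: a launcher starts the workers through mpirun.
mpi_job = {
    "apiVersion": "kubeflow.org/v1",  # assumption: differs between releases
    "kind": "MPIJob",
    "metadata": {"name": "mnist-mpi"},
    "spec": {
        "mpiReplicaSpecs": {
            "Launcher": {
                "replicas": 1,
                "template": pod_template(
                    "mpi", IMAGE, ["mpirun", "-np", "2", "python", "train.py"]
                ),
            },
            "Worker": {"replicas": 2, "template": pod_template("mpi", IMAGE)},
        },
    },
}

for job in (tf_job, mpi_job):
    print(job["kind"], "->", list(job["spec"]))
```

Seen side by side, the two jobs differ in `kind`, in the name of the replica spec key, and in the command that starts training, which matches the differences called out in the talk.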

A

So let's look at a TensorFlow program. How do we write a TensorFlow program? You install TensorFlow, and you can also install TensorFlow I/O, which gives you the tf.data Dataset support.

A

The first line, of course, is to load the data into a dataset. That is the first line, where the dataset is created from the MNIST dataset. The dataset then goes through some transformations, which are necessary because you need to do some pre-processing so that your dataset is in the float32 data type. You also use a batch size of 1,000, because batching is the way to help feed your data to the GPU.

A

If the process passes the data one item at a time, it may not be very efficient, because the data transfer from CPU memory to the GPU is also very expensive.
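The effect of batching on the number of transfers can be sketched without TensorFlow. The `batch` generator below is a hypothetical stand-in for what a dataset's batching step does, and the sample count of 2,500 is made up; only the batch size of 1,000 comes from the talk.

```python
from itertools import islice


def batch(iterable, batch_size):
    """Group items into lists of batch_size; the last batch may be smaller."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


samples = range(2500)  # hypothetical number of samples
batches = list(batch(samples, 1000))

# One host-to-device transfer per batch instead of one per sample.
transfers_without_batching = len(samples)
transfers_with_batching = len(batches)

print([len(b) for b in batches])
print(transfers_without_batching, "vs", transfers_with_batching)
```

The same amount of data moves either way, but each transfer carries a fixed overhead, so fewer and larger transfers usually keep the GPU better fed.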

A

Next is the model building. As we mentioned, in TensorFlow 2.0 everything has been consolidated into Keras. So you build a model, and it is going to be a Sequential model. The first layer is a Flatten layer, and then you have a Dense layer, which is a standard layer. The next layer is a Dropout layer. Dropout is a technique to prevent overfitting.

A

Essentially, in this model you define a dropout with a rate of 20%. That means that during each phase of training, every node has a chance of being dropped, and that chance is 20%. This helps prevent overfitting.
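What a 20% dropout does to a layer's outputs can be shown in plain Python. This is a minimal sketch of so-called inverted dropout, the variant in which the surviving activations are scaled up by 1 / (1 - rate) so that the expected sum stays the same; the function name, the seed, and the input of ten thousand ones are chosen only for illustration.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate`; scale the survivors."""
    if not training:
        # At inference time dropout is a no-op.
        return list(activations)
    keep_probability = 1.0 - rate
    return [
        value / keep_probability if rng.random() >= rate else 0.0
        for value in activations
    ]


rng = random.Random(0)            # fixed seed so the run is repeatable
activations = [1.0] * 10_000      # made-up layer output
dropped = dropout(activations, rate=0.2, rng=rng)

dropped_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)

print(round(dropped_fraction, 3), round(mean_after, 3))
```

Roughly one activation in five is zeroed on each training pass, and a different random subset is dropped every time, which discourages the network from depending on any single node.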

A

Overfitting is a problem you normally encounter when you build a neural network. The last layer is another Dense layer. So, as you can see, in TensorFlow 2.0 things are simplified. Once you have built the model, the next line compiles the model. That is one line. After that, model.fit essentially gives you the training.

A

The dataset is passed in there. In the last line, you can see evaluate, which is actually the inference phase. You can do that, or you can optionally skip this step if you only focus on training.

A

The program we have been talking about just builds a model and runs it in TensorFlow. Now we talk about distribution, which of course is the focus of this talk. So how do we do distributed training? There are several strategies. One strategy TensorFlow now has is MirroredStrategy, and it has actually consolidated a lot. With MirroredStrategy, all you need to do is define a scope, "with mirrored_strategy.scope()", and build and compile the model inside it. The distribution is then handled automatically, which seems nice, right?
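Conceptually, a mirrored strategy keeps one copy of the model per device and keeps the copies in sync by all-reducing the gradients at every step. The following plain Python simulation illustrates that idea under simplifying assumptions: two replicas, a single weight, a tiny made-up dataset for y = 2x, and a naive averaging function standing in for the real all-reduce. It does not use or reproduce the tf.distribute API.

```python
def local_gradient(weight, shard):
    """Gradient of mean squared error for y = weight * x on one replica's shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Stand-in for all-reduce: every replica receives the same averaged value."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


# Made-up data for y = 2x, split so that each replica sees a different shard.
data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]

replica_weights = [0.0, 0.0]  # identical (mirrored) copies of the variable
learning_rate = 0.01

for step in range(200):
    gradients = [
        local_gradient(weight, shard)
        for weight, shard in zip(replica_weights, shards)
    ]
    synced = allreduce_mean(gradients)
    # Every replica applies the same update, so the copies never drift apart.
    replica_weights = [
        weight - learning_rate * gradient
        for weight, gradient in zip(replica_weights, synced)
    ]

print(replica_weights)
```

Unlike the parameter server model, there is no central node holding the weights here: each replica owns a full copy, and only gradients travel between devices.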

A

Besides MirroredStrategy, I also want to mention that in the old days, when you did distributed training, you had other tools available to help you, for example Horovod. If you combine TensorFlow with Horovod, it is pretty much similar, except that you have to pass different options and use callbacks to achieve the result. One thing to point out is that this example uses tf.train, and that has actually been deprecated in TensorFlow 2.0 as well.

A

So it is kind of unfortunate. Like I said, there are a lot of changes in TensorFlow 2.0, which makes things a little messy. But that is also a good thing, because it gives you an idea of how fast this area is moving every day. TensorFlow 1.0 was only one or two years ago, and suddenly, in 2.0, a lot of things have changed.

A

Things like the parameter server model and tf.train have been replaced or deprecated. Even tf.estimator, which used to be the default way of building a model for many programs, is no longer the recommended way in 2.0.

A

Let's do a comparison. As you can see, TensorFlow 2.0 and TensorFlow with Horovod are pretty much the same. You build a similar model. When you do the distribution, because it is in Python, a lot of details are abstracted away and hidden, so you can see it is trying to make your life easier.

A

We also want to cover PyTorch plus Horovod. One thing about Horovod is that even if it is not exactly a fit for TensorFlow 2.0, it still has users with MPI and with PyTorch. You can see they behave slightly differently, but they are built in a pretty similar way. That is one of the positive sides of Horovod: it supports different open source frameworks.

A

Let's go back to the TF operator versus the MPI operator. They are very similar as well. As you can see, they abstract away the details. Even the YAML specifications used to run your Kubernetes operators are very similar, except for the last command line, which is slightly different.

A

As we discussed, this field is moving forward every day, and a lot of things change. If you install, for example, TensorFlow, and you use the nightly build, you will probably notice that whatever you are trying to run may not work a month later. Within several months, TensorFlow 2.0 will be released, and there will be further changes. So it is really hard to draw a conclusion.

A

It is hard to say what the best solution to select is at the moment, because whatever it is, you have to make sure your orchestration framework can support the latest stable version of the software. So what is the conclusion of this talk? The conclusion is that if you want to focus on deep learning orchestration, first of all you should focus on fully utilizing the GPU, because that is the essential piece.

A

Otherwise, deep learning orchestration is probably not worth a lot of effort. Another piece is something people tend to debate: do you want a framework that supports all platforms and all frameworks, or do you want a framework that is dedicated to one thing? It is really hard to say. Like I said, initially TensorFlow only had the parameter server for distributed TensorFlow. Later we had the mirrored strategy, and we also had the allreduce strategy introduced in TensorFlow.

A

tf.distribute actually simplifies a lot of things. I think that is something you want to take into consideration when you come up with a solution for so-called machine learning on Kubernetes, because whatever you build may not be feasible after maybe just several months. I see a lot of people smiling and thinking: OK, what exactly can we deliver?

A

Unfortunately, I think that is the status quo for the distributed machine learning field. It is too tied to the deep learning framework, and it is probably hard to come up with a stable solution at the moment. If you come up with a stable solution, then in several months you will probably realize it does not work with the latest version. A lot of people ask: so what exactly are you going to do?

A

I think Kubeflow could be a framework that you can try. There are several reasons. First of all, Kubeflow is very actively developed, which means that when the upstream framework makes a change, Kubeflow will try to catch up as soon as possible.

A

Second, Kubeflow has a very big community. As far as I know, it has a lot of developers actively contributing, and they try to make adjustments as soon as possible. So when an upstream framework like TensorFlow makes a change, Kubeflow tends to catch up. So that is pretty much it for today's talk. Another thing I want to mention is that Kubeflow also has several shared APIs and best practices.

A

They have a common, standardized API spec and a base job controller interface, and they consolidate all of that in one repo.

A

They consolidated the common API spec and the common utility functions into one repo, but they have different operators to deploy different frameworks: we have the TF operator, the PyTorch operator, and also the MPI operator. That is pretty much it for today. Any questions?

A

Regarding model serving, what is your recommended technology?

A

OK, that is another interesting topic. If you are talking about serving models, you have TF Serving.

A

TF Serving is the one that comes out of Google as well, but I think it is also falling behind with 2.0. I think it is not yet fully compatible with 2.0 models, and a lot of things have changed. Remember, in TensorFlow 2.0 you have eager execution as the default behavior. Second, in the past, in 1.x, a lot of frameworks used tf.estimator as the default mode. Now that is being phased out.

A

Of course, TensorFlow still supports tf.estimator, but that does not mean it is a first-class citizen anymore, to be honest. So I can only say a lot of things will change for TF Serving as well. Also, in TensorFlow 2.0, another concept is the so-called tf.function. A tf.function is essentially a small subgraph that helps you optimize. Remember, in 2.0 the static graph has been replaced by eager execution.

A

With eager execution, each and every step is run imperatively, which slows down training dramatically. One way to improve that is to define a small function. A small function is essentially a subgraph, a small graph you can apply, so it gives you the advantage of graph optimization while still giving you the eager execution capability that allows you to debug. So that is tf.function.

A

Let's say you are writing a model. You can write it in eager mode, and then once it has stabilized and you have decided your model is pretty much final, you can convert it into a tf.function to speed it up. So that is another thing that is going to help you. Like I said, a lot of things have changed in 2.0.

A

But I think it is a good change because, like I mentioned, for model building you have Keras consolidated, so you do not need to think about which is the best of the four different ways of building a model. We also have eager execution, which really helps a lot for debugging purposes.

A

Any other questions?

A

OK, thanks, everyone, for coming. Have a wonderful day.