Cloud Native Computing Foundation KubeCon + CloudNativeCon North America 2019 (San Diego), 22 Nov 2019

From YouTube: Kubernetizing Big Data and ML Workloads at Uber - Mayank Bansal & Min Cai, Uber

Description

Uber relies on Big Data and ML to make business critical decisions such as pricing, trip ETA, etc. Today, those workloads such as Hive and Spark are running on YARN. To save millions of dollars by efficient use of cluster resources, Uber is planning to use Kubernetes to co-locate BigData/ML and micro-service workloads. Kubernetes is the de-facto standard for running micro-services. However, in comparison to YARN, it still lacks many features like hierarchical resource pools, elastic resource sharing, gang scheduling etc. To bridge this gap, we have re-architected Peloton to be a set of Kubernetes scheduler and controller plugins so that we can provide feature parity with YARN. This talk will cover:
- Learnings of running large-scale BigData/ML on Kubernetes with Peloton
- Colocation of mixed workloads
- Federation across zones
- Feature and API parity with YARN

https://sched.co/Uaad

A

Hello, hi everyone. Today I'm going to talk about our journey so far in trying to Kubernetize big data and ML workloads at Uber.

A

As you know, Uber has actually become a pretty fast-growing and big company. We went IPO this year, and the scale is pretty big. We have had about 15 billion trips so far. If you remember last year, when we gave the talk, we were at about 10 billion trips, so in just one year we recorded about 5 billion trips, and we do about 15 million trips per day. Compared to our competitors, we are a global company: we are in six continents, 65 countries, and 700 cities.

A

We have about a hundred million monthly active users and we are serving close to 4 million active drivers. So big data and ML are very important for a company like Uber, because that's how we can potentially become profitable in the future. We are using big data and ML in all different areas, like Uber Eats, ETAs, self-driving cars, and lots of other use cases. I'm going to quickly go through a couple of them. One example is ETA. As you know, whenever you open the Uber app, it can tell you:

A

What's the pickup ETA? What's the arrival ETA? Those ETAs are very important, not only from a customer experience perspective but also from a pricing perspective. So we are using ML models to predict the errors of the route-based ETAs and use that to compensate for the errors. That's actually very important for Uber's business. The next one, which is fundamental to the core of Uber's business, is how we match drivers and riders. In that area there are lots of ML algorithms as well.

A

Next, we also have a new business line, not really new, but a fast-growing business line, which is Uber Eats.

A

We use a lot of ML models for ranking of restaurants and dishes, delivery times, as well as search ranking. Every time you open the Eats homepage, there are more than hundreds of ML models getting called to render the page. Of course, we also have a self-driving car unit. As you know, ML models are fundamental to lots of aspects of the self-driving car, like perception and also 3D mapping.

A

So let me quickly introduce Uber's big data stack. On the right side, we have the tiered data lake. Basically, we have hot storage like hot HDFS and warm HDFS, as well as an archive, and in front of that we have some in-memory databases. On the left side, we have lots of events from mobile apps, microservices, databases, and third-party events, and we basically use Kafka to feed all of those into our data lake. And then we have a compute fabric.

A

Today, it's a combination of YARN plus Peloton on Mesos. If you're interested in that, you can go back to our talk last year. But our plan is to unify all of this to become Kubernetes plus Peloton. On top of the compute fabric, we have our data processing engines, which include streaming as well as batch processing; we're using Flink, Spark, and Tez. On top of that, we have our query engines, which include real-time engines like AthenaX, as well as Hive and Presto.

A

Then, on top of the query engines, we have our data analytics tools like Piper, which is Uber's fork of Airflow, and the dashboards, ad-hoc queries, and BI tools. So, as you can see, the compute fabric is very important to big data, because that's basically the core of the engine, where all the big data is running. Now let's move on to the next one, which is ML.

A

So in ML we have a platform called Michelangelo, where we're specifically trying to platformize the whole ML workflow within Uber, including data preparation, prototyping, training, as well as inference. The idea is that every ML engineer can just do a couple of clicks and specify what the datasets are, what models they want to use, and what algorithm they're going to use, and then we're going to train all the models and deploy the models, everything in an automatic way in this system.

A

As you can see, compute is also very fundamental to that, not only from an ML training perspective but also for inference, or real-time prediction. We also have to support different training engines, like TensorFlow, PyTorch, XGBoost, and Spark ML.

A

So now the question is: why Kubernetes? You could say, okay, you can use YARN or Mesos plus Peloton and that covers the needs, so why would we move to Kubernetes? The first important thing is that there are lots of features and extensions in Kubernetes for mixed workloads. Kubernetes was designed for mixed workloads, right? It basically supports Deployments, Pods, StatefulSets, batch Jobs, as well as DaemonSets. And, most importantly, there is the growing community.

A

Everybody here can see how many attendees KubeCon has; it's growing every year. There is wide adoption, as well as native integrations from open source projects like Spark, and also cloud-native support and flexible extension models. Now the question is: can we just use Kubernetes as is to replace YARN and replace Mesos?

A

That is not really the case. There are a couple of missing pieces in today's Kubernetes, which is not as popular a big data platform as YARN. Today, in large-scale production like Uber's, there are a couple of gaps. One is elastic resource sharing. Basically, Kubernetes doesn't have the YARN queue concept. You have namespaces and you have quota, but it is static: you specify and assign some quota to an organization.

A

If that organization is not using its quota, you cannot really use that quota for other people. There is also gang scheduling for ML workloads, and support for batch and stateless workload colocation. This is not only about putting them together, but also about making sure the batch workload is not interfering with the real-time workload, or real-time microservice workload.

A

If we want to replace YARN, we have to match the high throughput, which is more than 1k pod launches per second; that is what we observe today at Uber. There is also dynamic port allocation, which is important because we don't really use DNS, since that's not going to scale to ten-thousand-node clusters or more. Instead, we have something called Uber Naming Service, which is very similar to BNS from Borg. For that to work, for every container,

A

You have to dynamically allocate the ports.
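As a rough illustration of dynamic port allocation per container, the sketch below hands out host ports from a fixed range and returns them when a container exits. The class name, the port range, and the container IDs are hypothetical; the talk does not describe how Peloton or Uber Naming Service actually implement this.

```python
class HostPortAllocator:
    """Hypothetical per-host allocator; not Peloton's actual implementation."""

    def __init__(self, low=31000, high=31009):
        # Assumed port range. Stored descending so pop() yields the lowest port.
        self.free = list(range(high, low - 1, -1))
        self.by_container = {}

    def allocate(self, container_id, count=1):
        """Reserve `count` free ports for a container and remember them."""
        if count > len(self.free):
            raise RuntimeError("no free ports left on this host")
        ports = [self.free.pop() for _ in range(count)]
        self.by_container[container_id] = ports
        return ports

    def release(self, container_id):
        """Return a container's ports to the free list when it exits."""
        ports = self.by_container.pop(container_id, [])
        self.free.extend(sorted(ports, reverse=True))


allocator = HostPortAllocator()
first = allocator.allocate("container-1", count=2)
second = allocator.allocate("container-2")
allocator.release("container-1")
reused = allocator.allocate("container-3")
```

The allocated host:port pairs would then be what a naming service publishes for each container, instead of relying on DNS.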

A

Before I go into more details, let me quickly introduce Peloton. We had a talk about Peloton last year. It is a unified resource scheduler for mixed workloads, which has integrations with Mesos, and we are building integrations with Kubernetes as well. The current approach, which was actually covered in another talk on Tuesday, is basically to have Kubernetes as a plugin to Peloton, without touching too much of the high-level architecture of Peloton.

A

This has one limitation, which is that we still have to build all the drivers for Peloton. It's okay for microservices, because within Uber we have our own deployment systems, which can use the Peloton API. But for big data and ML it can be a challenge. For example, we actually have a Spark driver for Peloton. That's what we have in production today, but we have to maintain different versions for every Spark version we support, such as 2.1, 2.3, and 2.4. It's a huge maintenance cost.

A

Basically, if we can move to supporting the native Kubernetes API, then that can unlock lots of use cases and also reduce our maintenance overhead.

A

And with Peloton as a plugin, we can also support elastic resource sharing and scheduling, and high throughput with some performance tuning of the Kubernetes system. So let me quickly introduce a concept called a hierarchical resource pool. It's very similar to a YARN queue. The idea is that each organization has a resource pool, and you can assign three parameters to each resource pool: one is the reservation, one is the limit, and another is the share.

A

Basically, the reservation is the minimum guarantee of how much resource you can use, the limit is the max, and the share is the proportion of extra resources you can use if the cluster is idle. Based on these, we calculate the entitlement, which is somewhere between the reservation and the limit. In this Peloton-on-Kubernetes world, we actually model the resource pool as a Kubernetes CRD. So basically you can do a kubectl apply with a resource pool setting, like a resource pool for marketplace, and then you get that resource pool, which specifies

A

what the reservation is for each of the resource dimensions, like CPU and memory. So this is how the Peloton scheduler is designed. The idea is that we basically reuse all the resource pool code, the pod scheduler, and the placement engines from Peloton as they are, but instead of using Mesos underneath, we watch the resource pool CRDs to populate the resource pool tree.

A

And then we have a pod watcher to watch all the pod creations, that is, all the unbound pods, and enqueue them into the resource pool queue; each resource pool has a pending queue. Then, in the background, we calculate the entitlement for each resource pool and admit the pods to the ready queue.

A

Once pods are in the ready queue, we have a placement engine to dequeue all those ready pods and bind them through the API server, and then they get launched through the kubelet. On the other side, we also have a node watcher, which watches all the node information and puts it in a node cache.
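The flow just described (per-pool pending queues, admission against entitlement, a ready queue, and a placement engine reading a node cache) can be sketched roughly as follows. This is a simplified single-resource model with hypothetical function and variable names; it is not Peloton's code, and real binding goes through the Kubernetes API server rather than a dictionary.

```python
from collections import deque


def admit(pending, entitlement, allocated, ready):
    """Move pods from each pool's pending queue to the ready queue
    while the pool stays within its entitlement (single resource: CPU)."""
    for pool, queue in pending.items():
        while queue and allocated.get(pool, 0) + queue[0][1] <= entitlement[pool]:
            pod, cpu = queue.popleft()
            allocated[pool] = allocated.get(pool, 0) + cpu
            ready.append((pod, cpu))


def place(ready, node_cache):
    """Dequeue ready pods and bind each to the first node with enough free CPU.
    Pods that do not fit anywhere are put back on the ready queue."""
    bindings, unplaced = {}, deque()
    while ready:
        pod, cpu = ready.popleft()
        node = next((n for n, free in node_cache.items() if free >= cpu), None)
        if node is None:
            unplaced.append((pod, cpu))
            continue
        node_cache[node] -= cpu
        bindings[pod] = node
    ready.extend(unplaced)
    return bindings


# Illustrative values only.
pending = {"pool1": deque([("a", 10), ("b", 15)]), "pool2": deque([("c", 5)])}
entitlement = {"pool1": 20, "pool2": 40}
allocated, ready = {}, deque()
admit(pending, entitlement, allocated, ready)
node_cache = {"n1": 12, "n2": 4}
bindings = place(ready, node_cache)
```

Here pod "b" stays pending because admitting it would push pool1 past its entitlement, and pod "c" is admitted but waits in the ready queue because no node currently has room for it.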

A

So let me quickly give an example of elastic resource pools. In this example, we have three resource pools. Each of them has a reservation of 20 units and a limit of 100. Initially, for example, resource pool one only has a demand of 20 or so. So basically there are no other pods launched through this resource pool, and we let resource pool two and resource pool three use the additional resources.

A

Presumably, because their demand is much higher, say they have a demand of 80, each of them is getting about 40 units of resources now. In the next time cycle, we could launch more pods in resource pool one. At that moment, we're going to recalculate the entitlement, and then we're going to figure out

A

what allocation each of the three resource pools now deserves, and then we're going to preempt the pods in resource pool two and resource pool three and give those resources back to resource pool one. Next, we're going to talk more about Spark and other applications on top of Kubernetes.
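The elastic sharing example above can be reproduced with a small entitlement calculation. The sketch below is a simplified water-filling model, not Peloton's actual algorithm: it assumes a single resource dimension, a cluster capacity of 100 units (the talk does not state the capacity; 100 is chosen so that two pools end up with 40 each), equal positive shares, and that pool one's demand later rises to 80. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ResourcePool:
    name: str
    reservation: float
    limit: float
    share: float
    demand: float


def entitlements(pools, capacity):
    """Give each pool min(demand, reservation), then split what is left
    in proportion to share among pools that still want more, capped by
    min(demand, limit). Assumes shares are positive."""
    ent = {p.name: min(p.demand, p.reservation) for p in pools}
    remaining = capacity - sum(ent.values())
    while remaining > 1e-9:
        hungry = [p for p in pools if min(p.demand, p.limit) - ent[p.name] > 1e-9]
        if not hungry:
            break
        total_share = sum(p.share for p in hungry)
        handed_out = 0.0
        for p in hungry:
            offer = remaining * p.share / total_share
            grant = min(offer, min(p.demand, p.limit) - ent[p.name])
            ent[p.name] += grant
            handed_out += grant
        remaining -= handed_out
    return ent


def make_pools(demands):
    return [ResourcePool(f"pool{i + 1}", 20, 100, 1, d) for i, d in enumerate(demands)]


# Cycle 1: pool1 only needs its reservation, pool2 and pool3 want 80 each.
before = entitlements(make_pools([20, 80, 80]), capacity=100)
# Cycle 2: pool1's demand grows (assumed to 80), so entitlements even out.
after = entitlements(make_pools([80, 80, 80]), capacity=100)
```

In the first cycle the entitlements are 20, 40, and 40; in the second, each pool is entitled to about a third of the cluster, so pods in pools two and three running above that level become candidates for preemption.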

B

Hey guys. So now we have Peloton running on top of Kubernetes, so let's talk about how we're running Spark and other frameworks on top of it. Let's talk about Spark first. We have Spark running differently in different clusters. You have some Spark running on YARN, some Spark running on Mesos, and we have started to move some Spark onto Peloton as well. There are different challenges for each of these. For running on YARN: currently, we don't have Docker support at Uber. It is available upstream.

B

We haven't used it yet, and we also have a lack of big-container support for YARN at Uber. We have challenges running on Mesos too. Mesos does not have elastic resource sharing, as Min talked about, and the bigger problem is that each Spark job is registered as a framework, which is a scalability bottleneck, right? So you cannot run two to three hundred Spark jobs together on Mesos. That's the reason we came up with Peloton.

B

So right now Spark is running on Peloton, but we have to support all these drivers, which is a big cost, because the APIs in Spark keep changing left and right. You have to keep maintaining the drivers and fixing bugs for each version, because once you have such a big organization, you have different versions running everywhere, all the time, right? So this is a big, big cost. We have been in production with this for two years, and there are six-plus production clusters running Spark. So that's when we thought:

B

Okay, this is not scalable, so let's do something else, and that's why we wanted to run Spark on Kubernetes. There are multiple reasons for running Spark on Kubernetes. Kubernetes is becoming the de facto standard for AI and ML workloads, so we wanted to consolidate everything together. It's very expensive to maintain all these custom drivers, so we wanted to unify all these Spark drivers, ML drivers, and ML frameworks under one resource scheduler, and that gives us two things. One:

B

We will not have this cluster fragmentation. Let's say one cluster is free and another is busy: with separate clusters, we cannot use the free one for the workloads coming in to the busy one. If it is one cluster, we can use the resources appropriately. Secondly, we can now prioritize all these workloads if they are running on one compute platform. You don't need to have a global priority. You can have a local priority for each organization, and then that can be prioritized across your workloads.

B

So these are the advantages of using one resource scheduler. And we definitely wanted to leverage the growing community. We see there is a lot of momentum in Kubernetes and in all the frameworks on top of Kubernetes, so we wanted to use all of that. We can use, off the shelf, all these drivers for distributed TensorFlow, Spark, and Flink, which are already working; we can just use them, right? We don't need to bear the support cost. So these are some of the reasons we wanted to go with Spark on Kubernetes.

B

So this is how we are running Spark on Kubernetes right now. We have the Peloton scheduler. Somebody submits a Spark job with spark-submit to the API server. The Peloton scheduler takes that job, puts it into its hierarchical resource queue, and then admits it based on admission control. It watches the nodes and then binds the pod. Then the Spark driver gets launched by the kubelet on some node.

B

Then the normal Spark scheduler runs in the Spark driver, which goes and talks to the API server to get the executors. Those executors again go to Peloton as the scheduler, go through admission control, scheduling, and all that, right? And then they get launched on another kubelet as Spark executors. The Peloton scheduler can then do all kinds of preemption based on quota management and all that other stuff, right? So this is how Spark is running on Kubernetes right now; we have this Peloton scheduler, which can do all that.

B

There are certain challenges running Spark on Kubernetes right now, and we have solved some of those. First, there is the lack of elastic resource sharing in Kubernetes, which we are complementing with Peloton scheduler resource pools. Second, there is only support for global priority, which we don't want. In batch workloads, a global priority is very hard. You have many organizations running workloads, and you don't want to have a global priority. Kubernetes right now works on a global priority.

B

You need to have all workloads prioritized on a single scale, which is not possible for batch. You probably want each organization to have its own priorities, and this is what we enable through resource pools.

B

Currently, Spark does not support dynamic resource allocation on Kubernetes, because there is no external shuffle service in Spark for Kubernetes right now. We are solving that: we have already written a remote Spark shuffle service, which we will talk about a little later. And then there is a lack of support for secure HDFS, so we are passing Kerberos tokens through Kubernetes secrets.

B

So this is how we are overcoming all these challenges for Spark on Kubernetes right now. Let's look at an overview of how the local shuffle works. Right now, what happens is that the Spark mappers and reducers run, and the mappers write to their local disks. Generally, we have SSD machines; our compute machines are SSD machines. So the local shuffle, or the external shuffle, writes all this data to the local disk. This is how it is laid out on the disk.

B

You have index files and you have data files. Mappers write to the local shuffle service, the data is written into these index files and data files, and each partition is laid out there. When a reducer comes, it goes and talks to each local shuffle service and finds out which partitions it needs. It tells the local shuffle service, "OK, I want these partitions," and then it fetches them, merges them, and gives them to the iterator for the reducer, right?

B

This is how your local shuffle works. There are certain challenges with having this local shuffle service right now. First, if you are using SSD machines and you are doing a lot of writes to the disk, your SSDs wear out. In our data centers, we used to expect to run the SSDs for three years. Those SSDs got worn out in six months, right? So all the disks were bad within six months, because we write so much shuffle data to the disk.

B

There is something called DWPD for SSDs, which stands for drive writes per day, and if you exceed that per day, then the life of your SSD goes down, and this is what is happening in our data centers too. So that's when we thought, okay, we need to write something. You can have better SSDs, but those are not good in terms of economics, right? So you don't want to put in those higher-end SSDs. The second problem is reliability.
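To make the SSD wear arithmetic above concrete, here is a small sketch. It assumes a simple linear model in which lifetime scales inversely with the write rate relative to the rated DWPD; the drive capacity, the rated DWPD of 1.0, and the daily write volume are illustrative assumptions, chosen only so that a drive expected to last three years wears out in six months as described in the talk.

```python
def ssd_lifetime_years(rated_years, rated_dwpd, capacity_tb, written_tb_per_day):
    """Estimate SSD lifetime under a linear wear model (an assumption).

    rated_years / rated_dwpd: the endurance the drive is specified for.
    written_tb_per_day: how much data is actually written each day.
    """
    actual_dwpd = written_tb_per_day / capacity_tb
    return rated_years * rated_dwpd / actual_dwpd


# Illustrative numbers: a 1 TB drive rated for 3 years at 1 DWPD,
# receiving 6 TB of shuffle writes per day.
lifetime = ssd_lifetime_years(rated_years=3, rated_dwpd=1.0,
                              capacity_tb=1.0, written_tb_per_day=6.0)
```

Under these assumptions, writing six times the rated daily volume shortens the three-year lifetime to half a year.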

B

Very often you see that your job got killed because somebody else wrote a lot of data to your disk. This is the noisy neighbor issue, right? People write a lot of data and then certain jobs fail. Every day we have about three thousand jobs failing in our data center, in the YARN cluster. Three thousand jobs fail because of this issue.

B

Further, on Kubernetes, we don't have dynamic allocation. And there is colocation as well: we wanted to write a unified scheduler, so we wanted to co-locate stateless and batch together. However, with so much data written by the batch jobs to the local disk, the disk utilization becomes a hundred percent, and because of that, all the stateless services running on that machine become unresponsive, because you are writing so much data.

B

Your load average on the machine is very high, and because of that, you can't even SSH into those machines. So the stateless services, which are very latency-sensitive, get very much bogged down in terms of throughput, and we can't do colocation on the same machine if you are writing so much data. So we thought this is one of the reasons we should write a remote shuffle service.

B

So this is how we did remote shuffle. There is something called the shuffle manager. It's a part of Spark today, and you can plug different storage into the shuffle manager. You can say local, you can say remote, you can say NFS, you can say HDFS, and you can write your own shuffle manager for that today. We did experiments with those managers: we wrote an HDFS shuffle manager and we wrote an NFS shuffle manager, because we wanted to see what happens if we write remotely instead of writing to local disk.

B

We can write remotely, and we did that, but in the experiments we were 2x or 3x slower, even on NFS, even on HDFS, right? That is because there are a lot of small files written by these mapper tasks, and opening and closing files over the network causes so many issues and so much latency that the whole job becomes 2x or 3x slower.

B

So we took a step back and changed the paradigm for Spark shuffle a little bit. We have a cluster of shuffle machines. The shuffle manager finds out which partition can be written to which shuffle server, and then the data from all the mappers for the same partition goes to one server, and the server writes it to its local SSD sequentially.

B

Because of that, a reducer goes directly to one server and fetches all the records in one shot. It doesn't need to go to each machine, find out where its partitions are, merge them together, and all that. It's all there: go to one server and fetch the file. And these are all streaming, so mappers are streaming to the remote shuffle and reducers are streaming from the remote shuffle.
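The partition-to-server routing described above can be sketched in a few lines. This is an in-memory toy model under stated assumptions: the class and function names are hypothetical, partitions are assigned to servers by a simple modulo rule, and a Python list stands in for the sequentially written file on the server's SSD. It is not the actual remote shuffle service.

```python
import zlib
from collections import defaultdict


class ShuffleServer:
    """Stand-in for one remote shuffle server (hypothetical)."""

    def __init__(self, name):
        self.name = name
        # partition id -> records appended in arrival order (sequential writes)
        self.files = defaultdict(list)

    def append(self, partition, record):
        self.files[partition].append(record)

    def fetch(self, partition):
        return list(self.files.get(partition, []))


def partition_of(key, num_partitions):
    # Deterministic hash so every mapper agrees on the partition of a key.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def server_for(partition, servers):
    # Assumed routing rule: every record of a partition goes to one server.
    return servers[partition % len(servers)]


def run_mapper(records, num_partitions, servers):
    for key, value in records:
        p = partition_of(key, num_partitions)
        server_for(p, servers).append(p, (key, value))


def run_reducer(partition, servers):
    # One fetch from one server; no per-mapper lookups and no merge step.
    return server_for(partition, servers).fetch(partition)


servers = [ShuffleServer("s0"), ShuffleServer("s1")]
run_mapper([("a", 1), ("b", 1)], 4, servers)   # mapper 1
run_mapper([("a", 1), ("c", 1)], 4, servers)   # mapper 2
records_for_a = run_reducer(partition_of("a", 4), servers)
```

Because both mappers route key "a" to the same partition and therefore the same server, the reducer for that partition gets both records from a single fetch.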

B

After doing that, we are actually on par in performance with the local shuffle. So we are doing remote shuffle, but the performance is pretty much the same as what we used to get from the external shuffle. We have been running the remote shuffle for the last three months. All these YARN and Peloton workloads are running on top of it.

B

Currently, thousands of applications are running on it, and job latencies are on par with the external shuffle right now. We are actively enhancing it, working towards onboarding all Spark workloads onto the remote shuffle, and we are trying to open source it.

B

So let's talk about GPUs and deep learning. As Min said, we have a lot of use cases which are using GPUs and deep learning today in our compute stack. We have self-driving vehicles, trip forecasting, fraud detection, and many more deep learning use cases coming up on GPUs. There are certain challenges when running distributed TensorFlow right now. We don't have elastic GPU resource management, which is one of the issues we are complementing with Peloton.

B

We don't have locality- and network-aware placement of tasks. There is the service discovery issue. There is gang scheduling, which is a challenge right now that we are trying to solve through Peloton, as well as failure handling if one of the tasks goes down, right? Let's talk a little bit about gang scheduling. A subset of the tasks in a job can be specified as a gang, right? You can say these are the tasks to run as a gang. So what are the primitives we need to follow?

B

They have to be admitted as a gang, scheduled as a gang, preempted as a gang, and killed as a gang, right? These are the primitives which are missing and which Peloton has. We have that in production on Peloton and Mesos, and we are trying to do it through pod groups in Kubernetes. What we do is take the tasks as a gang. We do all the resource sharing and quota management, and based on that we admit them as a gang, and we bind or place them as a gang.

B

And then, if you want to preempt any of the tasks because of priorities, or because the cluster is busy and whatnot, we preempt the whole gang together. Similarly, if something fails, we make sure that the whole gang is killed, right? So this is how we added the gang primitives.
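A minimal sketch of the all-or-nothing behaviour of these gang primitives is shown below. It assumes a single resource dimension and a first-fit placement with the largest tasks placed first, which is only a heuristic; the function names are hypothetical and this is not how Peloton or Kubernetes pod groups are implemented.

```python
def place_gang(tasks, free):
    """Place every task of a gang or none of them.

    tasks: task name -> CPU needed. free: node name -> free CPU.
    `free` is only updated if the whole gang fits (first-fit, largest first).
    """
    trial = dict(free)          # work on a copy so failure leaves no trace
    bindings = {}
    for task, cpu in sorted(tasks.items(), key=lambda kv: -kv[1]):
        node = next((n for n, f in trial.items() if f >= cpu), None)
        if node is None:
            return None         # one task does not fit: reject the whole gang
        trial[node] -= cpu
        bindings[task] = node
    free.update(trial)
    return bindings


def preempt_gang(bindings, tasks, free):
    """Preempt (or kill) the whole gang and give its resources back."""
    for task, node in bindings.items():
        free[node] += tasks[task]
    bindings.clear()


free = {"n1": 4, "n2": 4}
too_big = {"ps": 2, "w1": 3, "w2": 3}
rejected = place_gang(too_big, free)        # nothing is placed
fits = {"ps": 1, "w1": 3, "w2": 3}
placed = dict(place_gang(fits, free))       # all three tasks are placed
free_after_place = dict(free)
running = dict(placed)
preempt_gang(running, fits, free)           # all three are released together
```

The first gang is rejected as a unit even though two of its three tasks would fit, which is the behaviour that avoids a partially started distributed training job holding resources.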

B

So this is how we run distributed TensorFlow on Kubernetes right now. Michelangelo is our deep learning service, which talks to the API server and creates the pods. Peloton takes those pods, does admission control, and places them, and once they are placed, they run as parameter servers and worker executors.

B

Every pod runs two containers, the parameter server container and the Michelangelo container, because that's the service we use. Then they go and discover each other through the API server, right? And then they get launched and run as a gang. So this is how we currently run distributed TensorFlow on Kubernetes.

B

So let's talk a little bit about workload colocation. Why do you want to co-locate workloads? This is a big problem for efficiency, right? If we run stateless services separately from the batch services, then you cannot use all the resources the way you should be using them.

B

We are aiming to save 20 to 25 percent of the resources if we can run them together, and 20 to 25 percent of resources, if you are talking about Uber's fleet, is a huge amount of money, right? So we can save 20 to 25 percent of money or servers if we co-locate them. The challenges, if we want to co-locate them on the same machine, are disk I/O, which I just talked about, network on the machine, CPU caches, and memory oversubscription, right?

B

If you are running everything together, then these are the challenges you face today. So what we thought is that it's hard to solve these problems on the same machine, but we can solve them on the same cluster. So we created something called dynamic partitions in our cluster: you have a stateless partition and you have a batch partition. So you have two different partitions. You oversubscribe the physical resources in each partition, and then you move machines if needed.

B

So, as I said, you have this cluster where you have stateless partitions and batch partitions. The stateless partitions are running stateless services, the batch partitions are running batch jobs, and we are measuring the hotness of each partition. So you see 40 percent utilization, 40 to 50 percent, and 50 to 60 percent, right? We oversubscribe the resources in each partition. This is how you save money: you pack the services and the jobs together, and then you measure whether something is getting impacted.

B

You move machines based on that.

B

So this is how we do it, right? We measure the partitions, and then the nodes which are least loaded get moved into a different partition based on the hotness of each partition. So if your partition is 60 or 70 percent hot, we move machines from the other partitions to this partition. That way we can ease up this partition, and services and other jobs are not getting impacted.
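The rebalancing rule described above can be sketched as follows. This is a deliberately simplified model with hypothetical names: hotness is taken as the mean node load in a partition, the 60 percent threshold comes from the figure mentioned in the talk, and a moved node is assumed to arrive empty because its previous workloads were drained or preempted. The real Peloton logic also considers per-node and per-service signals.

```python
def hotness(nodes):
    """Partition hotness, assumed here to be the mean load of its nodes."""
    return sum(nodes.values()) / len(nodes)


def rebalance(partitions, hot_threshold=0.6):
    """For each hot partition, move the least-loaded node from the coolest
    other partition into it. Returns a list of (node, from, to) moves."""
    moves = []
    for name, nodes in partitions.items():
        if hotness(nodes) < hot_threshold:
            continue
        donors = [d for d, n in partitions.items()
                  if d != name and len(n) > 1 and hotness(n) < hot_threshold]
        if not donors:
            continue
        donor = min(donors, key=lambda d: hotness(partitions[d]))
        node = min(partitions[donor], key=partitions[donor].get)
        del partitions[donor][node]
        nodes[node] = 0.0   # assumption: the node is drained before it moves
        moves.append((node, donor, name))
    return moves


# Illustrative loads (fraction of CPU in use per node).
partitions = {
    "stateless": {"s1": 0.8, "s2": 0.7},
    "batch": {"b1": 0.5, "b2": 0.3, "b3": 0.4},
}
moves = rebalance(partitions)
```

In this example the stateless partition is at 75 percent, so the least-loaded batch node, b2, is moved over, which brings the stateless partition's average down to 50 percent.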

B

There is an assumption here. We say that services, the stateless services, are higher priority than some of the batch jobs. This is the underlying assumption we make. So this is how it works.

B

We have something called the node agent, which runs on each node and measures what the CPU load is and all the different CPU metrics for each node, and then sends them to something called the node advisor. The node advisor goes and finds out, from all the nodes, what the load is and what the interference of the services on these nodes with each other is. It finds that out and then gives that information to the Peloton scheduler.

B

Now the Peloton scheduler takes decisions based on whether the partition is hot, whether the node is hot, and whether the service is hot. Based on these three parameters, it decides which machines we need to move from which partition. That way we don't impact the services or the jobs which are running. And we want to do it proactively, because if you do it reactively, the service is already impacted. So we try to measure the whole partition, along with the node, as well as the service.

B

So if we see the partition is growing in terms of hotness, we move machines to that partition. It seems easy, but we had to do a lot of work to do that. We did load-aware placement. We have to do load-aware placement because we don't want to cause churn in the system: it may happen that one machine is hot, and then you move a service from that machine to another machine, and that machine is hot too.

B

So you always place into that partition based on the load on the machines. We have the scorers: we built scorers for batch and stateless which can find out which is the right machine to move at this point. Then there are virtual partitions: we added partitions within partitions, because there were some batch jobs which are more important. And then there is break glass, right?

B

If there are unusual spikes in your services, then we can break glass and get all the machines from batch, with the assumption that some of the batch workloads are not as important as stateless, and give those machines to the stateless clusters. So this is how we are implementing colocation right now. It's implemented in Peloton, so it's pretty much orthogonal to Kubernetes or Mesos.

B

So, in summary, Kubernetes is the future for big data and ML workloads, based on the adoption. We have done a Peloton and k8s scheduler POC, and we are already implementing all these batch features.

B

Peloton on Mesos is in production for stateless and batch, and we are thinking about how we migrate, and looking for collaboration to enable Kubernetes for big data and ML. That's all, thanks.