Cloud Native Computing Foundation KubeCon + CloudNativeCon Europe 2020 - Virtual, 4 Sep 2020

From YouTube: Is Sharing GPU to Multiple Containers Feasible? - Samed Güner, SAP

Description

Provisioning GPUs for ML workloads in a data center can be very costly, and even more costly if they are not fully utilized. Thus, maximizing GPU utilization is a must for ML workloads. This session will show how a single GPU can be used to run multiple ML workloads, especially ML inference, in parallel and will deep dive into how GPUs are provisioned and attached using K8s device plugins. It will show how the NVIDIA device plugin can be extended to schedule multiple ML workloads to a single GPU and collect desired GPU information with Prometheus. This session will highlight and deep dive into native GPU sharing using the K8s device plugin without additional technologies such as vGPUs from VMware.

https://sched.co/ZesB

A

Hi all, thanks for joining this session. In this session, I will try to give you an answer to a very interesting question for machine learning workloads running on top of Kubernetes and consuming GPUs. This question has also gained huge traction and attention in the Kubernetes community overall, so we at SAP Artificial Intelligence were also very interested in finding an answer.

A

I will not only give you an introduction to how the provisioning of GPUs works, but also share our findings, our implementation, and an outlook on how you could get started already today, in answering the question of whether sharing a GPU with multiple containers is feasible.

A

But first, a little bit about myself. I am currently working as a developer at SAP Artificial Intelligence, mainly on infrastructure-related topics for machine learning and continuous delivery. I have been actively contributing to community projects of Terraform and Cloud Foundry and, most important, I can combine the two worlds of Germany and Turkey, with German beer and Turkish kebab, by living in Germany and having Turkish roots.

A

If you want to have a coffee with me after the session, feel free to DM me on Twitter or over the conference platform. I'm really looking forward to chatting with you. Regarding open source, SAP is a platinum sponsor of the Cloud Native Computing Foundation. We at SAP are committed to contributing to open source, and we do this with projects such as Gardener and Kyma for the Kubernetes world, but also SapMachine for the Java world, or Luigi. Feel free to check out these projects. Now, about artificial intelligence.

A

We care a lot at SAP about artificial intelligence and work on different business services, which are later directly consumed on the AI platform or embedded into a large product portfolio. Our vision is to build the intelligent enterprise by embedding, step by step, easy-to-consume machine learning services running on top of Kubernetes. One of the most consumed services is Document Information Extraction.

A

You can see it on the right, where we extract information from invoices. With that and many other services, we run a very serious number of models in production on Kubernetes, which consume not only CPUs but also GPUs for machine learning inference. So for machine learning workloads, most of the time we are using GPUs, and GPUs are a very expensive resource.

A

This is also what we have seen in our production environment, and so the GPU is also a challenge. If I look into our production environments and see GPUs with an average utilization of only 70 percent, it makes me really sad, because why are we not able to fully utilize the GPU? So on the right side, I did a fictional calculation of how much one can lose.

A

If we run about 500 GPU models at an average price of three dollars, you can see that one can actually lose about 300,000 dollars per month when running 500 models on GPUs at an average GPU utilization of 70 percent, and this is a challenge. So at SAP we asked ourselves how we can improve the situation.
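The fictional figure can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the three dollars is an hourly price per GPU, one GPU per model, and a 30-day month, none of which is stated explicitly in the talk.

```python
# Hypothetical reconstruction of the fictional cost calculation.
# Assumptions (not from the talk): price is per GPU-hour, one GPU per model,
# and a month has 30 days.
HOURS_PER_MONTH = 24 * 30


def idle_cost_per_month(gpus: int, price_per_gpu_hour: float, utilization: float) -> float:
    """Return the money spent per month on GPU capacity that stays unused."""
    total_cost = gpus * price_per_gpu_hour * HOURS_PER_MONTH
    return total_cost * (1.0 - utilization)


wasted = idle_cost_per_month(gpus=500, price_per_gpu_hour=3.0, utilization=0.70)
print(f"{wasted:,.0f} USD per month")
```

Under these assumptions the unused 30 percent comes out slightly above the "about 300,000 dollars" quoted in the talk.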

A

Can we actually share a GPU to improve the situation? So we asked ourselves whether sharing a GPU between containers is feasible. In general, at SAP we are using GPUs from NVIDIA, and I think in the machine learning area they are also widely adopted. Talking about GPUs, I often asked myself back then how the magic works behind the scenes. How is a device made available to Kubernetes so Kubernetes can consume it, and in the end, how is it consumed by our pod?

A

Great work was done back in Kubernetes 1.10 with the so-called device plugin, where everyone and every vendor can provide its own custom device. This could be a GPU, an FPGA, or your own custom device, such as a Bitcoin miner, which can then be consumed by containers.

A

The contributors did great work here to lower the barriers to accessing custom devices. So let's check it out. Let's keep it general and imagine we are on a worker node. As we all know, each of our nodes is running the kubelet, our node agent, which is responsible for a variety of things such as spawning our containers, keeping them running, and talking to the Kubernetes master.

A

If we now attach our own custom device to our node, and we want to make it not only consumable but schedulable through Kubernetes itself, we need to install the vendor's device plugin, which is actually a simple gRPC server running on every node where we have, and want to offer, our custom device. Once the device plugin is deployed through a usual Kubernetes deployment, the vendor device plugin does some initial hardware initialization, for example, finding the hardware, loading the driver, etc.

A

Once done, it registers itself with the kubelet by calling Register and passes the resource identifier we have already seen in our pod specification, consisting of a vendor name and the device name. Once done, the kubelet registers the given device plugin and the offered resource and continues by requesting more details on the number of available resources and their identifiers. The kubelet regularly checks for updates of the device IDs for the given devices. The device IDs are used before container creation and passed by the kubelet through the Allocate function to the device plugin.

A

During Allocate, the device plugin prepares the device, marks it internally as used, and returns data which is crucial for the container to consume the device correctly. So the key functions to be implemented in the device plugin are Allocate, Register, and ListAndWatch. So let's have a look at what happens behind the scenes.
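The Register, ListAndWatch, and Allocate flow described above can be sketched as a toy, in-process simulation. This is not the real API: the actual device plugin interface is gRPC based, ListAndWatch is a long-lived stream, and the kubelet keeps its own allocation bookkeeping. The class names, the resource name `example.com/widget`, and the environment variable `EXAMPLE_VISIBLE_DEVICES` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDevicePlugin:
    """Hypothetical stand-in for a vendor device plugin."""
    resource_name: str              # "<vendor>/<device>", as used in a pod spec
    device_ids: list                # IDs found during hardware initialization
    in_use: set = field(default_factory=set)

    def list_and_watch(self):
        """Report the IDs of all devices attached to this node."""
        return list(self.device_ids)

    def allocate(self, device_ids):
        """Mark devices as used and return what the container needs."""
        for dev in device_ids:
            if dev not in self.device_ids or dev in self.in_use:
                raise ValueError(f"device {dev} is unknown or already in use")
        self.in_use.update(device_ids)
        return {"EXAMPLE_VISIBLE_DEVICES": ",".join(device_ids)}


class ToyKubelet:
    """Hypothetical stand-in for the node agent."""

    def __init__(self):
        self.plugins = {}

    def register(self, plugin):
        """Called by the plugin once its hardware is initialized."""
        self.plugins[plugin.resource_name] = plugin

    def capacity(self, resource_name):
        """The device count the node agent would report to the master."""
        return len(self.plugins[resource_name].list_and_watch())

    def start_container(self, resource_name, count):
        """Pick free device IDs and call allocate before container creation."""
        plugin = self.plugins[resource_name]
        free = [d for d in plugin.list_and_watch() if d not in plugin.in_use]
        if len(free) < count:
            raise RuntimeError("not enough free devices on this node")
        return plugin.allocate(free[:count])


kubelet = ToyKubelet()
kubelet.register(ToyDevicePlugin("example.com/widget", ["dev-0", "dev-1"]))
print(kubelet.capacity("example.com/widget"))
print(kubelet.start_container("example.com/widget", 1))
```

The point of the sketch is the division of labor: the plugin owns the device knowledge, while the node agent only sees opaque IDs and a count.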

A

Once I deploy a machine learning model to Kubernetes which requests an NVIDIA GPU, the following happens. In the swim lane diagram you can see, on the far right, our GPU, to the left of it the device plugin, then our worker and our master node. In the first stage, as discussed before, our device plugin does some hardware initialization by discovering the GPUs which are attached to the node.

A

Once done, it registers itself in the second stage with the kubelet. Interestingly, the kubelet reports back to the master node the availability of the new device just advertised by the device plugin, if that has not already been done by device plugins running on other nodes. After that, the kubelet gets data about the available devices on the node from the device plugin in the third stage. The device plugin identifies all GPUs uniquely and returns their IDs. The kubelet is now aware of the number of available GPUs on the node.

A

It reports the count back to the master, because the master needs to know what is available to schedule the TensorFlow model which we are just submitting as a user. Once Kubernetes receives the request to schedule the model, it allocates a GPU in stage four, given that the user requests a GPU, and the NVIDIA device plugin returns an environment variable, NVIDIA_VISIBLE_DEVICES, with the GPU ID. This variable is later passed to the container, which can then directly access the corresponding GPU.

A

And this is the point where it gets very interesting for our implementation. Actually, this was the whole magic of provisioning a GPU. Back then, we shared our initial thoughts, findings, and implementation approaches with the community in many comments. Thanks again here to Thomas Youngblood for working with me on this topic. Back then, we identified three possible variants of sharing a GPU. The first, which was already in place, is model stuffing in application logic, where we bake multiple models into one Docker image and use TensorFlow Serving to switch between them.

A

This very static approach does not offer cgroups for each model, only for the container as a whole, and, in addition to that, extended features such as rate limiting per model are not possible.

A

The third variant was node selector hacking, and you hear it: it is hacking. We do not use the device plugin at all. We give the container privileged mode, which you should never ever do, and let the containers use the GPUs on their own.

A

The drawback of this approach is that Kubernetes does not really know that there are GPUs and cannot respect them during scheduling. A very large problem is that Kubernetes might schedule pods on your node even though you do not have any GPUs left. In the end, we decided on variant two. Well, it was not really a decision but more an exploration, with the initial help of some developers from the NVIDIA device plugin. Variant two tries to solve this issue in a more native way by extending the device plugin with so-called vGPUs.

A

To be honest, vGPUs sound very terrific, but they are not. So let's check out how we extended the device plugin to share GPUs.

A

So the problem in general was that once a GPU ID is allocated, it is marked as used by the device plugin, and so no more devices are advertised from the kubelet to the Kubernetes master. In other words, any deployment requesting a GPU will be pushed back, as its resource requirements cannot be met. In addition to that, we knew from node selector hacking and model stuffing that the GPU can actually handle multiple processes accessing it directly. So we started thinking about how we could do that.

A

Why shouldn't we advertise more GPUs, even if there is only one GPU? The solution was that simple: using the GPU ID of a physical GPU, we generate a number of virtual GPUs by simply adding a suffix. Afterwards, we assign those vGPUs in the device plugin to the physical one and tell the kubelet that we own more GPUs, each with its own GPU ID. So we lie. Once the kubelet calls the Allocate function, we simply remap the virtual GPU ID to the physical one, and that's it. But as simple as it sounds, this approach comes with quite a few trade-offs.
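The suffix trick can be illustrated in a few lines. This is a minimal sketch, not the code of the extended plugin: the suffix format `-vgpu-<n>`, the function names, and the sample GPU ID are hypothetical. Only `NVIDIA_VISIBLE_DEVICES` is the variable named earlier in the talk.

```python
VGPU_MARKER = "-vgpu-"  # hypothetical suffix format


def make_virtual_ids(physical_id: str, count: int) -> list:
    """Advertise `count` virtual GPUs for one physical GPU."""
    return [f"{physical_id}{VGPU_MARKER}{i}" for i in range(count)]


def to_physical_id(virtual_id: str) -> str:
    """Strip the suffix to recover the physical GPU ID."""
    return virtual_id.rsplit(VGPU_MARKER, 1)[0]


def allocate_env(virtual_ids: list) -> dict:
    """Remap the requested virtual IDs to physical ones at allocation time."""
    physical = sorted({to_physical_id(v) for v in virtual_ids})
    return {"NVIDIA_VISIBLE_DEVICES": ",".join(physical)}


advertised = make_virtual_ids("GPU-aaaa", 3)   # "GPU-aaaa" is a made-up ID
print(advertised)
print(allocate_env([advertised[2]]))
```

Every container that is handed one of the virtual IDs ends up seeing the same physical GPU, which is exactly the "lie" told to the node agent.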

A

It already starts with a problem: how many vGPUs can we, or shall we, provision per GPU? How do we set the limit, and are there any boundaries with regard to GPU cores and GPU memory? So we asked ourselves the following questions. How many models can we fit on the NVIDIA K80? We are running our machine learning inference models on the K80. How does the whole system behave, and what are the trade-offs in doing so, and the limitations, of course? So we did our experiments and collected data. Speaking of collecting data:

A

The device plugin is nothing more than a pod accessing the local devices to advertise resources to Kubernetes, meaning that the very same strategy used for monitoring and collecting data from pods can be used here too. So we extended the NVIDIA device plugin, which internally uses the NVIDIA Management Library (NVML), to collect additional data from the devices. The NVML library offers low-level access through Go and C bindings. In the end, we created a few Grafana dashboards with the data we got from Prometheus to track different values, such as the number of virtual GPUs.
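For Prometheus to scrape such values, the plugin has to expose them in the Prometheus text exposition format. The following sketch only renders that format from plain Python values. The metric name, the label, and the numbers are hypothetical placeholders, not measurements from the talk, and reading real values through NVML is left out.

```python
def render_gauge(name: str, help_text: str, samples: list) -> str:
    """Render one gauge metric in the Prometheus text exposition format.

    `samples` is a list of (labels, value) pairs, where labels is a dict.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
print(render_gauge(
    "vgpu_memory_used_mib",                       # hypothetical metric name
    "GPU memory used per virtual GPU in MiB.",
    [({"vgpu": "GPU-aaaa-vgpu-0"}, 600), ({"vgpu": "GPU-aaaa-vgpu-1"}, 580)],
))
```

Serving this text on an HTTP endpoint of the plugin pod is enough for a standard Prometheus scrape configuration to pick it up.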

A

We also track the GPU RAM consumed per vGPU and the utilization. So we had collected data, we had proper questions, and we did some experiments to answer them. Let's check out our experiments. In the first experiment, we ran the well-known n-body simulation. While it is not really comparable to machine learning inference, it still gives us some insight into how the work might be distributed internally. Amazingly, on the left we can see that the GFLOPS are distributed evenly with every pod we add.

A

This fair scheduling is also confirmed by our data on the right, where the amount of time for an experiment to finish grows linearly with the number of pods accessing a vGPU. Keep in mind that we did not limit the vGPU or the GPU, and we gave every pod a fair share of CPU from the p2.xlarge node at AWS we were running on. While the experiment is somewhat comparable to machine learning training, we were more interested in sharing GPUs for machine learning inference.
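The linear growth follows from a simple fair-sharing model. As a sketch, assume the GPU delivers a fixed aggregate throughput F that is split evenly among n pods, each running the same amount of work W. These symbols are introduced here for illustration and do not appear in the talk.

```latex
\[
F_{\text{pod}} = \frac{F}{n}, \qquad
T(n) = \frac{W}{F_{\text{pod}}} = n \cdot \frac{W}{F} = n \cdot T(1)
\]
```

Under this model, doubling the number of pods halves the throughput per pod and doubles the time each experiment needs to finish.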

A

Inference actually includes more factors, such as network throughput and latency. So let's check out the experiments we have done. I forgot to say this: we are running all of our experiments on a p2.xlarge instance, which has four cores and one NVIDIA Tesla K80 attached with 12 gigabytes of RAM. As the Inception v3 model was used back then for a large amount of our production workload, we used it to drive our experiments.

A

In the very first inference experiment, we spawned up to 12 vGPUs in total, sent a total of 10,000 requests per pod, and limited each pod to 350m CPU. You will see later how important the limiting of CPU actually is in our implementation. A disclaimer at this point: this experiment is done using a sequential request pattern, meaning each model handles at most one request at a time. Given the data on the left, for 12 models running on 12 vGPUs, we have a p99 response time below 500 milliseconds.
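For readers unfamiliar with the metric, a p99 response time can be computed from recorded latencies with the nearest-rank method, as sketched below. The talk does not say which percentile definition was used, so this is an assumption, and the sample latencies are synthetic.

```python
def percentile_nearest_rank(values: list, p: int) -> float:
    """Return the p-th percentile (1..100) using the nearest-rank method."""
    if not values or not 1 <= p <= 100:
        raise ValueError("need at least one value and 1 <= p <= 100")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100   # integer ceil(p * n / 100)
    return ordered[rank - 1]


# Synthetic latencies in milliseconds, for illustration only.
latencies_ms = [120, 95, 110, 480, 130, 105, 98, 101, 115, 125]
print(percentile_nearest_rank(latencies_ms, 99))
print(percentile_nearest_rank(latencies_ms, 50))
```

With only a handful of samples the p99 is simply the slowest request, which is why a large number of requests per pod matters for a meaningful figure.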

A

You might be asking how we managed to fit 12 Inception models onto one K80 with 12 gigabytes of RAM, while most of the time a model wants to allocate a full GPU. For that, we used a very nice functionality of TensorFlow Serving. In TensorFlow Serving, one can specify, using tf.GPUOptions, the fraction of GPU RAM a model is limited to. Given our data on the right, you can see that, with 12 models running on one K80, we assign each model only five percent of the available VRAM, which corresponds to approximately 600 megabytes for each model.
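The five percent figure translates into memory as follows. This sketch assumes 12 GiB of usable memory on the GPU. In TensorFlow 1.x the fraction corresponds to the `per_process_gpu_memory_fraction` field of `tf.GPUOptions`, and the TensorFlow Serving model server has a command-line flag of the same name; how the talk's deployment set it is not shown.

```python
GPU_MEMORY_MIB = 12 * 1024  # assumption: 12 GiB usable on the K80 GPU


def memory_per_model_mib(fraction: float, total_mib: int = GPU_MEMORY_MIB) -> float:
    """Memory one model may use when limited to `fraction` of the GPU RAM."""
    return fraction * total_mib


print(f"{memory_per_model_mib(0.05):.0f} MiB per model")
```

Twelve models at five percent each reserve 60 percent of the GPU memory in total, which leaves headroom for further models.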

A

Another approach was to offer vGPUs until the memory is full, to get the most out of the GPU, but actually we were very keen on finding the upper limit: how many models can run on one K80 without crashing? So we asked ourselves, can we actually go deeper? And hell yeah: the Inception v3 model requests at most 228 megabytes per model, so it is theoretically possible to stack up to 50 models on one GPU. In practice, the CPU was the limitation.

A

On our p2 instance, we could run this experiment for only 30 models, with a CPU limit of 100m and assigning each model three percent of the VRAM we have on our GPU. Keep in mind that this is still the sequential request pattern. Nevertheless, for 30 models our p99 response time worsened only by a factor of 10 compared to a single deployment. In our opinion, this is an amazing way to keep models running which do not have large spikes in the amount of incoming requests.
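The stacking limit and the 30-model run can be checked with integer arithmetic. The sketch assumes 12 GiB (12,288 MiB) of GPU memory and treats the 228 megabytes as MiB; both are simplifications, so the result is only roughly the "up to 50" quoted in the talk.

```python
GPU_MEMORY_MIB = 12 * 1024   # assumption: 12 GiB usable GPU memory
MODEL_PEAK_MIB = 228         # peak memory of one model, from the talk


def max_models_by_memory(total_mib: int, per_model_mib: int) -> int:
    """Upper bound on stacked models if memory were the only constraint."""
    return total_mib // per_model_mib


def reserved_percent(models: int, percent_per_model: int) -> int:
    """Share of GPU memory reserved when each model gets a fixed percentage."""
    return models * percent_per_model


print(max_models_by_memory(GPU_MEMORY_MIB, MODEL_PEAK_MIB))
print(reserved_percent(models=30, percent_per_model=3))
```

Thirty models at three percent each reserve 90 percent of the GPU memory, so in that run the GPU memory was close to full even though the CPU was the stated bottleneck.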

A

These were quite interesting results, but we also emulated, of course, the parallel request pattern. Keep in mind, we are still on our p2 setup with Inception v3, but this time we let the model process 10 requests at a time. We did not enable batching here, to establish a baseline which we could later compare with batching.

A

When we started our experiment with one deployment to establish a baseline, we observed that increasing or decreasing the CPU limit also limits our GPU utilization. Simply stated, we introduce an artificial bottleneck here for the GPU utilization. This has a huge impact, as we can use the CPU limit to avoid overcommitment of the GPU. In general, we have observed that overcommitment leads to a large increase in latency. So the equation is really simple in that regard: if you increase your throughput, the GPU has more work to do.

A

Given the limit of 350m for the CPU and three percent VRAM, we reached 10.7 queries per second at a GPU utilization of 50 percent.

A

That was for one deployment. For two deployments, we were able to fully utilize the GPU at about 98 percent and achieved more than double the QPS at a similar latency. By enabling batching, we could even decrease the latency and increase the throughput by two times in exchange for 6 times more VRAM.

A

Given all this data, we asked ourselves again: is sharing a GPU with multiple containers really feasible? Well, in Germany, Germans after seeing all of these results would say "jein", which is slang for yes and no. Or in other words, in our world: yes, you can share every GPU, but we have trade-offs and have to accept, at least for this implementation, some limitations.

A

So after seeing so many figures and numbers, you are asking yourself: okay, what happened here? What does it all mean for our implementation? To be clear and honest here, our solution is far away from production readiness, but we were still able to share our GPU with very promising results. But for that result we really tried hard. We had to do a lot of runs to find out how much VRAM we actually need, what our minimum limits are, and what the relationship is between GPU utilization, CPU limitation, throughput, and latency.

A

Nevertheless, we were able to cut our costs by up to a factor of 30 for Inception v3 models running on top of the K80 for machine learning inference. Besides the lock-in to TensorFlow Serving for inference, which is due to the inability to limit video RAM from the device plugin, we had to misuse Kubernetes CPU limits to artificially limit our GPU utilization.

A

While this is very hacky and not really Kubernetes native, we were still able to deploy multiple models with very similar latencies while fully utilizing our GPU. Remember the calculation: GPU utilization was a problem, and with this we can actually fully utilize our GPU. So we have saved some money to buy ourselves some things with our 300,000 dollars. There are, of course, other limitations besides the control of video RAM and GPU utilization. First and foremost, we do not really have a clue about what happens on the GPU level.

A

When we run multiple processes, we assume that there is some sort of fair scheduling mechanism, which we could observe in our experiments. Still, given the fact that NVIDIA cannot guarantee isolation on the hardware level, it does not make sense for us to run GPU sharing in our multi-tenancy setup. Moreover, we are simply not able to specify the VRAM and GPU cores, which makes it very hard to use native Kubernetes scheduling to schedule pods. Speaking about scheduling: we do our scheduling driven by the number of virtual GPUs.

A

It is not Kubernetes deciding using detailed sub-resources such as cores and VRAM. It is only using vGPUs, where we are responsible for deciding how we map our GPU to vGPUs. Another very simple problem is resource fragmentation.

A

Even if we have GPU sharing with isolation and maybe also the possibility to specify limits on VRAM and GPU cores, we are still unable to bin-pack our models correctly, because the kubelet returns aggregated information to the API server, and it does not see the GPU as a first-class citizen. So it can happen that, during the scheduling of a workload, a GPU gets overcommitted.

A

This can happen if there are multiple GPUs on the same node. For this problem, we would need a locality-aware scheduling mechanism implemented in the NVIDIA device plugin, which avoids overcommitment of GPUs.
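One possible shape of such a locality-aware choice is a spread policy: among the free virtual GPUs, prefer one whose physical GPU currently serves the fewest pods. This is a hypothetical sketch, not something the talk's plugin implements. The suffix format `-vgpu-<n>` and the function names are assumptions.

```python
from collections import Counter

VGPU_MARKER = "-vgpu-"  # hypothetical suffix format for virtual GPU IDs


def to_physical_id(virtual_id: str) -> str:
    """Recover the physical GPU ID from a virtual GPU ID."""
    return virtual_id.rsplit(VGPU_MARKER, 1)[0]


def pick_least_loaded(free_vgpus: list, used_vgpus: list) -> str:
    """Pick a free vGPU on the physical GPU that serves the fewest pods."""
    if not free_vgpus:
        raise ValueError("no free virtual GPUs left on this node")
    load = Counter(to_physical_id(v) for v in used_vgpus)
    # Sorting first makes the choice deterministic when several GPUs tie.
    return min(sorted(free_vgpus), key=lambda v: load[to_physical_id(v)])


used = ["GPU-a-vgpu-0", "GPU-a-vgpu-1"]
free = ["GPU-a-vgpu-2", "GPU-b-vgpu-0", "GPU-b-vgpu-1"]
print(pick_least_loaded(free, used))
```

A policy that only counts virtual GPUs still ignores how much memory or compute each pod really uses, which is why per-process limits remain on the wish list.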

A

So now, what do we really need to run GPU sharing in production? We need isolation on the GPU level: the GPU device driver should offer an API to specify VRAM and GPU core constraints per process, like we have with cgroups and as we already see for CPU and memory. At the current point in time, NVIDIA doesn't support any solution in that regard. With regard to what we discussed before, resource fragmentation and overcommitment, we need a device plugin which is able to manage the sharing of multiple GPUs on one node.

A

It should be locality-aware and avoid overcommitment of GPUs. Another chapter we did not really discuss at all is the initialization overhead when we spawn a vGPU, but also how fast a device plugin can switch between processes which are using the GPU. And, of course, last but not least, we need those limits to actually use native Kubernetes scheduling, where people can just specify in their pod spec which limits they want to have on the machine learning model. And I have to tell you that our community is actually amazing.

A

Since our initial proposal on the issue, there have been four projects trying to solve this issue, and one very promising project was even released after the original date of this session. So let's check them out. The first project is from Deepomatic. It uses the same approach as we do and lets the user specify the number of vGPUs to be exposed. Using TensorFlow Serving and CPU limits, one could realize the very same setup as we did in this session. The second project is from Tencent and is called GPU Manager.

A

It internally uses another project developed by Tencent called the vCUDA controller, which is a wrapper around the NVIDIA device library, hooking into CUDA calls to enforce isolation. On the right, you can see that you are then able to specify limits for the GPU.

A

Moreover, it requires a custom scheduler extender to extend the default Kubernetes scheduler for GPU admission. Last but not least is the newly released work KubeShare, which also enforces hardware isolation by intercepting CUDA calls to the GPU and implements its own scheduling mechanism. While GPU Manager from Tencent requires an extension of the Kubernetes scheduler, KubeShare coexists as a controller and solves a lot of fragmentation problems with custom resources.

A

They have also written a paper about their approach, showing their approach and their results. I would recommend you check it out. It gives a very comprehensive insight into the world of containers, scheduling, GPUs, and GPU sharing. If you are planning to contribute to or implement GPU sharing at your company, you should definitely check out those projects and contribute to them. At this point, I want to give a big thanks to all of the people contributing to the ecosystem we already have, and with that we are actually coming to the end of this session.

A

Thanks for having me here. Feel free to contact me on GitHub, LinkedIn, or Twitter under my name, Samed Güner. I'm happy to answer your questions.

A

Hi all, thanks for joining the session. I'm now going to answer a few questions I got here. The first question is from Josh. He's asking: based on the fact that NVIDIA can't see a way to get cgroup equivalents for GPUs, will this remain unsolved, keeping this very niche and outside of upstream? This has something to do with the NVIDIA device we actually use.

A

So, for example, NVIDIA has announced the T4 devices, which allow up to seven GPUs, or seven workloads, on one GPU. In our experiments, we used K80s, and the K80 in its current state does not support this. I do not believe that K80s, or V100s, are going to support GPU sharing soon or at any time. Therefore, we also have these great works.

A

I've just shown you, for example, KubeShare, which are trying to solve this by implementing their own scheduling mechanism on a lower level. I hope this answers the question. Let's go on to the next question, from Claymore: is it usable on GKE? There's already a kubelet device plugin on them. Does it bypass it? The device plugin is just a DaemonSet you install on the nodes which actually have NVIDIA GPUs, and, as far as I know, you can also uninstall the existing device plugin.

A

This shouldn't be a problem. So yes, you could install any of the open source projects on a GKE cluster to get GPU sharing.

A

Another question is from Jab: I always thought that context switching between processes slowed down GPU processes. From your experiment, it seems like you've disproven this. Does this have anything to do with the software you're using for serving, or is context switching not that big of a problem anymore? I have to be honest here. Because we are trying to solve the problem on a very high level, on the device plugin level, we do not really know what happens on the GPU level.

A

I strongly assume that there is some kind of thrashing behavior, where processes are stopped and then new processes take over the available capacity of the GPU. But, to be honest here, I do not really know. It's really amazing to see that context switching is actually not a big deal.

A

So I hope that answers your question. There's another question: what are your thoughts on GPU sharing for short-lived ML jobs? Would you think it's better to get it done faster or to have a higher bandwidth?

A

I think by short-lived ML jobs you mean multiple machine learning experiments running in parallel, for example, for training. I would suggest that getting it done faster is better than deploying multiple machine learning training jobs onto the same GPU. Of course, this changes from use case to use case, but in general my advice would be to get it done fast. Also, if you see that a model doesn't converge in a given time period,

A

you can cancel it and deploy the next job. I cannot see any other questions here. For the other questions, I have already put an answer on them, so I think I can end this session here. Thanks very much again for joining this session.

A

It was great to be here at this virtual event. Have a nice virtual event in the upcoming days. See you, bye-bye.