Cloud Native Computing Foundation KubeCon + CloudNativeCon Europe 2022, 2 Jun 2022

From YouTube: Improving GPU Utilization using Kubernetes - Maulin Patel & Pradeep Venkatachalam, Google

Description

Kubernetes supports efficient utilization of resources by enabling applications to request the precise amount of resources they need. Unlike fractional requests for CPUs, fractional requests for GPUs are not allowed in Kubernetes. GPU resources requested in the pod manifest must be an integer number. This means one GPU is fully allocated to one container even if the container needs only a fraction of the GPU for its workload. Without support for fractional GPUs, GPU resources are invariably overprovisioned, leading to waste. This is especially true for inference workloads that process a handful of data samples in real time. To address this limitation, we have developed user-friendly solutions that allow a single GPU to be shared by multiple containers, thereby improving GPU utilization and saving cost. In this talk, we will show demos of our solutions and share performance results.

A

Good afternoon, everyone. My name is Maulin, and I'm a group product manager at Google Cloud working on Kubernetes Engine.

A

We all know that GPUs are a very expensive resource, and GPU utilization is a core concern for all GPU users. Poor utilization of GPUs costs them dearly. So in this talk we are going to show you how to improve GPU utilization using Kubernetes.

A

So, in my humble opinion, Kubernetes is an ideal platform for AI/ML and high-performance computing workloads, and there are three core reasons why I think so. Number one is portability: Kubernetes provides open-standards-based, cloud-native APIs.

A

This allows practitioners to seamlessly port workloads between laptops, private data centers and the public cloud. Second is scalability: Kubernetes can seamlessly scale from a single node to thousands of nodes. It supports autoscaling, auto-provisioning, GPUs, TPUs and many other advanced features that allow practitioners to do very large-scale training and inference.

A

Third is productivity. Kubernetes makes practitioners more productive by freeing them from having to manage the underlying resources and compatibility issues, so they can focus on their core business mission, be it training, serving or high-performance computing.

A

So let me quickly walk you through the architecture of Google Kubernetes Engine. Google Kubernetes Engine is a fully managed container orchestration platform provided by Google. It has two main components: the control plane and the data plane. The control plane comprises many things, including master nodes, the API server, the scheduler, etcd and many other services. The control plane provisions the data plane, which comprises worker nodes. A worker node is the place where workloads run, and worker nodes can run workloads using GPUs and TPUs. Worker nodes are grouped together as a node pool.

A

All the nodes that belong to a single node pool share the same configuration. The node pool is also the basic unit of autoscaling. All the nodes in a node pool have either CPUs or GPUs to run the workload.

A

So, as I mentioned before, GPU utilization is a core concern for GPU users, and poor utilization costs them very, very dearly. What we have observed in our GKE fleet is that GPU utilization for a typical workload is quite low, and it is actually getting worse day by day as GPUs get more and more powerful.

A

A single workload may not even be able to saturate a very powerful GPU. The underutilization problem is especially acute for certain types of workloads, such as inference, gaming, visualization and notebooks.

A

So let's look at some examples. Data scientists build models using notebooks, and most notebooks today are attached to GPUs. As we all know, notebooks stay idle for prolonged periods of time, wasting a very, very expensive resource. Let's look at some more examples, like chatbots, vision, product search and product recommendation.

A

These are all real-time applications that are latency sensitive and business critical. So Kubernetes autoscaling and auto-provisioning features are essential for such applications, but not sufficient, for two reasons. One is that it takes minutes to spin up a new node in Kubernetes, and most of these applications are latency sensitive, so they cannot tolerate that delay.

A

Also, until now, we could not do very effective bin packing for GPU workloads. So how do we help these workloads be more cost efficient? That's the main purpose of today's talk. The main challenge we face today is that Kubernetes allows fractional utilization of CPUs, but it does not allow fractional utilization of GPUs. What that means is that a Kubernetes workload can ask for 0.5 virtual CPUs, and Kubernetes knows how to give the workload 0.5 CPUs, but today you cannot ask for 0.5 GPUs in Kubernetes. So what happens is this:

A

One GPU is fully allocated to one workload, even though the workload needs only a fraction of the GPU to execute its task. So how do we fix this? There are many solutions that allow workloads to share a GPU.

A

Some solutions work at the application level, some work at the GPU system and software level, and some work at the hardware layer. So today I'm going to talk about two solutions that we recently launched.

A

The first one is time sharing and the second one is multi-instance GPU. Both are very popular mechanisms for sharing GPUs, and together they address most use cases and most workload needs. So let's walk through them one by one.

A

So first I'm going to explain the temporal multiplexing solution, popularly known as time sharing. Time sharing allows multiple containers to run on a single GPU.

A

Each container gets a time slice, so the GPU is allocated fairly to all the containers, and under the hood it does time sharing through context switching. What that means is that at any given point in time, one container has exclusive use of the GPU, but at a periodic time interval there is a context switch and the next container gets exclusive use of the GPU. This is done in a round-robin fashion, so every container gets the fair time slice that it deserves.

A

So the beauty of this solution is that when there is only one container allocated to a GPU, it gets to use the entire compute time of the GPU. But as soon as you add a second container, the two containers share the GPU, so each one gets about half the compute time. That's how you can enable fair sharing of GPUs.
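
The fair-share arithmetic described here can be sketched in a few lines of Python. This is only an illustration under the assumption of perfectly equal round-robin slices with no context-switch overhead; the function name `compute_share` is made up for this example.

```python
def compute_share(active_containers: int) -> float:
    """Fraction of GPU compute time each container gets under equal round-robin slicing."""
    if active_containers < 1:
        raise ValueError("need at least one active container")
    # Assumption: slices are equal and switching between containers costs nothing.
    return 1.0 / active_containers


if __name__ == "__main__":
    for n in (1, 2, 10):
        print(f"{n} container(s): {compute_share(n):.2f} of the GPU time each")
```

A single container gets the whole GPU, two containers get about half each, and ten get about a tenth each.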

A

So for NVIDIA GPUs, the older generations did context switching, or preemption, at the CUDA kernel boundary.

A

However, the newer generations of NVIDIA GPUs, Pascal and later, do preemption and context switching at instruction-level boundaries, and that facilitates fair sharing of the GPU. The solution that I'm going to explain today is fully managed by GKE, so all the configuration, management and underlying heavy lifting is done by GKE.

A

So I have already explained what time sharing means. Now I'm going to explain how you can enable time sharing on GKE and what the user experience looks like.

A

So in order to create a GPU node where time sharing is enabled, you can provide a configuration parameter either at cluster create time or at node pool create time. In this chart you can see an example: it has a configuration parameter called max time-shared clients per GPU, and the value is set to 10. What that means is that, in this example, an NVIDIA Tesla T4 GPU can be shared by up to 10 containers.

A

So 10 is the upper limit: between 1 and 10 containers can use this GPU. You can do all this configuration with API calls. We are also going to launch a user interface, so you will be able to do the same configuration using the UI.
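
As a rough illustration of the node pool configuration described here, the sketch below assembles a `gcloud container node-pools create` command line. The accelerator keys `gpu-sharing-strategy` and `max-shared-clients-per-gpu` follow the GKE documentation as best recalled and are an assumption here; they differ from the parameter name spoken in the talk. The helper name, pool name and cluster name are hypothetical, and a real command would also need location and machine type flags.

```python
import shlex


def time_sharing_node_pool_cmd(pool, cluster, gpu_type="nvidia-tesla-t4", max_clients=10):
    """Build an argument list for creating a time-shared GPU node pool (sketch only)."""
    # Assumption: these accelerator keys match the current gcloud CLI.
    accelerator = (
        f"type={gpu_type},count=1,"
        "gpu-sharing-strategy=time-sharing,"
        f"max-shared-clients-per-gpu={max_clients}"
    )
    return [
        "gcloud", "container", "node-pools", "create", pool,
        f"--cluster={cluster}",
        f"--accelerator={accelerator}",
    ]


if __name__ == "__main__":
    # Hypothetical names; the command is only printed, never executed.
    print(shlex.join(time_sharing_node_pool_cmd("ts-pool", "demo-cluster")))
```

Printing the command rather than running it keeps the sketch free of side effects.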

A

So let's say you run the command, either cluster create or node pool create. It will automatically set up the nodes with the time-sharing configuration. After the nodes are created and the drivers are installed, you can inspect a node.

A

So let's inspect the node by running kubectl describe nodes. You can see here that the nvidia.com/gpu resource value is set to 10. So what does that mean? It means 10 shared GPU resources are available, and each resource represents a time slice.

A

So, in simple terms, up to 10 containers can share this one NVIDIA T4 GPU.

A

So after the nodes are created, the next thing it does is label the node. Each node that is created through this configuration will be labeled, so that workloads which request a time-shared GPU can land on this particular node. In this example, you can see that two parameters are specified: one is time sharing, and the other is the maximum number of containers that can share this GPU. The third thing it does is taint this node.

A

Why is this needed? It is needed to prevent whole-GPU workloads from being scheduled on this node. You only want workloads that want a shareable GPU to land here. You don't want a workload that needs a full GPU to land on this particular node.

A

So now we have set up the nodes. Next thing is to basically configure the workload. So what do we do? We specify the deployment.

A

So, within the deployment spec, you can specify node selectors or affinities to schedule the workload to run on a time-shared GPU.

A

If the nodes don't exist, then GKE is smart enough either to autoscale an existing node pool that matches the configuration, or to create a brand new node pool that matches the request from the workload, and I'm going to talk more about that later in this talk. So now the workloads have to request nvidia.com/gpu, and the request count in this example is one. What you have to remember here is that one is not a measure of the GPU time allocated to a particular container.

A

How much time a container gets depends on how many containers are running on the GPU. If there are 10 containers, each will get one tenth of the time. If there is only one container, it gets to use the GPU exclusively.
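
A deployment of the kind described here might look like the following sketch, written as a Python dictionary that can be dumped to JSON for `kubectl apply -f`. The two node selector label keys are assumptions based on the GKE documentation rather than on anything shown in this transcript, and the function name, workload name and image are hypothetical.

```python
import json


def time_shared_deployment(name, image, replicas, max_clients=10):
    """Build a Deployment manifest that targets time-shared GPU nodes (sketch only)."""
    pod_spec = {
        # Assumption: GKE labels time-sharing nodes with these keys.
        "nodeSelector": {
            "cloud.google.com/gke-gpu-sharing-strategy": "time-sharing",
            "cloud.google.com/gke-max-shared-clients-per-gpu": str(max_clients),
        },
        "containers": [{
            "name": name,
            "image": image,
            # A count of 1 means one shared slot, not a whole GPU's worth of time.
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {"metadata": {"labels": {"app": name}}, "spec": pod_spec},
        },
    }


if __name__ == "__main__":
    # Hypothetical workload name and image.
    print(json.dumps(time_shared_deployment("inference", "example.com/inference:latest", 3), indent=2))
```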

A

So now we have figured out how to provision a node and how to land workloads on those time-shared nodes, but there are some nuances that we need to be familiar with, and some issues and corner cases that we have to be mindful of.

A

So with time-shared GPUs, all the processes get their own separate address space, so there is no issue of data overlapping. However, no memory limits are enforced. What that means is that if the containers are not well behaved, you can get into out-of-memory situations, so the responsibility for restricting memory usage falls on each workload.

A

So how can we do this? There are two ways in which you can avoid an out-of-memory situation. The first is that you can use CUDA unified memory. It enables on-demand paging between host and GPU memory, and that way it avoids the out-of-memory situation.

A

The second solution is that you can configure this in the application. Application frameworks like TensorFlow or PyTorch expose knobs which you can set to avoid out-of-memory situations. So this is something to remember when you are running too many containers on the same GPU.
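
One way to pick a value for such a knob is to divide the GPU memory evenly among the maximum number of sharing containers, keeping some headroom. The sketch below only does that arithmetic; the 10% headroom and the function name are assumptions for illustration. The resulting fraction is the kind of value one might pass to a framework setting such as PyTorch's `torch.cuda.set_per_process_memory_fraction`.

```python
def per_container_memory_fraction(max_clients: int, headroom: float = 0.1) -> float:
    """Even split of GPU memory across containers, after reserving some headroom."""
    if max_clients < 1:
        raise ValueError("max_clients must be at least 1")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    # Assumption: every container sharing the GPU gets the same memory budget.
    return (1.0 - headroom) / max_clients


if __name__ == "__main__":
    fraction = per_container_memory_fraction(10)
    print(f"each of 10 containers may use {fraction:.0%} of GPU memory")
```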

A

You want to avoid out of memory situations.

A

So now I'm going to walk you through how autoscaling works when you have a time-shared GPU. Autoscaling is a key feature of Kubernetes.

A

It enables workloads to avoid over provisioning and under provisioning situations, thereby saving the cost while offering a better performance. So auto scaling is quite a complex topic, so I'm going to walk you through a very simple workflow.

A

So in this case we already have a time-shared GPU node. This node will expose a GPU utilization metric per container, and you can also use custom metrics. So if you wanted to specify queries per container, that could be your custom metric. The Horizontal Pod Autoscaler watches this metric.

A

It tracks this metric, and when the metric exceeds the threshold that you specify, it will add replicas of your container. So let's say you are watching the GPU utilization metric and the threshold is 70 percent.

A

When the utilization goes to 71 percent or higher, it will start adding replicas, because it thinks that the application is running hot and needs some help.
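
The scaling decision can be sketched with the replica formula from the Kubernetes Horizontal Pod Autoscaler documentation, desired = ceil(current × currentMetric / targetMetric). This is a simplified illustration, not the controller's full logic: the real controller also applies a tolerance (0.1 by default) and stabilization rules, so a reading of 71 against a target of 70 would normally not trigger scaling. The function name is made up for this example.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.0):
    """Replica count from the documented HPA ratio formula (simplified sketch)."""
    ratio = current_metric / target_metric
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * current_metric / target_metric)


if __name__ == "__main__":
    # Four replicas running at 90% GPU utilization against a 70% target.
    print(desired_replicas(4, 90, 70))
```

With `tolerance=0.0` the sketch follows the simplified behavior described in the talk; passing `tolerance=0.1` shows the default controller behavior.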

A

So when it does this, it can go a couple of ways. Now that we have added more replicas of the pod, each pod needs to land on a node. If there is an existing node which can accommodate it, that will happen. But if there is no such node available for this extra pod to land on, then it will automatically add new nodes to your cluster.

A

So the cluster autoscaler is smart and it will take care of this for you. Now, there are three main scenarios when we talk about autoscaling: scale up, scale down and auto-provisioning. I'm going to quickly walk you through all three.

A

So when pods are unschedulable, GKE is smart enough to scale up the most cost-effective node pool. This is to your advantage. How does it do that? It looks at the pod spec and at the node spec, sees which nodes can satisfy this pod spec, and out of those nodes works out which one will be the most cost effective to scale up.

A

So if there are too many pods waiting to be scheduled, it's also smart enough to figure out how many nodes it should add, whether that is one node or five nodes or ten nodes, to service all the outstanding pods.
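
A back-of-the-envelope version of that sizing decision is sketched below. It is not the cluster autoscaler's actual algorithm: it assumes every pending pod requests exactly one `nvidia.com/gpu` slot and that GPU slots are the only constraint, ignoring CPU, memory and cost. All names are illustrative.

```python
import math


def nodes_to_add(pending_pods, gpus_per_node, max_clients_per_gpu, free_slots=0):
    """Estimate how many time-shared GPU nodes are needed for the pending pods."""
    slots_per_node = gpus_per_node * max_clients_per_gpu
    # Pods that still have no slot after filling the free slots on existing nodes.
    unplaced = max(0, pending_pods - free_slots)
    return math.ceil(unplaced / slots_per_node)


if __name__ == "__main__":
    # 25 pending pods, one T4 per node shared by up to 10 containers.
    print(nodes_to_add(25, gpus_per_node=1, max_clients_per_gpu=10))
```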

A

You can also ask what happens if there is no existing node and you are starting from scratch, with no node pool running on the cluster. In this case too, GKE is sophisticated enough to automatically provision a node that will satisfy the needs of the workload. We call this auto-provisioning.

A

So let's say you're running your cluster and suddenly the load drops. Then GKE is smart enough to scale it down, and the way it does this is by monitoring the utilization of all the nodes in the cluster. When the utilization drops below a threshold, it will try to figure out whether all the workloads running on underutilized nodes can be safely consolidated onto fewer nodes.

A

If the answer is yes, then it will move the pods from the underutilized nodes onto fewer nodes and free up the extra resources. This will save you money by reducing the number of nodes needed to handle the workload.

A

Now, we talked about autoscaling in the context where you already have an existing node or node pool, and you scale it up when you have pending pods. What happens if there is no existing node pool and you want to bootstrap from scratch?

A

In that case, if you enable auto-provisioning on GKE, it will automatically figure out the best node and node pool configuration with which it can bootstrap the workload, and it will automatically add those nodes, starting from zero nodes. That's called auto-provisioning. It saves you the time and effort of configuring the node pools.

A

So, on the right-hand side here is an example. Let's say you have enabled auto-provisioning and you're just starting your cluster. There are no node pools. Based on this deployment spec, it knows that you have enabled time sharing, and it can figure out that it needs to add a time-shared node to the cluster. Node auto-provisioning will automatically do that for you, so it saves you effort.

A

So we talked about time sharing, and in the beginning I said we were going to talk about two distinct mechanisms for GPU sharing. So now I'm going to switch gears and talk about spatial multiplexing.

A

So this is a relatively new technology that was launched by NVIDIA. It is known as multi-instance GPU, or MIG. It allows MIG-enabled GPUs to be partitioned into GPU instances, and the key difference here is that the partitions are physically isolated, with dedicated compute and memory. Because of this physical isolation, it's called spatial multiplexing.

A

In the previous case, it was temporal multiplexing: you're just time slicing a single GPU across multiple containers. In this case, it supports simultaneous workload execution with guaranteed quality of service.

A

So now you have physical partitions, so you can run those containers in parallel, with all of them executing at the same time, and that gives you a better quality of service.

A

This is only supported on A100 GPUs as of now. We have done a lot of testing on this, and we have found that throughput increases linearly as you add more instances, which makes logical sense.

A

So, in the A100 GPU case, there are seven compute units and eight memory units. Each unit of memory is about 5 GB.

A

You can combine these compute and memory units in different configurations to slice the A100 GPU into different instances, and this table shows which combinations are legitimate.

A

So each combination is tagged as compute "g" dot memory "gb". Let's take an example: when we say 1g.5gb, it implies one compute unit and 5 GB of memory, and when you specify that, you can create seven instances with this particular configuration.

A

If you pick a different configuration, like 3g.20gb, you will get two instances.

A

So when you have seven instances, you can run seven containers on this particular GPU. If you have two instances, you run two containers in parallel on this particular GPU.
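
The profile naming scheme can be sketched as a small lookup table plus a parser. Only the 1g.5gb (seven instances) and 3g.20gb (two instances) rows come from the talk; the other rows are filled in from NVIDIA's published MIG profiles for the 40 GB A100 and should be read as an assumption here. The names `A100_40GB_PROFILES` and `parse_profile` are illustrative.

```python
# Maximum number of instances per profile on one A100 40 GB GPU.
# Only the 1g.5gb and 3g.20gb rows are stated in the talk.
A100_40GB_PROFILES = {
    "1g.5gb": 7,
    "2g.10gb": 3,
    "3g.20gb": 2,
    "4g.20gb": 1,
    "7g.40gb": 1,
}


def parse_profile(profile: str) -> tuple:
    """Split a name like '3g.20gb' into (compute units, memory in GB)."""
    compute, memory = profile.split(".")
    if not compute.endswith("g") or not memory.endswith("gb"):
        raise ValueError(f"unexpected profile name: {profile}")
    return int(compute[:-1]), int(memory[:-2])


if __name__ == "__main__":
    for name, count in A100_40GB_PROFILES.items():
        units, gigabytes = parse_profile(name)
        print(f"{name}: {count} instance(s), each with {units} compute unit(s) and {gigabytes} GB")
```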

A

So, similar to time sharing, you can configure this GPU with any of the partition sizes listed in the previous table, and in order to do that, you have to specify a parameter called GPU partition size. In this example, I picked 1g.5gb and, as we saw in the previous table, this creates seven instances, so this can run up to seven containers in parallel.

A

So when you inspect those nodes, you will see the nvidia.com/gpu resource with a resource count of seven. This is one particular example. You can slice it differently, depending on your needs.

A

Now, how do we deploy workloads on those nodes? Very similarly to time sharing, you will have a deployment spec. The first thing you will notice is that there is a resource count; in this case, you request one. Previously, we talked about how the nodes are already labeled with the kind of sharing solution that they are configured with.

A

So with the combination of node selectors, the scheduler can figure out which workloads can land on which slice of which GPU.

A

So in this case you can see that the replica count is seven, because there are seven instances. You can run up to seven containers on a single A100 GPU which is partitioned into seven instances.
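
The pod template for such a deployment might be sketched as below. The node selector key `cloud.google.com/gke-gpu-partition-size` is an assumption taken from the GKE documentation rather than from this transcript, and the function name, container name and image are hypothetical.

```python
import json


def mig_pod_spec(image, partition_size="1g.5gb"):
    """Pod spec that asks for one MIG instance of the given partition size (sketch only)."""
    return {
        # Assumption: GKE labels MIG nodes with their partition size under this key.
        "nodeSelector": {"cloud.google.com/gke-gpu-partition-size": partition_size},
        "containers": [{
            "name": "worker",
            "image": image,
            # One unit of nvidia.com/gpu maps to one MIG instance on such a node.
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    }


if __name__ == "__main__":
    # Seven replicas of this pod would fill one A100 split into seven 1g.5gb instances.
    print(json.dumps(mig_pod_spec("example.com/inference:latest"), indent=2))
```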

A

So now let's compare and contrast the two mechanisms, so you understand when to use MIG and when to use time sharing. As I mentioned before, in the case of MIG, the partitions are physical.

A

In the case of time sharing, the partitions are logical. When you have a physical partition, there is a maximum partition limit, which is by design a hardware limitation: an A100 can only be partitioned into a maximum of seven instances.

A

In the logical case, you can partition a GPU as many ways as you want; you can load many containers on a single GPU. But I will caution you that if you add too many containers to share the same GPU, you have to watch out for the overhead of context switching. So be careful how many containers you share a single GPU with. In the case of MIG, physical partitioning provides a lot of benefits.

A

For example, it provides physical isolation, and in many applications isolation is an important requirement. It also provides memory protection, which again avoids out-of-memory situations, so it's very beneficial, and it provides quality-of-service guarantees. So clearly, when you're looking for quality of service or better isolation, MIG is the better choice. None of this is possible in a time-sharing scenario, because you're just sharing a single physical GPU across many containers.

A

Because MIG partitions are physical, reconfiguring a MIG GPU, if you want to change the partitioning, requires a little bit of effort. In the case of time sharing, reconfiguration is very easy. Say you specified that you wanted to share a single physical GPU among 10 containers, and tomorrow you decide you only want to share it among five. You change one parameter and you are done. So reconfiguration is quite easy in the case of time sharing.

A

So when do you choose which? As I mentioned, if quality of service, isolation or protection from out-of-memory situations are your main criteria, then certainly MIG will be quite beneficial, because it provides those guarantees.

A

On the other hand, what we have found in practice is that time sharing is very good for bursty workloads. The benefit here is that, let's say you specify a GPU to be time-shared across 10 containers, but you only have one container to begin with. It will get the full power of the GPU. If you have two containers, together they will still get the full power of the GPU.

A

On the other hand, if you are working with MIG and you specify seven slices, but you only have one container running on those seven slices, then you are only getting one seventh of the performance, because the other six slices are going to stay idle. So that's the trade-off. When you have bursty traffic, you can use time sharing, and you're going to get much better utilization of the GPU and much more flexibility.
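
The trade-off can be made concrete with a small sketch comparing how much of the GPU's compute is reachable in each mode. It assumes equal-size MIG partitions with one container per partition, and it ignores context-switch overhead for time sharing; the function name and mode strings are made up for this illustration.

```python
def usable_gpu_fraction(mode: str, active_containers: int, partitions: int = 7) -> float:
    """Fraction of the GPU's compute that the active containers can use together."""
    if active_containers < 0:
        raise ValueError("active_containers must not be negative")
    if mode == "time-sharing":
        # Any active container can use the whole GPU during its time slice.
        return 1.0 if active_containers > 0 else 0.0
    if mode == "mig":
        # Idle partitions cannot be borrowed by the busy ones.
        return min(active_containers, partitions) / partitions
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for n in (1, 2, 7):
        ts = usable_gpu_fraction("time-sharing", n)
        mig = usable_gpu_fraction("mig", n)
        print(f"{n} active: time sharing {ts:.2f}, MIG {mig:.2f}")
```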

A

The other benefit of time sharing is that it works on all the GPU families we have, including A100 and including MIG partitions, whereas MIG only works on A100. So with MIG you don't have that flexibility across every GPU family.

A

There are things to consider beyond time sharing versus MIG. We recommend that you do GPU sharing only within a single trust boundary.

A

What that means is that if you have a scenario where a single user needs to run multiple applications, it's totally legitimate and okay to do GPU sharing, because you are working within a single trust boundary.

A

Similarly, if you have a single company or single tenant but multiple users, for example multiple data scientists running notebooks, and those notebooks want to share a GPU, that should be okay; again, we are within the same trust boundary. However, we don't recommend these solutions in a multi-tenant scenario. If you have multiple customers, where you have to cross the trust boundary, we do not recommend it, because we don't think the isolation properties of any of these solutions meet the security bar to allow sharing across customers.

A

So please keep in mind: within a single customer, it's totally okay to share. Sharing across customers is not advisable at the current state of the technology.

A

So, in summary, the key takeaway from this discussion is that we offer two solutions for GPU sharing on Kubernetes. The time-sharing solution works on every single GPU family and is the better solution for bursty workloads.

A

MIG only works on A100 GPUs, but it does provide better isolation, quality of service and out-of-memory protection.

A

It's your choice. Depending on your workload needs you can choose either of them, but keep in mind that this is only good within a single trust boundary. Don't use it across customers; that's not what we recommend. Thank you for listening to my talk. I'm open to any questions.

B

All right, thank you, everyone. I'm your moderator for this session, so I'll be running around with the mic. I see there's one over there and I'm going to get to you in just a second. First off, I'm going to ask a question from online: can you enforce that different containers use the same physical GPU? For example, you want to run an X server in one container and a desktop in another. These two containers need to share a single GPU; it won't work otherwise. You should ask for a GPU in both containers.

B

If a node has more than one GPU, there are no guarantees that they get the same GPU. Does that make sense?

A

Yes, I missed the last part, but yes, you can share GPUs across two different applications. The one thing to watch out for is the out-of-memory situation: if you're not careful, then in the time-sharing case you can encounter out-of-memory situations, but if you're doing it with MIG, that should work fine. And we do have customers sharing a single GPU across two totally different applications.

B

Thank you, okay, I'll bring this over to the person over here who had a question.

C

Thank you, thanks a lot for this talk. In the time-sharing case, do you observe any performance drop due to cache refresh while context switching, or not?

A

We have done extensive testing, and the results are very workload specific, so they cannot be translated across workloads. However, if you limit the number of containers you share a single GPU with, then the performance hit is negligible. But if you try to go to the extreme, like running 50 containers on a single GPU, and the GPU is a Tesla T4,

A

then certainly you will have too much overhead from context switching. But, for example, NVIDIA does a really clever job of managing the memory and other things, so we haven't seen a huge performance hit because of context switching in this scenario. As long as you limit the number of containers, the overhead is very small.

B

All right, I saw someone back here first. I don't know who it was, though. Okay, I'll come back.

D

Hi, thank you for your talk, very insightful. I have a question about the memory. I know that with MIG each instance can use only one slice of the memory. So if you use two instances, each can use only half the memory that the GPU actually provides. With time slicing,

D

is that the same or is it not the same? It didn't really become clear from the talk. So the question is: can two individual containers both use the full memory, one after the other, or can they use only half? And is this capped?

A

Yeah, so let me first clarify the question. In the case of MIG, you specify a memory slice for a particular instance, so that's the memory that is available to a given slice. That is very clear. In the case of time sharing, it's not as clear. What you have to watch out for is that the sum total of memory used by all the containers does not exceed the total memory that is available on the GPU.

A

That's why we mentioned that the out-of-memory situation is real in the case of time sharing, and you have to make sure that applications are well behaved and don't claim more memory than is physically available on a single GPU.
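
That rule of thumb amounts to a simple sum check, sketched below. The per-container budgets are made-up numbers, the function name is illustrative, and 16 GiB is used as the memory size of a Tesla T4; nothing here is enforced by the GPU or by Kubernetes under time sharing, so the check has to be done by whoever plans the workloads.

```python
def fits_in_gpu_memory(per_container_gib, gpu_memory_gib):
    """True if the planned memory budgets of all sharing containers fit on one GPU."""
    # Time sharing does not page memory in and out between containers,
    # so all budgets must fit at the same time.
    return sum(per_container_gib) <= gpu_memory_gib


if __name__ == "__main__":
    t4_memory_gib = 16
    print(fits_in_gpu_memory([4, 4, 6], t4_memory_gib))   # heterogeneous budgets that fit
    print(fits_in_gpu_memory([8, 6, 4], t4_memory_gib))   # budgets that oversubscribe the GPU
```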

D

Right, so together they still have to fit in the total amount of memory, because it's not being offloaded between one and the other, correct? Nice, yeah.

B

All right, we've got a couple of other questions over here, but first I'm going to do one from online again. If you run something like TensorFlow, it likes to allocate and keep the whole GPU memory, leaving no space for sharing. Any commentary on those types of situations?

A

Yeah, but it also allows you to limit the memory that you can request for your container. So as long as you use TensorFlow carefully and, when you're sharing, you make sure that the total requests do not exceed the physical capacity of the GPU, you can do it. We have customers using TensorFlow and sharing GPUs.

B

Okay, cool and.

E

Hi, thanks for the talk. Just a clarification: in the case of time sharing, I can allocate uneven time to the different containers, right, by providing a number of GPUs greater than one? In your example there was one, but I can provide two and it will get two times the time?

A

That's a great question, thanks for asking. There are a lot of nuances to that. If you ask for more than one, it basically allows you to bin pack heterogeneous workloads on the same GPU, but in the end the compute time is evenly divided, so each container will get the same amount of compute time. In terms of memory and other requirements, though, you can fit heterogeneous workloads on the same GPU. So that's the trick: you have to play with that counter.

E

Oh, I see, so it's a static one.

A

That means, time-wise, everybody gets the same time slice, but in terms of memory, it's up to you how you want to fit the different, heterogeneous workloads on the GPU.

B

What are you using under the hood?

A

So the solution we launched does not use MPS. That's on our roadmap.

B

Cool. And there's another question over here. Yes?

F

Hi, thanks for the talk. I have two questions. The first one: is this solution in GA, or still in preview?

A

So this is available. It's generally available, yes.

F

So, a technical question: can I request more than one slice for a single container? For example, I decided to use 1g.5gb, but I need two slices for my container. Can I do it?

A

Yes, you can do that.

F

So you put two in the request?

A

Okay, so in the case of MIG it's pretty straightforward. You ask for two, you get two slices, and you can run the container on two slices. The time-sharing case is a little bit tricky.

A

If you ask for more than one, you still get a proportionate time slice. So intuitively it's a little bit hard to make sense of: say I asked for five, but if there are two containers, each one is going to get an equal amount of time. But you can use that cleverly to fit in heterogeneous workloads if they have different memory requirements.

F

B

All right, we're a couple of minutes over time, so I'm going to go ahead and cut off questions there, but I'm sure Maulin will probably hang out for a few minutes if anyone would like to come up and ask him questions.

A

Yes, please come here and I'm happy to answer more questions.

A