Sponsored Keynote: Challenges and Opportunities in Making AI Easy and Efficient with Kubernetes - Maulin Patel, Google
The third reason is productivity. Kubernetes makes data scientists and AI practitioners more productive by freeing them from having to manage their own workstations or servers. It lets them focus on their business-critical mission, which is to build and train models, without having to worry about the underlying infrastructure and compatibility issues.
This assumption does not suit many distributed computing frameworks well. The majority of distributed computing frameworks, especially those used for AI/ML, are very sensitive to disruptions. They are intolerant of disruptions such as preemptions, failures, or maintenance events, and the problem becomes really acute when you do very large-scale training with thousands of nodes.
What this community needs is framework-agnostic elastic training, so our goal should be to support any framework without any code changes. This gives two main benefits. First, with elastic training you can run your training on spot VMs, which are a lot cheaper than on-demand VMs, so it saves a lot of cost. It also addresses another problem, which is obtainability.
As most of you know, GPUs are a scarce resource, and spot VMs are also a scarce resource. It is very hard to find, say, thousands of GPUs up front to start your training. With elastic training support, you can start training with however many spot GPUs are available to you, scale it up when more GPUs become available, and scale it down when you lose them. That also addresses the obtainability challenge.
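One way to picture the elastic behavior described here is a re-sharding step that re-divides the training data whenever the set of available spot GPUs changes. The following is a minimal illustrative sketch in Python; `shard_indices` is a hypothetical helper, not part of Kubernetes or of any real training framework.

```python
def shard_indices(num_samples: int, world_size: int) -> list[list[int]]:
    """Split sample indices as evenly as possible across world_size workers."""
    if world_size < 1:
        raise ValueError("need at least one worker")
    shards = [[] for _ in range(world_size)]
    for i in range(num_samples):
        shards[i % world_size].append(i)  # round-robin keeps shards balanced
    return shards

# Start with however many spot GPUs are obtainable at launch...
shards = shard_indices(1000, 2)
# ...scale up when more GPUs become available...
shards = shard_indices(1000, 8)
# ...and scale down when some are preempted, without losing the job.
shards = shard_indices(1000, 5)
```

A real elastic system would also rebalance optimizer state and re-form the collective communication group on each resize; the sketch only shows the data-partitioning idea.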
Another opportunity for this community is to enable native support for checkpoint migration and restoration in Kubernetes. The way it can work is that whenever the underlying infrastructure surfaces an impending maintenance event, or a preemption is coming, Kubernetes can transparently and gracefully take a snapshot or checkpoint and store it. This makes the system work-conserving: with current checkpointing mechanisms you lose the work done since the last epoch, but with transparent, on-demand checkpointing, all the training work that has happened is conserved.
So in my opinion, it is very hard to get this kind of information from existing Kubernetes primitives like Prometheus metrics or logging. To give a concrete example: in typical Kubernetes observability, you rarely have to deal with events that are months or even years apart. In the AI world, on the other hand, this is a very common occurrence. Let me give you some examples. Say you have a model that predicts customer churn.
A customer is acquired now, and at some point in the future that customer may churn. To study the accuracy of this model, you have to combine these two events, which may be spaced months or even years apart. Here is another example: say you have a model that predicts loan defaults.
A loan is issued now, and a default may happen sometime in the future. To understand the accuracy of this model, you again have to combine these events. So to compute accuracy, precision, recall, and many other metrics, the observability solution needs to join disparate events that may be spaced far apart.
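The join described here can be sketched in a few lines: predictions logged at acquisition time are matched with outcome events that arrive much later, and precision and recall are computed over the joined pairs. The data shapes and names below are assumptions for illustration, not from any real observability system.

```python
def score(predictions: dict[str, bool], outcomes: dict[str, bool]) -> tuple[float, float]:
    """Join per-customer predictions with later outcomes; return (precision, recall)."""
    tp = fp = fn = 0
    for customer, predicted in predictions.items():
        actual = outcomes.get(customer, False)  # the outcome event may arrive years later
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Churn predictions logged when each customer was acquired...
preds = {"alice": True, "bob": False, "carol": True}
# ...joined with churn outcomes observed long afterward.
outs = {"alice": True, "bob": True, "carol": False}
precision, recall = score(preds, outs)  # → (0.5, 0.5)
```

The hard part in practice is not the arithmetic but retaining and correlating events across such long time spans, which is exactly what typical Kubernetes observability stacks are not built for.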
This is one example of where we as a community have an opportunity to extend the existing observability solutions for Kubernetes to make them suitable for the AI/ML community.