Cloud Native Computing Foundation Kubernetes AI Day North America 2021, 30 Oct 2021

From YouTube: Serving Machine Learning Models at Scale Using KServe - Yuzhui Liu, Bloomberg

Description

Don’t miss out! Join us at our next event: KubeCon + CloudNativeCon Europe 2022 in Valencia, Spain from May 17-20. Learn more at https://kubecon.io The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.

Serving Machine Learning Models at Scale Using KServe - Yuzhui Liu, Bloomberg

KServe (previously known as KFServing) is a serverless open source solution to serve machine learning models. With machine learning becoming more widely adopted in organizations, the trend is to deploy larger numbers of models. Plus, there is an increasing need to serve models using GPUs. As GPUs are expensive, engineers are seeking ways to serve multiple models with one GPU. The KServe community designed a Multi-Model Serving solution to scale the number of models that can be served in a Kubernetes cluster. By sharing the serving container that is enabled to host multiple models, Multi-Model Serving addresses four limitations that the current ‘one model, one service’ paradigm encounters: 1) Compute resources (including the cost for public cloud), 2) Maximum number of pods, 3) Maximum number of IP addresses, 4) Maximum number of services. This talk will present the design of Multi-Model Serving, describe how to use it to serve models for different frameworks, and share benchmark stats that demonstrate its scalability.

A

Hi everyone, good afternoon. Thank you very much for being here and for your interest in KServe. I also know many of my team members are watching this session virtually, so thanks to my team members and to the contributors to KServe.

A

My name is Yuzhui. I am the team lead of the Data Science Runtime team at Bloomberg. Our team provides a data science platform with the functionality that Bloomberg's internal users need for the machine learning model development life cycle.

A

Those functionalities include data exploration using Jupyter notebooks and model training using popular frameworks such as TensorFlow, PyTorch, scikit-learn, et cetera. We also manage experiments and do online inference. Today, I'm going to talk about KServe, the open source project we initiated together with multiple collaborators.

A

We also ran into a recent problem: we want to deploy many models at scale, and I want to talk about how we address this scalability problem. First, I want to give you a little background about KServe. KServe was previously known as KFServing. If you are already familiar with KFServing, or are already deploying KFServing in your platform, KServe is the same as KFServing.

A

It recently moved to an independent GitHub organization, and it now enjoys more autonomy.

A

KServe has passed multiple important milestones. In September 2019, we released the first version of KServe as a subproject under Kubeflow. It was released under the name KFServing. A few months later, we introduced it at KubeCon US, and we spent the next year and a half developing the project. We released its stable v1beta1 version in 2021.

A

Finally, last month we renamed the project to KServe. It is a sign that it has reached the next level of maturity.

A

Here are links to two important articles and blog posts. You are very welcome to follow the links to read our announcement together with the Kubeflow community, and also the blog we posted on Tech at Bloomberg that talks about our journey of building a production-grade machine learning model serving solution.

A

We wouldn't have been able to achieve all this without all the awesome contributors. Here I want to spend a few seconds to acknowledge everyone who has contributed and collaborated with us.

A

For those of you who are less familiar with inference, this diagram shows a typical, simplified model development life cycle. Normally, our users are data scientists and machine learning engineers. They first prepare data and use a machine learning framework to train a model. Once the model is trained, the user wants to deploy it into a production system, where it should be able to answer real-time questions such as: given two sentences, can my model tell me how similar they are? Or, given a news article, the model should be able to run inference and tell us which topics are related to that article.

A

So how hard can it be?

A

It turns out building a solution like this is very, very difficult.

A

First, we want to think about the cost of deploying such a machine learning model: how much CPU, how much memory, and how much GPU resource is needed? And when there are no requests coming into the service, is there a way to automatically scale it up or down? We also want to monitor our machine learning services.

A

We want to think about how to do readiness checks and liveness checks. We also want to produce Prometheus metrics, which we can use to build dashboards and set alerts. We also care about secure rollouts.

A

If we build a new version of the service and it breaks the production system, how can we automatically detect that and stop the rollout? Can we canary a new version of a model, compare the results, and shift traffic between the two versions of the model?

A

We also want to define an inference protocol. There are many different types of model servers in the open source world, so how can we give users a consistent protocol layer for sending requests using gRPC, HTTP, and Kafka?

A

Not all users want to run inference using real-time requests. Some of them only want to run end-of-day batch inference. So how can we support this type of user?

A

There are also many different machine learning training frameworks. So in the serving step, we want to be able to serve models trained by different frameworks.

A

Once a model is running inference, the next question is: how can I explain my prediction? So we also integrate an explainer component into KServe.

A

So here comes KServe. KServe is a highly scalable and standards-based model inference platform on Kubernetes for trusted AI. At the lowest level are our compute resources. Normally, in a Kubernetes cluster, we have a collection of CPU, GPU, and memory, sometimes even TPU. Those are the resources we have for computing.

A

On top of all the compute resources, we run a Kubernetes layer. Kubernetes is used to orchestrate and manage all the compute resources. On top of the Kubernetes layer is our serverless layer. We use Knative and Istio to build the serverless layer. With this layer we are able to automatically scale up and down according to the incoming traffic.

A

We are also able to scale the number of pods down to zero when there are no requests coming in, so we can release the compute resources when they are not being used.

A

At the top level is the machine learning integration layer. We integrate with multiple popular model servers in the industry, so KServe is able to serve machine learning models trained by various machine learning frameworks.

A

This diagram explains the major components of the KServe solution. If you're already familiar with Kubernetes resources, you will know that most resources are represented by a YAML file. KServe is the same. We define a resource called InferenceService.

A

With it, a KServe user can describe what kind of machine learning model they want to deploy into the system. This request is handled by the Kubernetes API server, stored in etcd, and eventually reconciled by the InferenceService controller, which is the main controller of KServe.
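As a rough illustration of the resource described above, the following sketch builds a minimal InferenceService manifest as a Python dictionary and prints it as JSON. The `serving.kserve.io/v1beta1` API version and the `spec.predictor.sklearn.storageUri` layout follow the v1beta1 API as published by the KServe project; the service name, namespace, and storage URI are hypothetical placeholders and do not come from the talk.

```python
import json


def build_inference_service(name: str, namespace: str, storage_uri: str) -> dict:
    """Return a minimal InferenceService manifest as a plain dict.

    Only the predictor is declared; transformer and explainer are optional
    components and are left out of this sketch.
    """
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictor": {
                # The framework key selects the model server type.
                "sklearn": {"storageUri": storage_uri},
            }
        },
    }


# Hypothetical values, for illustration only.
manifest = build_inference_service(
    name="example-model",
    namespace="team-a",
    storage_uri="s3://example-bucket/models/example-model",
)
print(json.dumps(manifest, indent=2))
```

Submitting such a manifest to the cluster is what triggers the reconcile loop described in the talk; the sketch stops at building the document.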

A

After the InferenceService controller reconciles the incoming resource, it creates the underlying major components. One of the most important components we create is the predictor, which is essentially the model server we run.

A

With the predictor component, the model server can handle incoming requests and return inference results. The second important component is the transformer. Sometimes users want to implement customized pre-processing and post-processing steps to translate a data point into a format that the predictor can understand, and then post-process the result back into a format that an application can understand. That's how customized implementations can be integrated into KServe.
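The pre-process, predict, post-process flow of a transformer can be sketched in a few lines. This is a plain-Python illustration, not the KServe SDK: the `TransformerSketch` class, the `fake_predictor` function, and the payload shapes are all hypothetical and chosen only to show where the custom steps sit relative to the predictor.

```python
from typing import Callable, Dict


class TransformerSketch:
    """Wraps a predictor with custom pre- and post-processing steps."""

    def __init__(self, predictor: Callable[[Dict], Dict]):
        self.predictor = predictor

    def preprocess(self, raw: Dict) -> Dict:
        # Turn application data (text) into the numeric format the
        # predictor expects: here, one word-count feature per text.
        return {"instances": [[len(text.split())] for text in raw["texts"]]}

    def postprocess(self, prediction: Dict) -> Dict:
        # Turn raw scores back into labels the application understands.
        labels = ["long" if score >= 0.5 else "short"
                  for score in prediction["predictions"]]
        return {"labels": labels}

    def __call__(self, raw: Dict) -> Dict:
        return self.postprocess(self.predictor(self.preprocess(raw)))


def fake_predictor(payload: Dict) -> Dict:
    """Stand-in for a model server: scores 1.0 when a text has more than 3 words."""
    return {"predictions": [1.0 if row[0] > 3 else 0.0
                            for row in payload["instances"]]}


transformer = TransformerSketch(fake_predictor)
result = transformer({"texts": ["hello world", "serving many models at scale"]})
print(result)
```

In a real deployment the predictor call would be a network request to the model server rather than a local function call.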

A

We also have an explainer component, which uses the Alibi explainer. It can explain why an inference result was produced.

A

A very critical part of KServe is that we define a standard inference protocol. On this standard we work very closely with multiple model server communities, including Triton, TorchServe, and MLServer. We make sure that we, the KServe community, can come up with a consistent inference protocol to provide a unified user experience.

A

This is the set of HTTP protocols we have defined. As you can see, we have a standard protocol for the model server to do liveness checks and readiness checks, and to check whether a model is ready to take incoming requests.

A

We can also use this protocol to check the server's metadata and a model's metadata and, of course, most importantly, to run inference.
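The slide itself is not captured in the transcript, so the sketch below lists the HTTP endpoints as they appear in the published V2 inference protocol, on the assumption that this is the protocol shown in the talk. The helper names `v2_paths` and `build_infer_request`, the tensor name `input-0`, and the model name are hypothetical; no request is actually sent.

```python
from typing import Dict, List


def v2_paths(model_name: str) -> Dict[str, str]:
    """Map each protocol operation to its HTTP path for one model."""
    return {
        "server_live": "/v2/health/live",                  # GET, liveness check
        "server_ready": "/v2/health/ready",                # GET, readiness check
        "model_ready": f"/v2/models/{model_name}/ready",   # GET, model readiness
        "server_metadata": "/v2",                          # GET
        "model_metadata": f"/v2/models/{model_name}",      # GET
        "infer": f"/v2/models/{model_name}/infer",         # POST
    }


def build_infer_request(values: List[float]) -> Dict:
    """Build a V2-style inference body holding a single 1-D FP32 tensor."""
    return {
        "inputs": [
            {
                "name": "input-0",       # hypothetical tensor name
                "shape": [len(values)],
                "datatype": "FP32",
                "data": values,
            }
        ]
    }


paths = v2_paths("example-model")
body = build_infer_request([0.1, 0.2, 0.3])
print(paths["infer"], body)
```

Because every compliant model server exposes the same paths and body shape, a client written against this layout does not need to change when the server behind it changes.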

A

Similarly, we have a set of gRPC protocols we can use to check the health state, server metadata, and model metadata, and to run inference.

A

With the standard protocol, we can easily integrate with multiple model servers, and clients can send requests consistently.

A

We have now been running KServe in our production environment for a while, and we are starting to run into a new scalability problem: how do we deploy a large number of models in production?

A

So, let's take a look at how the current approach works. How does KServe currently deploy a machine learning model? The gray box here represents the inference platform cluster. In this cluster, each of our users owns their own namespace, represented by the light blue box. In their own namespace, they can run multiple inference services, and each inference service fetches a model from external model storage.

A

The model storage can be S3, GCS, or even an HTTP service. Once the model is downloaded into the inference service, it opens up an HTTP endpoint where users can send request data and get inference results back.

A

So what kind of problem does this approach bring when we want to scale up the number of models? To scale up the number of models, we essentially need to scale up the number of services we run in our platform, which doesn't scale very well. I will talk about the scalability limitations we foresee if each team wants to run hundreds or even thousands of models in our inference platform.

A

Those are the limitations we are already aware of and, of course, there may be other limitations we run into.

A

First, I want to talk about compute resource limitations. Each inference service comes with a certain amount of resource overhead. We have a sidecar that runs alongside each model server; it handles the incoming requests and produces Prometheus metrics.

A

It can also do certain batching and logging, so the sidecar needs a certain amount of CPU and memory to run alongside the model server. The resources the sidecar requires are configurable, but, for example, let's say each sidecar takes 0.5 CPU and 0.5 gigabytes of memory as overhead.

A

Based on this configuration, if we deploy 10 models and each model has two replicas, then each model's resource overhead is around 1 CPU and 1 gigabyte of memory. But if we can figure out a way to load 10 models into one inference service, then on average each model's resource overhead is about 0.1 CPU and 0.1 gigabytes of memory. That's a big reduction in resources.
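The overhead arithmetic can be checked with a short calculation. It uses the example figures from the talk (0.5 CPU and 0.5 GB of memory per sidecar, two replicas, ten models) and assumes one sidecar per replica pod; the function and constant names are hypothetical.

```python
SIDECAR_CPU = 0.5       # CPU cores per sidecar (example figure from the talk)
SIDECAR_MEM_GB = 0.5    # GB of memory per sidecar (example figure from the talk)
REPLICAS = 2            # replica pods per inference service


def overhead_per_model(models_per_service: int) -> tuple:
    """Return (cpu, mem_gb) of sidecar overhead attributed to each model.

    Assumes one sidecar per replica pod, shared evenly by every model
    hosted in that inference service.
    """
    cpu = SIDECAR_CPU * REPLICAS / models_per_service
    mem_gb = SIDECAR_MEM_GB * REPLICAS / models_per_service
    return cpu, mem_gb


one_model_per_service = overhead_per_model(1)    # one model, one service
ten_models_per_service = overhead_per_model(10)  # ten models share a service

print(one_model_per_service)   # (1.0, 1.0)
print(ten_models_per_service)  # (0.1, 0.1)
```

Across ten models, that is 10 CPU and 10 GB of sidecar overhead in the one-model-per-service layout versus 1 CPU and 1 GB when the ten models share one service.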

A

The second limitation we foresee running into is the maximum pod limit. Some of you may already be familiar with the Kubernetes default setting: by default, we can run 110 pods on each node. And based on the official Kubernetes scalability best practices, we shouldn't really run more than 100 pods per node.

A

Based on this limit, on a 50-node cluster we can deploy around 1,000 to 4,000 models, depending on the number of replicas we want to configure per model.

A

The third limitation we foresee running into is the maximum number of IP addresses. I think a lot of you also understand that each pod has an independent IP address in a Kubernetes cluster.

A

IP addresses are assigned to new models and to replicas of models. If you run transformers and explainers in KServe, IP addresses need to be assigned to them as well, not to mention the basic Kubernetes control plane pods running in the cluster.

A

The number of IP addresses available in each Kubernetes cluster varies a lot depending on how the admin manages the cluster. But I want to point out that in one of the testing clusters where we ran a test, we had several thousand IP addresses available and, based on that limitation, we could run several thousand models.

A

So, in order to solve the scalability problem, we worked very closely with our collaborators from IBM and came up with a solution called ModelMesh.

A

This is the diagram of the ModelMesh solution, so let me walk you through it from top to bottom. At the top is the machine learning application, which sends inference requests into ModelMesh. One ModelMesh can contain multiple serving runtimes; each serving runtime is essentially a different type of model server. In this diagram, there are two serving runtimes available. Different serving runtimes can provide serving solutions for different types of machine learning models.

A

One critical component in this diagram is the mesh and puller sidecars that run alongside the model server. The light green box and light pink box here represent different model servers, and you can see the mesh and puller sidecars running alongside them.

A

The sidecar decides when and where to load and unload models based on usage and current request volume. If a particular model is under heavy load, it will be scaled across more pods.

A

You can see those little circles inside the serving runtime. They represent different models. If we look at model B, in the blue circle, it is scaled to have two replicas. So compared to model A, which is in the green circle, it can handle more inference requests.

A

The mesh sidecar also acts as a router. ModelMesh stores a model-to-pod routing table in an external etcd. So when an inference request comes in, the sidecar looks up the routing table, figures out which model is loaded into which pod ID, and routes the request to the correct pod according to the routing table.
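The lookup-then-route step can be sketched with an in-memory table. This is a toy model, not ModelMesh code: a Python dictionary stands in for the routing table kept in etcd, the `RoutingTable` class and pod names are hypothetical, and round-robin selection among pods is an assumption made for the example, since the talk does not say how a pod is chosen when a model is loaded in several.

```python
from typing import Dict, List


class RoutingTable:
    """Toy model-to-pod routing table (a dict stands in for etcd)."""

    def __init__(self) -> None:
        self._pods_by_model: Dict[str, List[str]] = {}
        self._cursors: Dict[str, int] = {}

    def register(self, model: str, pod: str) -> None:
        """Record that `model` has been loaded into `pod`."""
        pods = self._pods_by_model.setdefault(model, [])
        if pod not in pods:
            pods.append(pod)

    def route(self, model: str) -> str:
        """Pick a pod that has `model` loaded (round-robin, an assumption)."""
        pods = self._pods_by_model.get(model)
        if not pods:
            raise LookupError(f"model {model!r} is not loaded in any pod")
        index = self._cursors.get(model, 0) % len(pods)
        self._cursors[model] = index + 1
        return pods[index]


table = RoutingTable()
table.register("model-a", "pod-1")   # model A has one copy
table.register("model-b", "pod-1")   # model B has two copies
table.register("model-b", "pod-2")

routes_b = [table.route("model-b") for _ in range(3)]
route_a = table.route("model-a")
print(routes_b, route_a)
```

The example mirrors the diagram in the talk: model B, loaded in two pods, has its requests spread across both, while model A is always routed to its single pod.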

A

You may now be curious what kind of serving runtimes ModelMesh provides. As out-of-the-box integrations, we provide Triton Inference Server, which is developed by NVIDIA. This model server can serve models trained by machine learning frameworks such as TensorFlow, PyTorch, TensorRT, or ONNX. We also integrate by default with Seldon's MLServer. Seldon's MLServer is a Python-based server. It can serve frameworks such as scikit-learn, XGBoost, or LightGBM.

A

A lot of you may be curious what kind of performance ModelMesh can provide. If we co-locate multiple models in the same pod, will that have an impact on the latency or throughput this solution can provide? So we did a performance test. The test was done on a single-node cluster with 8 CPUs and 64 gigabytes of RAM, and we deployed a very, very simple string model. It is only around 700 bytes.

A

If we use the traditional one-model-per-container deployment approach, we are limited by CPU and can deploy approximately 40 models in this testing cluster.

A

If we deploy onto a larger node, we will eventually be limited by IP addresses.

A

But now, if we move to the ModelMesh deployment, we are able to pack around 20k models into this testing cluster, at which point we essentially run into the memory limit.

A

In addition to the density test, we also did a latency test.

A

This latency test was done by running two Triton serving runtimes, and we gradually increased the QPS from 25 per second all the way to 2,800 per second. Each performance test ran for 60 seconds at each QPS. We also gradually increased the density of the ModelMesh from 1,000 to 2,000, all the way to 20,000 models.

A

As we increased the density of the models, we noticed a slight increase in latency, but for single-digit-millisecond inference latency, one worker node can support about 20k models at up to 1,000 QPS.

A

I'd also like to point out that this performance test was done on CPU nodes. Normally, when we run inference on GPU, the performance can increase dramatically.

A

Now, ModelMesh is already released as part of the KServe 0.7 deliverable. So you are all welcome to check it out and try ModelMesh.

A

There's still a lot of work we want to continue on ModelMesh, so here's the roadmap we have in mind.

A

In Q4 2021, we want to have better InferenceService and serving runtime integration. Currently, in order to use ModelMesh, each user namespace needs to have its own ModelMesh controller, so we would like to enhance the ModelMesh controller to support multiple namespaces.

A

Currently, ModelMesh only supports downloading models from S3 storage. We want to expand the storage we support to include GCS and HTTP services. Next year, in Q1, we want to spend some time working on the inference graph.

A

We will also start to extend ModelMesh to support transformers, so users can plug in their customized pre-processing and post-processing implementations. We also want ModelMesh to support canary rollouts, and eventually to consolidate the ModelMesh controller with the main KServe controller.

A

So you're all very welcome to contact us by visiting our website, checking out our GitHub or Slack, or just talking to me after this talk. Okay, now I'm open to questions.

B

I have two questions, actually. You mentioned several types of constraints, like CPU and memory constraints, for CPU-only inference.

Does this picture change when you start using GPUs for inference? I would imagine there will be another set of constraints on top of the ones you mentioned already.

A

Oh, okay. Can you repeat the second part?

B

How does the picture change when you start using GPU-assisted inference, in terms of the constraints in the system?

A

I'm sorry, can I walk closer?

B

So how does the picture change when you start using GPU-assisted inference? What kind of constraints, problems, or bottlenecks do you start seeing?

A

Okay, thank you very much for the question. I mentioned that there's a compute resource overhead that comes with each inference service. The question is: if we start to use GPUs, how does that overhead change? Nothing changes. For example, if we look at the diagram, model one is deployed into a model server, and the model server is the one that requires CPU or GPU for inference.

A

That's one independent container running inside the inference service. Alongside that model server, there's another container: the extra container running as a sidecar, which requires extra CPU and memory. So changing the model server from using CPU to GPU doesn't really change the sidecar's requirement for compute resources.

B

Yeah, but I mean, if you start using GPU inference, then the GPU is going to be a bottleneck in this inference, right, depending on the load. So you effectively introduce another level of possible bottlenecks in the system.

B

I was curious whether you saw those bottlenecks in inference, and how you address them.

A

Do you mean when we start to use GPUs?

B

When you start using gpu yeah.

A

So do you mean, if we start to use GPUs, is there a way to use GPU resources to address that overhead?

B

Because it's a very specific kind of resource, yeah.

A

That's a very good question. This may lead to some discussion about slicing GPUs. Currently, if we want to request GPU resources in Kubernetes, there's no straightforward or easy way to request a slice of a GPU. In the managed Kubernetes context, it's very easy to think: well, I have a GPU allocated to this inference service. Can I request a slice of the GPU and just use that for the sidecar?

A

The answer is that there's no very easy way to do that. So when we request a GPU, we request a full GPU. In real use cases, we very often notice that even though a model server requests a full GPU, when the inference happens it doesn't really use the full capacity of the GPU. Sometimes it doesn't even use the full capacity of the memory. If we look at typical GPUs, most of them come with 32 gigabytes of memory, and some models are actually really small.

A

They are at the megabyte level. Some larger ones, like a BERT model, may be at the gigabyte level, but that still only consumes a small fraction of the full GPU memory. In terms of computing power, the usage is sometimes even less.

A

The amount of computing power you consume depends heavily on the volume of requests you send to the model server. When you have a lot of concurrent requests sent to a model server that has a GPU allocated to it, you can see the GPU consumption go up. But when there's only a small volume coming to the GPU-powered model server, it's very obvious that the consumption of GPU computing power decreases a lot.

A

I think that also really depends on what kind of model server is requesting that GPU and how well optimized it is for the GPU. That's why we work very closely with Triton, because Triton is developed by NVIDIA, and they have a lot of GPU optimizations in the Triton model server.

B

All right, thank you.

C

Okay, one more question and then we'll have to take our break. As I walk over, I'll just remind everybody that after this we have a break until 2:55 Pacific time, and then the sessions will start back up again.

D

So in this diagram, you're talking about single-model serving versus multi-model serving. In multi-model serving, we're saying that multiple models are running in a single Kubernetes pod. Is that the correct understanding?

A

Yes, that's the correct understanding. What we're trying to do is co-locate multiple models that can be served by the same model serving runtime.

D

But isn't that a little bit against the Kubernetes model? The reason I'm saying that: one thing is that all of these models are competing for resources, because you don't have isolation of resources, so there is no quality of service. If one model takes more compute resources, it will impact the rest of the models. Second, if this pod goes down or this node goes down, all the models go down together.

A

Yeah, that's a very good question. Each model can take a different amount of compute resources. Actually, internally in our working group we had a long discussion about this. Let me go back to this diagram. One of the original ideas we had was that each model would have the same number of replicas, spread across all the replicas belonging to a model server. Then we started to realize a problem.

A

Let's say we have models A, B, and C in the same pod. Model A takes most of the requests, and B and C only take about one request per hour. Because model A needs to handle really high volume, model A will drive the serving runtime up to many replicas, like 10 or 20, and models B and C will be forced to scale up together with it.

A

That's the idea we spent a lot of time discussing, and that's why we moved to the ModelMesh idea, so we can scale different models differently. Looking at the diagram again, model B has more replicas, so collectively model B owns more compute resources in this serving runtime. Model A only has one replica, so collectively model A owns less compute resources.

A

That's why we designed the solution so that each model can have a different number of replicas across all the pods that belong to one serving runtime.
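The difference between the two designs discussed in this answer can be made concrete with a small calculation. The traffic figures and the per-copy capacity below are hypothetical, chosen to match the shape of the example (one busy model, two that see about one request per hour), and the function and variable names are made up for this sketch.

```python
import math
from typing import Dict

CAPACITY_PER_COPY = 100.0  # requests/second one loaded copy can serve (assumed)

# Hypothetical traffic: model A is busy, B and C see about one request per hour.
traffic_rps: Dict[str, float] = {"A": 1000.0, "B": 1 / 3600, "C": 1 / 3600}


def copies_needed(rps: float) -> int:
    """Copies of a model needed to serve `rps`, with at least one kept loaded."""
    return max(1, math.ceil(rps / CAPACITY_PER_COPY))


# Per-model scaling: each model gets only the copies its own traffic needs.
per_model = {name: copies_needed(rps) for name, rps in traffic_rps.items()}
per_model_total = sum(per_model.values())

# Lockstep scaling (the original idea): every model is loaded in every
# replica of the runtime, so the busiest model sets the count for all.
lockstep_replicas = max(per_model.values())
lockstep_total = lockstep_replicas * len(traffic_rps)

print(per_model, per_model_total)         # {'A': 10, 'B': 1, 'C': 1} 12
print(lockstep_replicas, lockstep_total)  # 10 30
```

Under these assumed numbers, lockstep scaling keeps 30 model copies loaded while per-model scaling keeps 12, which is the saving the speaker describes for models B and C.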

C

Okay, thank you very much, and we will meet back here again at 2:55. Thank you.