From YouTube: How We Built an ML Inference Platform with Knative - Dan Sun, Bloomberg LP & Animesh Singh, IBM

Description


Deploying and scaling machine learning (ML)-driven applications in production is rarely a simple task. However, serverless inference has been simplified and accelerated through the use of Knative. Knative runs serverless containers on Kubernetes with ease and handles the details of networking, request-volume-based autoscaling (including scale-to-zero), and revision tracking. It also enables event-driven applications by integrating seamlessly with various event sources. In this session, the speakers will discuss why their organizations initially chose Knative when building their ML inference platforms, and how these efforts evolved into the KServe project (github.com/kserve). We will also discuss how we leverage Knative to implement blue/green/canary rollout strategies for safe production updates to our ML models, improve GPU utilization with scale-to-zero functionality, and build an Apache Kafka event-based inference pipeline. At the end of the talk, we will share some of our testing benchmarks (compared with the Kubernetes HPA), as well as performance optimization tips that have enabled us to run hundreds to thousands of Knative services in a single cluster.
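To make the two techniques above concrete, here is a minimal sketch of the kind of Knative Service manifest that enables scale-to-zero and a canary traffic split between revisions, built as a Python dict. This is an illustration, not the speakers' actual configuration: the service name `sentiment-model`, the image registry, and the 90/10 split are hypothetical, and the `autoscaling.knative.dev/min-scale` and `autoscaling.knative.dev/target` annotations follow the Knative autoscaling documentation.

```python
import json

def make_canary_service(name: str, canary_percent: int) -> dict:
    """Build an illustrative Knative Service manifest with scale-to-zero
    enabled and `canary_percent` of traffic routed to the latest revision.
    All names here are hypothetical examples, not from the talk."""
    return {
        "apiVersion": "serving.knative.dev/v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        # "0" lets an idle revision scale down to zero pods,
                        # which is what frees up otherwise-idle GPUs.
                        "autoscaling.knative.dev/min-scale": "0",
                        # Target concurrent requests per pod for the
                        # Knative Pod Autoscaler (KPA).
                        "autoscaling.knative.dev/target": "10",
                    }
                },
                "spec": {
                    "containers": [
                        # Hypothetical model-server image.
                        {"image": f"registry.example.com/{name}:latest"}
                    ]
                },
            },
            "traffic": [
                # A pinned stable revision keeps most production traffic.
                {"revisionName": f"{name}-stable",
                 "percent": 100 - canary_percent},
                # The latest revision receives the canary slice; if it
                # misbehaves, dropping its percent back to 0 rolls back.
                {"latestRevision": True, "percent": canary_percent},
            ],
        },
    }

manifest = make_canary_service("sentiment-model", canary_percent=10)
print(json.dumps(manifest, indent=2))
```

In practice a manifest like this would be applied with `kubectl apply`; Knative then creates a new revision on each update, and adjusting the `percent` values shifts traffic gradually from the stable revision to the canary.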