From YouTube: Model Serving at the Edge Made Easier - Paul Van Eck & Animesh Singh, IBM

Description


As edge devices consume the world, the ability to deploy AI models on them becomes increasingly vital. Managing numerous models across a multitude of edge hosts can be tricky, and the limited compute power of edge hosts makes it necessary to eliminate as much overhead as possible. These are common pain points holding users back from large-scale adoption. However, combining ModelMesh with technologies like K3s and MicroShift has made employing such a system dramatically more practical. As the multi-model serving backend of KServe, ModelMesh offers a small-footprint control plane for managing model deployments on Kubernetes. Using multi-model runtimes with intelligent model loading and unloading, ModelMesh makes the most of a limited set of resources while still serving many models for inference. Come to this talk to get the edge on edge model serving!
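
To give a concrete sense of what serving through ModelMesh looks like from the client side, here is a minimal sketch of a KServe v2 REST inference request against a ModelMesh-served model. It assumes the optional REST proxy is enabled and reachable locally (for example via a port-forward); the model name, input tensor name, shape, and port are illustrative placeholders, not details from the talk.

```python
# Minimal sketch: query a model served by ModelMesh using the KServe v2 REST
# inference protocol. Assumes the optional REST proxy is enabled and reachable
# (e.g. via `kubectl port-forward`); model and tensor names are placeholders.
import requests

MODEL_NAME = "example-sklearn-model"  # hypothetical model name
URL = f"http://localhost:8008/v2/models/{MODEL_NAME}/infer"

payload = {
    "inputs": [
        {
            "name": "input-0",            # placeholder input tensor name
            "shape": [1, 4],              # placeholder shape
            "datatype": "FP32",
            "data": [5.1, 3.5, 1.4, 0.2],
        }
    ]
}

response = requests.post(URL, json=payload, timeout=10)
response.raise_for_status()
print(response.json())  # v2-style response containing "outputs" tensors
```

The same v2 inference protocol is also available over gRPC, which is typically how ModelMesh is reached in practice; the REST form above is just the easiest to show in a few lines.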