From YouTube: Serverless for ML Inference on Kubernetes: Panacea or Folly? - Manasi Vartak, Verta Inc

Description

Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon Europe 2021 Virtual from May 4–7, 2021. Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.

Serverless for ML Inference on Kubernetes: Panacea or Folly? - Manasi Vartak, Verta Inc

As providers of an end-to-end MLOps platform, we find that autoscaling ML inference is a frequent customer ask. Recently, serverless computing has been touted as the panacea for elastic compute, promising flexibility and lower operating costs. For ML, however, the need to precisely specify hardware configurations and the long warm-up times of certain models exacerbate the limitations of serverless. To provide the best solution to our customers, we have run extensive benchmarking experiments comparing the performance of serverless and traditional computing for inference workloads running on Kubernetes (with Kubeflow and with the ModelDB MLOps Toolkit). Our experiments have spanned a variety of model types, data modalities, hardware, and workloads. In this talk, we present the results of our benchmarking study and provide a guide to architecting your own k8s-based ML inference system.
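For readers who want to reproduce a small slice of this kind of comparison, the sketch below measures request latency against two inference endpoints running on a cluster: one behind a scale-to-zero serverless layer and one behind an always-on Deployment. The endpoint URLs, request payload, and request count are placeholder assumptions for illustration, not details from the talk; first-request latency is reported separately because that is where serverless cold starts and model warm-up show up.

# Minimal latency-benchmark sketch (Python 3.8+, needs the `requests` package).
# The two endpoint URLs and the payload below are hypothetical placeholders;
# substitute the serverless and always-on inference services you deploy.
import time
import statistics
import requests

ENDPOINTS = {
    "serverless": "http://serverless-model.example.internal/predict",    # placeholder
    "provisioned": "http://provisioned-model.example.internal/predict",  # placeholder
}
PAYLOAD = {"instances": [[0.1, 0.2, 0.3, 0.4]]}  # dummy feature vector
REQUESTS_PER_ENDPOINT = 50

def benchmark(url: str, n: int) -> dict:
    """Send n sequential requests and return latency statistics in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        resp = requests.post(url, json=PAYLOAD, timeout=60)
        resp.raise_for_status()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return {
        "first_request_ms": round(latencies[0], 1),  # includes any cold start / warm-up
        "median_ms": round(statistics.median(latencies), 1),
        "p95_ms": round(statistics.quantiles(latencies, n=20)[18], 1),
    }

if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        print(name, benchmark(url, REQUESTS_PER_ENDPOINT))

Running the requests sequentially keeps the sketch simple; a fuller comparison of the kind described in the talk would also vary concurrency, hardware, model type, and data modality.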

https://sched.co/ekCB