From YouTube: Automated Machine Learning Performance Evaluation - Alejandro Saucedo

Description

Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon North America 2021 in Los Angeles, CA from October 12-15. Learn more at https://kubecon.io The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.

Automated Machine Learning Performance Evaluation - Alejandro Saucedo, The Institute for Ethical AI & Machine Learning

Overview

Deployed production machine learning models come in different sizes, shapes and flavours when run on cloud native infrastructure, each with varying hardware (and software) requirements. Whether it is RAM, CPU, GPU or disk space, there won't be a single globally optimal configuration for all your models' training and inference. In this talk we will cover the motivations and concepts behind general benchmarking in software, as well as the key nuanced requirements for applying these concepts to machine learning systems. We will learn about the theory behind benchmarking machine learning models specifically, as well as the parameters that need to be accounted for, including latency, throughput, spikes, performance percentiles and outliers, among others. We will dive into a hands-on example, where we will benchmark a model across multiple parameters to identify optimal performance on specific hardware using Argo, Kubernetes and Seldon Core.
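The parameters mentioned in the overview (latency, throughput, percentiles, outliers) can be sketched with a minimal benchmarking loop. This is not the talk's actual tooling (which uses Argo, Kubernetes and Seldon Core); it is a hedged illustration where `fake_model_predict` is a hypothetical stand-in for a real inference call, such as an HTTP request to a model endpoint.

```python
import time

def fake_model_predict(x):
    # Hypothetical stand-in for a real model inference call
    # (e.g. a request to a Seldon Core deployment); illustration only.
    time.sleep(0.001)
    return x * 2

def benchmark(predict, payload, n_requests=200):
    """Measure per-request latency, then derive throughput and
    latency percentiles from the collected samples."""
    latencies = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        predict(payload)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    latencies.sort()
    # Simple nearest-rank percentile over the sorted samples.
    pct = lambda p: latencies[min(int(p / 100 * n_requests), n_requests - 1)]
    return {
        "throughput_rps": n_requests / elapsed,
        "p50_ms": pct(50) * 1000,
        "p95_ms": pct(95) * 1000,
        # Tail percentiles are where spikes and outliers show up.
        "p99_ms": pct(99) * 1000,
    }

results = benchmark(fake_model_predict, 1.0)
print(results)
```

Reporting percentiles rather than a single average matters because tail latency (p95/p99) often diverges sharply from the median under load; a production benchmark would additionally sweep this loop across hardware configurations and concurrency levels.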