
Description

Don’t miss out! Join us at our upcoming hybrid event: KubeCon + CloudNativeCon North America 2022 from October 24-28 in Detroit (and online!). Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.

How Cookpad Leverages Triton Inference Server To Boost Their Model Serving - Jose Navarro & Prayana Galih, Cookpad

The adoption of MLOps practices and tooling by organizations has considerably reduced the pain points of productionising Machine Learning models. However, as the number of models a company deploys grows, along with the diversity of frameworks used to train those models and the different infrastructure each model requires, new challenges arise for Machine Learning Platform teams, e.g.: How can we deploy new models from the same or different frameworks concurrently? How can we improve throughput and optimise resource utilisation in our serving infrastructure, especially GPUs? In this session, Cookpad's ML Platform Engineers will discuss how Triton Inference Server, an open-source model-serving tool from NVIDIA, can simplify the process of model deployment and optimise resource utilisation by efficiently running concurrent models on single-GPU, CPU-only, and multi-GPU servers.
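The concurrent-model serving the abstract describes is driven by Triton's per-model configuration file. As a rough illustration (not from the talk itself), a `config.pbtxt` like the sketch below tells Triton to run multiple instances of one model on a single GPU and to batch requests server-side; the model name, framework, and tensor shapes here are invented for the example.

```protobuf
# config.pbtxt — illustrative Triton model configuration
# (model name, platform, and tensor shapes are hypothetical)
name: "recipe_ranker"
platform: "onnxruntime_onnx"
max_batch_size: 32

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ 128 ]
  }
]
output [
  {
    name: "scores"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]

# Run two copies of the model concurrently on GPU 0; Triton schedules
# incoming requests across the instances to raise GPU utilisation.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

# Combine individual requests into batches server-side, waiting at most
# 100 microseconds to fill a batch, trading a little latency for throughput.
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

Placing a configuration like this next to each model in the model repository is how a single Triton server can host models from different frameworks side by side, which is the utilisation problem the session addresses.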