From YouTube: OpenShift Commons Gathering Santa Clara 2019 Deep Learning Inference with Nvidia GPUs on OpenShift
Description
OpenShift Commons Gathering Santa Clara 2019
Deep Learning Inference with Nvidia GPUs on OpenShift
Production Deep Learning Inference on Nvidia GPUs
Tripti Singhal (NVIDIA)
Peter MacKinnon (Red Hat)
Tripti Singhal (NVIDIA): Here's a quick agenda of what I'll be talking about today. First I'll give a high-level overview of deep learning, focusing more on the inference side, and then I'll dive a little deeper into the NVIDIA TensorRT Inference Server, which is the product I focus on, and the features that go into it: the overall architecture, the ecosystem, a quick performance slide, and then the demo. After that I'll hand it over to Pete to talk about the inference server on OpenShift. But before I get started...
So, if you're not aware, deep learning is the technique of using massive amounts of data to train neural networks so they can make human-like decisions. You start off with an untrained model and run many iterations of a large dataset through that neural network, and the outcome is a trained neural network that can now make human-like decisions.
Most of this talk is about taking that trained neural network and deploying it into the real world to make decisions on data it hasn't seen before. Training uses the same dataset over and over to learn from, while inference runs on new data. There's also this idea of PLASTER, which stands for programmability, latency, accuracy, size of network, throughput, efficiency, and rate of learning.
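To make the training-versus-inference distinction above concrete, here is a minimal PyTorch sketch (my own toy example, not from the talk): training loops over a labeled dataset and updates weights, while inference is a forward pass over data the network has not seen, with gradients turned off.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training: many iterations over a (toy) labeled dataset, updating the weights each step.
for features, labels in [(torch.randn(8, 16), torch.randint(0, 4, (8,)))] * 100:
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()

# Inference: the trained network is deployed and runs forward passes on new data only.
model.eval()
with torch.no_grad():
    prediction = model(torch.randn(1, 16)).argmax(dim=1)
```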
This goes some way toward explaining why GPUs are necessary for the inference side of deep learning. As you can see, latency is not the only factor in a successful inference deployment; there are things like accuracy as well. If you think about use cases like autonomous vehicles, you not only need your decisions to be made fast, but also to be highly accurate, and using GPUs for inference delivers on all of these factors. So here are some of the pain points we typically see from end users of these deep learning inference deployments.
The main problem here is inefficiency. The first pain point is only being able to run a single model on a single GPU at a time. If you have a use case like the one on the far left, with an ASR model, a natural language processing model, and a recommendation model all running in your data center, each with a dedicated GPU, and the ASR model's traffic spikes, the rest of the GPUs are left underutilized, which is highly inefficient.
A
Another
another
pain
point
is
having
only
a
single
framework
support
and
this
restricts
teams
to
only
use
one
framework.
So
if
there
are
several
teams
and
they
want
to
develop
and
PI
torch
print,
tensorflow
and
Caffe,
they
would
all
have
to
build
out
their
own
custom
pipeline,
which
is
not
efficient
as
well
and
then,
similarly
with
custom
development,
if
you
have
several
teams
each
doing
their
own
pipeline,
one
for
each
use
case
that
they're
deploying
if
every
single
team
builds
out
their
own
their
own
custom
pipeline.
A
That's
not
efficient,
because
they're
really
they're
all
really
doing
the
same,
underlying
task,
which
is
inference
and
so
having
one
cusp.
One
solution
that
manages
all
these
pipelines
makes
that
makes
that
person
easier.
So
this
is
just
a
quick
high-level
overview
of
the
ten-thirty
inference
server
and
where
it
sits
in
the
ecosystem.
A
So
here
on
the
left,
you
see
the
clients
sending
their
requests
into
some
cloud
application
or
several
applications,
and
it
might
be,
and
then
those
requests
are
sent
to
a
load
balancer
where
those
requests
are
then
tracked,
the
traffic
is
sent
to
the
appropriate
instance
of
the
inference
servers.
So
you
may
have
several
instances
of
this
inference.
Server
and
all
underlying
hardware
is
visible,
including
heterogeneous
GPUs,
which
is
why
we
have
Tesla
t4v,
100
and
P
4
listed
here,
and
this
is
helpful.
A
So
the
inference
server
has
several
features
and
just
to
point
out
a
few
here
in
green
I
mentioned,
have
being
able
to
run
multiple
models.
Concurrently
on
a
single
GPU,
you
can
run
multiple
models
and
multiple
versions
of
the
same
model
at
the
same
time
on
a
single
GPU,
and
this
is
what
really
lets
you
utilize
your
GPU
to
its
maximum
capacity
and
then
another
feature.
Useful
feature
is
dynamic,
batching,
and
so,
when
client
requests
come
in,
they
come
in
without
any
logic
they
come
in
at
batch
size,
one
and
being
able
to
the
inference.
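As a rough illustration of that dynamic batching idea, here is a small pure-Python sketch: individual batch-size-1 requests are collected up to a preferred batch size or a short timeout before one execution runs. This is a conceptual model only, not the inference server's actual scheduler.

```python
import queue
import time

def dynamic_batcher(request_queue, preferred_batch_size=8, max_wait_s=0.005):
    """Group individual batch-size-1 requests into one batch for a single execution."""
    batch = [request_queue.get()]                 # block until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < preferred_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch                                  # hand the whole batch to the framework backend

requests_in = queue.Queue()
for i in range(5):
    requests_in.put(f"request-{i}")
print(dynamic_batcher(requests_in))               # e.g. ['request-0', ..., 'request-4']
```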
A
The
format's
supported
our
tensor
flow
graph,
def
and
save
save
model.
We
also
have
this
tensor
flow
tensor,
RT
integration,
that's
also
supported,
of
course,
temps
or
RT
plans
and
then
cafe
to
net
dev
through
the
onyx
path,
and
then
the
newest
feature
is
that
we'll
be
announcing
soon
is
the
streaming
API,
and
this
allows
for
support
for
sequence,
models
that
have
input
that
have
state
associated
with
it.
So
use
cases
like
speech,
recognition
and
translation
are
also
supported
now,
so
this
dives
a
little
bit
deeper
into
the
inference
server
architecture,
the
internal
architecture.
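As one hedged example of how a trained model ends up in a deployable interchange format, the sketch below exports a toy PyTorch model to ONNX with torch.onnx.export; the model and file name are placeholders, and this is just one possible route into the ONNX path mentioned above.

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained network; a real deployment would export the actual trained model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model.eval()

dummy_input = torch.randn(1, 16)                  # example input that fixes the exported graph's shapes
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["scores"])
```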
A
So
this
green
box
here
represents
the
inference
server
and
at
the
top
you
see
the
client
request
come
through
HTTP
or
G
RPC,
and
so
there's
also
this
Python
and
C++
client
library
that
that
helps
with
this
interaction
between
client
and
server,
and
so
once
client
requests
come
in.
They
go
through
request
and
response
handling,
and
here
you
can
see
it's
a
simple
image.
Here you can see a simple image classification use case, similar to the example I showed before for classifying these images. The request goes through the per-model scheduling queues: for example, if the request needed a ResNet-50 model, it would go to the ResNet-50 queue; if it needed an Inception model, it would go to that queue, and so forth. From there it goes to the framework backend: if the model was in TensorFlow, it would go to the TensorFlow backend; if it was a TensorRT model, it would go to the TensorRT backend, and so on.
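The routing just described, one scheduling queue per model feeding the matching framework backend, can be pictured with a small Python sketch; the model and backend names here are illustrative, not the server's internals.

```python
from collections import defaultdict

# Which framework backend executes each model (illustrative mapping only).
BACKEND_FOR_MODEL = {"resnet50": "tensorrt", "inception_v3": "tensorflow"}

model_queues = defaultdict(list)                  # one scheduling queue per model

def enqueue(request):
    model_queues[request["model"]].append(request)    # a request lands in its model's queue

def dispatch():
    for model, pending in model_queues.items():
        backend = BACKEND_FOR_MODEL[model]
        for request in pending:
            print(f"request {request['id']} -> {backend} backend ({model})")
        pending.clear()                           # results then flow back to the client

enqueue({"id": 1, "model": "resnet50"})
enqueue({"id": 2, "model": "inception_v3"})
dispatch()
```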
From there, the result is sent back through response handling to the client. A few things to notice: metrics are exposed through a separate HTTP endpoint, and at the top you'll see the model repository. This is where models are placed after training is done, already in a format the inference server can load, like the ones I mentioned: GraphDef, SavedModel, NetDef, and TensorRT plans.
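For reference, a model repository for the server of that era looked roughly like the layout below, with one directory per model, a config.pbtxt, and numbered version subdirectories; the model names are placeholders, so check the inference server documentation for the exact file name each format expects.

```
model_repository/
  resnet50_graphdef/
    config.pbtxt          # model configuration: platform, inputs/outputs, batching
    1/
      model.graphdef      # version 1 of the TensorFlow GraphDef model
  sample_plan_model/
    config.pbtxt
    1/
      model.plan          # a TensorRT plan stored under the same kind of layout
```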
It's a bit complicated, but what's important here is that packets belonging to the same sequence need to go to the same batch slot for execution, and this sequence batcher is what has been built out recently in the inference server. Up until now we've been taking a zoomed-in view of what the inference server has internally; now we'll take a step back and see how it fits into the larger ecosystem.
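A tiny sketch of that constraint, again my own illustration rather than the server's implementation: requests carrying the same sequence ID keep mapping to the same batch slot, so the backend sees each sequence's state in order.

```python
class SequenceBatcher:
    def __init__(self, num_slots=4):
        self.num_slots = num_slots
        self.slot_for_sequence = {}               # sequence id -> batch slot

    def assign_slot(self, sequence_id):
        """Return the batch slot for this sequence, reserving a free one on first sight."""
        if sequence_id not in self.slot_for_sequence:
            free = set(range(self.num_slots)) - set(self.slot_for_sequence.values())
            if not free:
                raise RuntimeError("no free batch slots; the request would be queued")
            self.slot_for_sequence[sequence_id] = min(free)
        return self.slot_for_sequence[sequence_id]

batcher = SequenceBatcher()
for packet in ["seq-a", "seq-b", "seq-a", "seq-a"]:
    print(packet, "-> slot", batcher.assign_slot(packet))
```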
Here on the right, the green boxes are the TensorRT Inference Server, and you can have several instances of it running in your data center. On the left you'll see the user's client requests coming in, and then this box with three components: the client API, pre-processing, and post-processing. This is what we like to call the app, whatever it may be.
For the image classification use case I've been using, this is where an image comes in and maybe needs to be resized or cropped so it can be fed into the neural network that's needed for the result. The request then goes to a load balancer that directs the traffic.
At the top, I also mentioned that you may do the training of all your models and put them in some network file storage. A subset of those models, the ones you want to deploy, is then separated out into this persistent volume, which is called the model repository, and that's mounted into the TensorRT Inference Server. From there, the server takes the requests, performs the inference, and sends the results back, and while it's doing all of that, it's also exposing Prometheus metrics over HTTP, which can be connected to an autoscaler for scaling up and down.
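As a hedged sketch of how those metrics might be consumed, the snippet below scrapes the Prometheus endpoint; the port (8002), the /metrics path, and the metric names follow the TensorRT Inference Server documentation of that time, so treat them as assumptions to verify against your deployment.

```python
import requests

# Hypothetical service host; replace with the address of your inference server service.
METRICS_URL = "http://trt-inference-server.example.internal:8002/metrics"

text = requests.get(METRICS_URL, timeout=5).text
for line in text.splitlines():
    # Gauges like GPU utilization and request counts are what an autoscaler would key on.
    if line.startswith(("nv_gpu_utilization", "nv_inference_request_success")):
        print(line)
```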
You'll also notice this outlined box; that's where we've had our collaboration with Kubeflow. The TensorRT Inference Server is now supported in Kubeflow, and the goal is to extend this box to include the load balancer as well, so that's also supported. And here's a quick performance slide.
The key takeaway is that these are several deployments of ResNet-50: TensorFlow FP32 on CPU, which is this blue line; TensorFlow with the same model on GPU; and then the TensorRT plan, which is a highly optimized version of the same model built through the TensorRT framework. You'll notice there's a significant increase in throughput, and this is all under a 50-millisecond SLA. But the more important thing to take away is that the inference server supports CPU, GPU, and multiple model frameworks, like TensorFlow and TensorRT.
This is our flowers demo to show off the inference server. At the top you'll see the flowers client, which is running classifications on these flower images; the flashing light indicates how many images are having inference performed on them right now. At the bottom left you'll see demand and delivered: there are 800 images per second of demand, and it's being met.
Also at the bottom you'll see eight GPUs, and the different colors at the bottom indicate different models; the blue indicates this flowers model. Right now the utilization is fairly low, both across all the small dials, which show per-GPU utilization, and on the bigger dial, which shows overall GPU utilization. So that's what you're looking at, and I'll skip forward a little bit just to show some changes.
What you'll see soon is that we'll increase the demand on the same cluster. We've increased the demand to a thousand images per second now, and all the traffic is being sent to this cluster, which has one GPU supporting each model. You'll notice the spike over here on the chart, where requests for the flowers model have spiked to 5,000, and the two GPUs supporting that model have maxed out their GPU utilization, while the other GPUs, supporting the other models, aren't being utilized at all.
Obviously this is not efficient, and we're not even meeting the demand of 5,000 images per second here. If I skip forward a bit, a typical solution would be to add more hardware to support that model, but you're still left with underutilized GPUs. So what has been done instead is that the traffic is now sent to a different cluster, at the bottom right of the screen.
You'll see that this is a different cluster (they're all Kubernetes clusters, by the way), and each GPU has every single model loaded onto it, hence the multiple colors inside the boxes for each GPU. Now you'll see that it's easily meeting the 5,000 images per second of demand on this cluster, which has the TensorRT Inference Server running on it. All of the GPUs are at about 30 to 40 percent utilization, the overall GPU utilization is around the same, and we're easily meeting that demand.
So that's pretty much it for the demo. I have some resources here, including our data center inference page, where you can learn more about our deep learning inference products, such as TensorRT and the inference server. The TensorRT Inference Server is available as a container on NGC, the NVIDIA GPU Cloud registry, and it's also open source, like I mentioned; the GitHub repo is right there as well, along with material for getting started.
Peter MacKinnon (Red Hat): Obviously the star of the show is the TensorRT Inference Server, and this is a nice, quick little demo that hopefully illustrates how it can be deployed on OpenShift. The structure of the demo is that we have two pods, one for the inference server and another for an inference client, plus an OpenShift service defined for the server.
That's the communication path from the client to the server, and then there's a PVC and PV making use of local storage, where we store some pre-built models for ResNet and Inception. So architecturally, that's what the demo looks like. Again, the communication path, in what we call the pod layer, runs from the client to the service endpoint for the inference server, and the inference server pod is attached to the storage layer where it has the models. Sorry about the "NVIDIA inference server" label on that slide.
It should say TensorRT Inference Server; the label is just a shorthand for this demo, that's my bad. So that's the idea there. In terms of images, the server image is the one created by NVIDIA, and the client image used in this demo is built from the publicly available Dockerfile on the NVIDIA GitHub repo. Then they have a set of models that we can load into the inference server.
It's just a simple set, ResNet-50 and Inception v3, and we run the demo on an NVIDIA V100 16 GB card. This is all using OpenShift 3.11. All right, let's look at our project; again, this "trt" shorthand isn't official nomenclature. We have basically our application here and a pair of pods. Let's have a look at the inference server pod. The logs for it indicate a successful start of the inference server in that pod, and it's actually exposing a couple of endpoints: one for HTTP/REST and the other one for gRPC.
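One simple way to confirm that start-up from outside the pod is to hit the server's health endpoint; the sketch below assumes the HTTP API path and port documented for that generation of the server, so verify both against your server version.

```python
import requests

def server_ready(host="trt-inference-server", port=8000):
    """Return True if the inference server reports itself ready to serve models."""
    try:
        # /api/health/ready is the readiness path documented for this server generation (assumption).
        return requests.get(f"http://{host}:{port}/api/health/ready", timeout=2).status_code == 200
    except requests.RequestException:
        return False

print("server ready:", server_ready())
```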
As Tripti mentioned, those two capabilities are in place, and we expose them through OpenShift. From the logging we can tell that it will basically sit there and poll for models in the storage location; we dropped some models in there, it picked them up, and it went through its setup for serving inference on those models. Okay, so that's the server itself, and that's its service. Let's go back to the pods, go to the client, open a terminal here, and see if it remembers any of the commands.
So let's run it. The image client here is a pre-built C++ client within this client pod. We're going to work against the REST service and run inference on a couple of images. In this case it's a mug, so let's see how it does: coffee mug. [inaudible]
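For reference, an invocation of that example client would look roughly like the sketch below (wrapped in Python here only for consistency with the other snippets); the binary path, flags, model name, and image path are taken from the public example-client documentation of that era and should be treated as assumptions to double-check against your client image.

```python
import subprocess

# Illustrative invocation of the pre-built C++ image_client against the REST endpoint.
subprocess.run(
    ["/workspace/install/bin/image_client",   # assumed install path inside the client image
     "-u", "trt-inference-server:8000",       # assumed service name and HTTP port
     "-m", "resnet50_netdef",                 # assumed model name in the repository
     "-s", "INCEPTION",                       # input scaling / pre-processing mode
     "-c", "3",                               # report the top-3 classifications
     "/workspace/images/mug.jpg"],            # assumed path to the sample mug image
    check=True,
)
```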
And it's quite a bit faster; I'm going to have to scale that back, sorry. Then finally there's a benchmark test. There are various different modes with the clients, so you can do batch inference and things like that. We'll finish off the demo with a performance benchmark, again provided by NVIDIA for us.
Yeah, so we ran the benchmark there, and that's pretty much it. You can also see it through the console. Basically, there's the interaction between the inference server and the models that have been stored for it (that storage could of course be Ceph; for the purposes of the demo it's just local storage), and then the interaction between components in OpenShift and Kubernetes, for which we've put together a client. So that's pretty much it for the demo. Thanks.