From YouTube: Deep Learning on OpenShift with GPUs | Tripti Singhal (NVIDIA) | OpenShift Commons Gathering Seattle 2018
Description
Deep Learning on OpenShift with GPUs | Tripti Singhal (NVIDIA) | Tushar Katarki (Red Hat)
at OpenShift Commons Gathering Seattle 2018
https://commons.openshift.org/gatherings/Seattle_2018.html

Tripti Singhal (NVIDIA): All right, thanks so much. OK, so this is just a quick agenda. I'll start off with a brief overview of what deep learning is, focus a little bit more on the inference side, and then jump right into the NVIDIA TensorRT Inference Server, which was announced in September, so it's fairly new. I'll go into the features, the internal architecture, and where it fits into the larger inference ecosystem. I have one quick performance slide, then I'll jump into a demo, and then I'll pass it back to Tushar to talk about OpenShift and Kubernetes.

So deep learning, at a high level, is the idea of using large amounts of data to train neural networks, teaching these neural networks how to make human-like decisions. It's typically broken down into training and inference, and inference is what I'll be focusing on mostly today. Training is using large amounts of data to teach these neural networks how to make those human-like decisions, and inference is taking that trained model, which has been iterated over with that data several times, deploying it into the real world, and giving it new data to make new decisions and new predictions. So that's deep learning at a high level. Like I said, I'll be focusing more on the inference side and on the TensorRT Inference Server.

Focusing more on inference and why GPUs are necessary, there's this idea of PLASTER, which stands for programmability, low latency, accuracy, size of the network, throughput, efficiency, and the rate of learning.

Also, solutions today typically only offer support for one framework, and that really restricts the internal teams working on developing these AI models. It restricts them to that one framework: some teams work in PyTorch, some teams work in TensorFlow, and there isn't one solution that serves them all.

Excuse me. This is just a high-level overview of where it fits into the larger ecosystem. On the left you'll see the clients sending requests to some sort of cloud application running in the data center, and from there those requests are sent to a load balancer, which directs traffic to the appropriate instance of the TensorRT Inference Server.

So here are some of the current features that the TensorRT Inference Server has to offer, and just to single out a few, we can roughly separate them into performance features and usability features. For performance, there's a feature called concurrent model execution, which is what allows you to run multiple models, or multiple instances of the same model, on one GPU at the same time. This is how you're really going to maximize the utilization of your GPU and get the most capacity out of it. With dynamic batching, you're able to batch up your inference requests inside the inference server, based on a user-defined latency SLA, rather than having to build that logic outside of the inference server. Now for more of the usability features: the TensorRT Inference Server exposes metrics for utilization, count, and latency to enable autoscaling, and we support multiple model frameworks, such as Facebook's Caffe2 and TensorFlow.

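To make those two performance features concrete, here is a minimal sketch of a per-model configuration from the inference server's model repository, assuming the config.pbtxt format the server reads; the model name, tensor names, shapes, and batch sizes below are made-up examples, and exact fields and defaults should be checked against the release you deploy:

# config.pbtxt (illustrative): an image-classification model served by the TensorFlow backend
name: "flowers_classifier"            # hypothetical model name
platform: "tensorflow_graphdef"
max_batch_size: 32
input [
  {
    name: "input"                     # tensor names depend on the exported model
    data_type: TYPE_FP32
    dims: [ 224, 224, 3 ]
  }
]
output [
  {
    name: "probabilities"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
# Concurrent model execution: run two instances of this model on each GPU.
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
# Dynamic batching: let the server group incoming requests into larger batches,
# waiting only a bounded time so the latency SLA still holds.
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}

With an instance group like this, several copies of a model (or several different models) can execute on the same GPU at once, which is what drives the utilization gains shown in the demo later.
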
So in this case it's an image classification model. Based on that, and on the request itself, the request will go to the scheduler queue for the particular model that's needed to execute that inference, and from there it's sent to the framework backend that actually does the inference compute. If that image classification model was written in TensorFlow, it will go to the TensorFlow backend, and from there the result is sent back up through the response handling and back to the client.

Okay, so what we just saw was kind of a zoomed-in view of what is inside the inference server, and now we're taking a step back, zooming out, and looking at where it fits into the larger inference ecosystem. You'll notice that at the far right is the inference server (I guess I can move my mouse here), so that's what I just showed you, the zoomed-in portion, and now, starting on the left-hand side, the user, in this...

B
When
those
kind
of
go
up.
It's
a
good
indicator
that
it's
time
to
spin
up
a
new
instance
and
then
you'll
also
notice
a
dotted
line
around
the
around
that
portion,
and
that
shows
our
collaboration
with
cube
flow
to
support
the
tensor
RT
inference
server,
and
so
there's
a
detailed
blog
describing
that
collaboration
and
all
the
code
is
available
on
github.
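Those utilization, count, and latency metrics are what an autoscaler can key off. As a rough sketch, assuming the server's Prometheus metrics are scraped and surfaced to Kubernetes through a custom-metrics adapter (the metric name, target value, and replica counts below are illustrative, not something prescribed in the talk):

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: trt-inference-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: trt-inference-server        # the Deployment running the inference server
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      # Assumed to be exported by the inference server and exposed through a
      # Prometheus adapter; substitute whichever utilization or latency metric
      # you actually expose, and scale the target to how that metric is reported.
      metricName: nv_gpu_utilization
      targetAverageValue: "80"
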
Okay, so the chart on the right shows the performance gains you get when using the TensorRT Inference Server across three separate deployments of ResNet-50, which is an image classification model; it's the typical one used for performance benchmarking. There's TensorFlow FP32 on CPU and on GPU, and then you'll obviously see the most performance gain when using the NVIDIA TensorRT version of ResNet-50. And the main takeaways...

So this is actually a video of our TensorRT Inference Server flowers demo, and there's a live version of this that we'll be running at our booth, so feel free to check it out. Let me just describe what you're seeing here: at the top is our flowers client, and all those little images are images of flowers being classified, whether it's a daisy or a rose and so on, and the flashing bar going down indicates that classification.

So what just happened was the demand increased: we increased the demand to five thousand images per second, and keep in mind that we're still directing traffic to the manual, non-TensorRT-Inference-Server cluster. You'll notice that the two GPUs that are running this flowers model are completely maxed out, while the other models, whether it's a deep recommender or anything like that, those GPUs remain underutilized.

B
You
see
the
spike
on
the
chart
happened
there
and
the
average
the
average
GP
utilization
is
around
38,
and
so
a
typical
solution
here
would
be
to
increase
the
hardware
and
just
add
more
GPUs
to
support
this.
This
flowers
model,
but
you're
still
left
with
underutilized
hardware
in
your
data
center,
which
is
really
inefficient
and
you'll
also
notice
in
the
images
per
second
on
the
bottom
left
corner
that
we're
not
meeting
the
5,000
images
per
second
demand
we're
only
getting
around
4,800,
so
that's
not
ideal
in
a
production
workflow.
So soon what will happen is we'll stop the traffic going to the non-TensorRT-Inference-Server cluster and move all that traffic to the cluster with the TensorRT Inference Server enabled, and soon enough you'll see it drop on the left and peak on the right. There it goes.

B
At
the
bottom,
all
eight
GPUs
have
all
four
models
loaded
on
to
it.
So
when
this
peak
happens,
you'll
notice
that
beforehand
the
GPU
utilization
was
around,
maybe
17%,
I.
Think
and
now
it's
back
it's
up
to
39
or
40
percent
similar
to
the
manual
to
the
manual
deployment.
But
this
one
you
get
the
same
average
GPU
utilization.
B
But
in
this
case
all
your
hardware
is
being
utilized
and
you
can
also
see
that
we're
easily
meeting
the
demand
of
5,000
images
per
second,
with
plenty
of
capacity
to
even
spin
up
a
new
workload
or
and
in
or
increase
the
demand.
And
so
another
thing
to
make
it
more
realistic
is
that
we
show
that
it
can
also,
with
the
10:30
inference
server,
enabled
and
having
your
models
evenly
distributed
across
your
hardware.
So you can see that happen, and none of them are really being maxed out yet. The fact that they're distributed across all eight GPUs allows for more capacity to be able to handle that spiky workload, and in a little bit you'll see that, with this configuration, we're actually able to get to 18,000 images per second. If you notice the grey box there: 18,000 images per second, where we've completely maxed out our GPU utilization, and this is using the TensorRT Inference Server.

Tushar Katarki (Red Hat): There you go. All right, so, as I said, what can we do with this beautiful demo and the TensorRT Inference Server? What is the road ahead for this on OpenShift?

Deep learning on OpenShift is what I'm going to describe next. We'll start with that basic TensorRT Inference Server that we saw the demo for, that Tripti described a little earlier, and then we'll say: oh, that's actually a bunch of containers, and the containers need a container platform, and we've talked all day about that today.

So we've got OpenShift Container Platform, which is basically Kubernetes. It can run across the data center and the cloud, and, by the way, it supports GPUs, NVIDIA GPUs, and therefore you can run it either in the data center or in the cloud, or a combination of those two.

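To give a feel for what that looks like in practice, here is a minimal sketch of running the inference server container on OpenShift with a GPU; the NGC image tag, command-line flag, model-repository volume, and names are assumptions for illustration rather than a reference deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: trt-inference-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: trt-inference-server
  template:
    metadata:
      labels:
        app: trt-inference-server
    spec:
      containers:
      - name: trtserver
        # NGC image name and tag are assumptions for illustration
        image: nvcr.io/nvidia/tensorrtserver:18.09-py3
        command: ["trtserver", "--model-store=/models"]   # flag name may differ across releases
        ports:
        - containerPort: 8000    # HTTP
        - containerPort: 8001    # gRPC
        - containerPort: 8002    # metrics
        resources:
          limits:
            nvidia.com/gpu: 1    # schedule onto a node exposing an NVIDIA GPU via the device plugin
        volumeMounts:
        - name: model-repository
          mountPath: /models
      volumes:
      - name: model-repository
        persistentVolumeClaim:
          claimName: model-repository   # hypothetical PVC holding the model repository
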
But now you have the TensorRT Inference Server, and you actually need to use it. Maybe, as Tripti described, it's part of a recommendation system, or you have a chat bot which is using natural language processing. You have an app that you want to create, and you want to deploy it in production. So you have your cloud-native, intelligent app, and that's running on the TensorRT Inference Server, and, by the way, it needs things such as load balancing, it needs things such as routing, it needs encryption.

Then we have the OpenShift Service Mesh that was talked about earlier. So this is kind of your production setup: now you have a cloud-native, intelligent app, which is actually using the TensorRT Inference Server, and the underlying infrastructure provided by OpenShift for this to be deployed and shown in production. So now, okay, but where did these models come from, right? Somebody has to actually create these models, so you can use OpenShift as a platform to train your models.

Your data scientists can do that by bringing their own, or, actually, NVIDIA has a bunch of pre-built framework containers, such as for TensorFlow and so on, that they call NVIDIA NGC, the GPU Cloud containers, and you can run them on top of OpenShift. So there your data scientists have access to the best of the best frameworks to create the models to feed into this TensorRT Inference Server.

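For instance, a training run using one of those NGC framework containers can be submitted as an ordinary Kubernetes Job on OpenShift that requests GPUs; the image tag, script path, and GPU count here are illustrative assumptions:

apiVersion: batch/v1
kind: Job
metadata:
  name: flowers-training
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: train
        # NGC TensorFlow container; the tag is an assumption for illustration
        image: nvcr.io/nvidia/tensorflow:18.09-py3
        command: ["python", "/workspace/train.py"]   # hypothetical training script
        resources:
          limits:
            nvidia.com/gpu: 2    # GPUs requested through the device plugin
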
Okay, so what's next? As you saw earlier, you probably need to do some pre-processing of the data, some post-processing of the data; you need to set up the data pipelines. Your data scientists might want to, you know, visualize the data that is coming in using Jupyter, something that they might be used to. They might want to use Kafka as a message bus, and maybe use Spark for real-time processing. So all those patterns are available on OpenShift.

I won't go into the details of that, but I'll describe some references at the end, and all these different, quote-unquote, frameworks can run on top of OpenShift. So now what do you have? You have your inference server, you have your models, you have set up your data pipelines on OpenShift. Now what do you want to do? You also want to actually write that cloud-native application, and you want your developers to do that.

So that gives you an idea of how OpenShift can be used in this entire ecosystem, from end to end. Some of the things that have happened so far are things such as Device Manager support, which enables the support of GPUs; that has been available, I'd say, for about a couple of releases already. We have introduced other features with the community, such as priority and preemption, which will become essential if you are trying to do things such as model training using Jobs, for example.

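As a small sketch of how priority and preemption come into play for those training jobs (the class name and value are illustrative, and the PriorityClass API group version differs across Kubernetes releases), you define a PriorityClass and reference it from the training pod, so lower-priority work can be preempted when GPUs are scarce:

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: gpu-training-high          # hypothetical class name
value: 100000                      # higher value = scheduled first and able to preempt lower-priority pods
globalDefault: false
description: "High priority for GPU model-training jobs"

# Referenced from the training Job's pod template with:
#   priorityClassName: gpu-training-high
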
So, recently, back in October, we announced jointly with NVIDIA support for Red Hat Enterprise Linux, which is now certified on the DGX-1 and Tesla GPUs; the DGX-1 is basically the GPU appliance from NVIDIA, and that support is available now on OpenShift and RHEL. And then here we're showing you a preview of the TensorRT Inference Server on OpenShift. We are planning to write a reference architecture, and obviously the road ahead is really to make this a much easier deployment

and install experience, with Operators. We have other exciting stuff that we're going to talk about at KubeCon over the next couple of days with the community, doing things such as GPU sharing and heterogeneous clusters, for example. I mean, this picture on the left here looks nice, but for some of these things, like truly heterogeneous clusters, there is still work that needs to be done in the upstream community. GPU topology is another thing that we are working on.

So if you step back now: how does Red Hat see AI? I think of the four pillars you see here. Running AI as a workload on top of OpenShift, and our ecosystem, is obviously the first pillar, and you saw a little bit of that today. On the rightmost is how you build intelligent applications; I touched a little bit on that also.

So, if you're going to build an intelligent application which uses AI, how do you build that? And then there are the two in the middle. The first one really is: how do we continue to enhance our core business using open source tools and AI? And the second one really is: how do we enhance our products themselves using AI? This is what Chris Wright was pointing out earlier about automation and self-driving products and self-driving components.

Oh, and I missed the most important piece: data as the foundation. So that's kind of how Red Hat sees this, and you'll hear some of this throughout the discussion over the next few days. So here are some references and KubeCon highlights to round this up. NVIDIA is going to present, basically, scaling AI inference workloads, what you saw today, at a much larger breakout session; that's tomorrow, Tuesday, December 11, I believe at 1:45 p.m.

That talks about how to use the TensorRT Inference Server. The second one is using Kubeflow, which Tripti mentioned: how to use Kubeflow to deploy the TensorRT Inference Server on Kubernetes. And then the last one is actually on how to basically enable GPUs on OpenShift. So these are some of the resources that you have. There's also a webinar coming up

that talks about how to maximize GPU utilization. We have booth setups at both the NVIDIA and the Red Hat booths, so come and check it out, and, as we mentioned, there's a live demo going on there. Then I want to also mention that the TensorRT Inference Server is open source; it's available on GitHub, and you'll find a link there. And then there's the OpenShift Commons Machine Learning SIG; that's something where we as a community, from OpenShift and NVIDIA and others,

get together and talk about machine learning. And then we have Open Data Hub, which covers some of the four pillars that I talked about; this is a Red Hat CTO office initiative, and they are trying to build the data hub that I described earlier, and the four pillars, in an open way. So there are plenty of resources and lots of excitement, and I hope you guys can participate in that. And with that, I think, thank you.
