Cloud Native Computing Foundation KubeCon + CloudNativeCon North America 2018 (Seattle), 16 Dec 2018

From YouTube: Scaling AI Inference Workloads with GPUs and Kubernetes - Renaud Gaubert & Ryan Olson, NVIDIA

Description

Scaling AI Inference Workloads with GPUs and Kubernetes - Renaud Gaubert & Ryan Olson, NVIDIA

Deep Learning (DL) is a computationally intense form of machine learning that has revolutionized many fields, including computer vision, automated speech recognition, natural language processing and artificial intelligence (AI). DL impacts every vertical market from automotive to healthcare to cloud. As a result, the training and deployment of Deep Neural Networks (DNNs) has shifted datacenter workloads from traditional CPUs to AI-specific accelerators like NVIDIA GPUs. Leveraging several popular CNCF projects such as Prometheus, Envoy, and gRPC, we will demonstrate an implementation of NVIDIA’s reference scale-out inference architecture, capable of delivering petaops per second of performance. This is a new and challenging problem in the datacenter, and we will discuss these challenges and ways to optimize for service delivery metrics (latency/throughput), cost, and redundancy.

To Learn More: https://sched.co/GrVq

A

Okay, hello everyone, welcome to our talk at KubeCon on scaling AI inference with Kubernetes and GPUs. So let's introduce ourselves first. My name is Renaud Gaubert. I've been working at NVIDIA for the past two years on containers, Kubernetes and open source.

B

And I'm Ryan Olson. I'm a solutions architect for deep learning, HPC and, I guess, cloud. We call Renaud "Mr. Kubernetes" because he's kind of our main guy. So I'm kind of a GPU user, a person who's developing applications for GPUs and then using Kubernetes to scale those out, while Renaud is actually working on Kubernetes directly.

A

Yeah. And we actually started interacting with the community two years ago, when I joined NVIDIA with the container team. A lot of the work that we've been doing in the Kubernetes ecosystem is making sure that GPUs are acknowledged as first-class citizens and not something that you hack your way into. A lot of effort has gone into the community, and a lot of interactions have been involved. For example, the last face-to-face, which was in March, actually happened at NVIDIA.

A

So today we're going to talk about scaling AI inference with Kubernetes and GPUs.

B

Sure. And so to do this, I suppose we'll give you some of the obligatory "why do we care", and then I want to break down the title of the talk, scaling AI inference with Kubernetes and GPUs, and really talk about what that overloaded term "scaling" means. Scaling with GPUs means something different than scaling with Kubernetes. Normally with Kubernetes we talk about scale-out. Scaling with GPUs means scaling up, or adding some new capability to your nodes that wasn't there before. We'll talk about that, and then about scaling the AI inference pipeline.

B

This is a very different pipeline than your traditional microservices. So what are the new parameters of that? How do you look for potential bottlenecks within that pipeline, and how do they fit into your infrastructure? And then, of course, we're going to go into actually scaling out with Kubernetes. So why do we care? AI, AI inference, deep learning, machine learning: it's everywhere in the news, it's kind of a big deal. It's touching every vertical. Here are the examples of video, speech and recommenders; they're used at a hyperscale level.

B

Everyone kind of knows about this kind of stuff, but it's also being used in other verticals, so I'd like to switch gears a little bit and talk about some examples that you might not have heard of. These are examples of how these new state-of-the-art methods handle things that have traditionally been extremely difficult for traditional machine learning to do. Now, with deep neural networks, the ability to perceive the world and to make predictions has fundamentally changed how we approach some aspects of our business.

B

So the first example here is medical imaging. This one hits pretty close to home. This is doing stroke analysis. 40% of radiology images that come in are classified as high priority, so if you're having a stroke and you need a quick evaluation, you're still in a queue with that 40% of high-priority emergency cases. The sooner you can get evaluated, the sooner they can actually deliver potential medicines, which could be good or bad for your stroke condition, so having an AI that can do this could potentially be life-saving in some of those scenarios.

B

The next two examples, in infrastructure and in industry, are examples of predictive maintenance. In the example of emergency pothole repair, it turns out, as with pretty much all predictive maintenance, that if you let the problem go too long, it costs way more to fix it than if you fix it early. So this is an example of using data and some deep neural networks to say: we have a limited budget for our road repairs, so where do we actually go and do that work before those potholes become a problem?

B

The idea is to try to predict where those emergency cases will be and correct them before they become an emergency. The same is true in industry. This is with GE, with these big gas turbines. When you take those big gas turbines down, you want to go in and do your repair specifically where you need to.

B

If you know a bearing is loose, or you know a fan blade needs to be replaced, you want to have all these sensors attached, and then you use these deep neural networks to predict when you take it down and, when you do take it down, where you focus your maintenance efforts. So it's really touching pretty much every industry, which is really exciting. And so now we're going to switch gears and talk about scaling with GPUs.

B

So these networks, over the last six to eight years, are just getting more and more complex. We throw more data at them, we throw more compute at them, and they get better and better and better. In this example, the big dot is basically the multiplicative effect of the compute times the memory bandwidth, and you can see that the compute complexity is getting greater over time. So how do we combat that? We combat that with our Tensor Core GPUs.

B

So if you want Tensor Cores, we have these Jenga sets in our booth. The Tensor Cores in our GPUs are specific ASICs designed for the key compute in deep neural networks, specifically matrix-matrix multiply and accumulate.

B

So these ASICs are added into the SM units. We have our traditional FP32, INT8, and FP64 in some cases, and now we add on Tensor Cores for deep learning and, in our Turing products, RTX for real-time ray tracing. These Tensor Cores provide an incredible amount of extra performance, and that helps combat that growing curve of compute needed to evaluate these models. What's pretty exciting is this small little GPU. This T4 GPU is 70 watts. You get 75 watts of power on your PCIe bus.

B

This is 70 watts, so it fits in that profile. You can drop a T4 into any server and get an incredible amount of extra compute capability on your device. This is what we call scaling up. I'm going to show you an example of that right now. This is ResNet-152 running on a traditional CPU. Every time you see a flash, that's an inference request going out to the server and coming back. So you can see that this 24-vCPU Skylake server is getting about three to four images per second.

B

Actually, this demo might not work so well if you guys are all on the network, because we're on the Wi-Fi, and this is running live. So if you see a pause, that means we're getting some network interference. This is when we switch over to the T4. This is the exact same model running on a part that has the same wattage, and you can now get 1,500 images per second.

B

This is why GPUs are so valuable for evaluating deep neural networks.
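
To put the two demo figures side by side, here is a small arithmetic sketch. The throughput numbers and the 70 W figure are the ones quoted in the talk; the images-per-joule line additionally assumes the T4 draws its full 70 W during the run, which the talk does not state.

```python
# Figures quoted in the talk for ResNet-152 inference.
CPU_IMAGES_PER_SEC = (3.0, 4.0)   # "about three to four images per second"
GPU_IMAGES_PER_SEC = 1500.0       # T4 figure from the demo
T4_WATTS = 70.0                   # board power quoted for the T4


def speedup(gpu_rate: float, cpu_rate: float) -> float:
    """Ratio of GPU throughput to CPU throughput."""
    return gpu_rate / cpu_rate


# The CPU rate is a range, so the speedup is a range as well.
low = speedup(GPU_IMAGES_PER_SEC, CPU_IMAGES_PER_SEC[1])
high = speedup(GPU_IMAGES_PER_SEC, CPU_IMAGES_PER_SEC[0])

# Assumption: the card draws its full rated power while serving.
images_per_joule = GPU_IMAGES_PER_SEC / T4_WATTS

print(f"speedup: {low:.0f}x to {high:.0f}x")
print(f"images per joule (assumed full power draw): {images_per_joule:.1f}")
```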

B

Next, the inference pipeline. The inference pipeline is different, maybe not very different but different enough, from your traditional microservice pipeline. I'm going to talk about some tools, so up front I'm going to give you some acronyms. If you're in the industry, you'll understand these; if not, maybe it gives you some places to go to learn more. The first tool I'm going to talk about is NGC, the NVIDIA GPU Cloud. This is a registry of Docker images. These work really well in Kubernetes land.

B

Everything that's below here is basically packaged within those containers. We have nvRPC, which is really just a simple wrapper around gRPC. It's there to help build compute-bound microservices, and compute-bound async microservices in particular, simplifying all the extra overhead and boilerplate code that goes into writing an async gRPC service. The TensorRT product is a library for optimizing inference. It's basically an inference-optimizing compiler, and it's one of the reasons why we can run so many images per second on such a little GPU.

B

The TensorRT Inference Server is the actual service. It wraps TensorRT plus TensorFlow plus Caffe2 and delivers them as a service, and this is one of our containers. And then there's Kubernetes for GPU orchestration. The cloud native projects that we're going to talk about are gRPC, Envoy for load balancing, Istio for service mesh and also for load balancing, Prometheus for metrics, and then Rook for the model store and intermediate results.

B

So what does this compute pipeline look like? In the example I showed you of images flashing, our input data is the image itself. This could also be speech: the input would just be a sentence, and it could go through a network that converts it to French, so that Renaud and I can have a good conversation.

B

But before it goes into the network, that input data has to get converted into some raw tensors, so there's this transformation step that has to happen. That transformation step happens either on the CPU or on the GPU, and we call it pre-processing. Once the data is converted to the input tensors of the network, it has to move into GPU memory.

B

So that's a transaction over the PCIe bus that you have to take into account. Then there's the compute itself, which is the evaluation of the deep neural network, usually using TensorRT or potentially a backend framework inside of your inference server. And then you have the same kind of reverse onion peel going back out: output tensors going back to host memory, and some post-processing to convert those output tensors into some consumable output.

B

So we might output a whole bunch of bounding boxes, but then we need to actually convert them to dimensions, like the XYZ coordinates of that bounding box within an image. If we want to get the best use and performance out of our GPUs, we want to make sure that we keep this pipeline full and that there's no particularly long pole in the tent.
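
The "long pole in the tent" idea can be sketched numerically: when the stages overlap, the slowest stage sets the throughput. The stage names follow the pipeline described in the talk; the per-batch timings and the batch size below are made-up illustrative values, not measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    seconds_per_batch: float


def bottleneck(stages: List[Stage]) -> Stage:
    """The slowest stage, i.e. the long pole in the tent."""
    return max(stages, key=lambda s: s.seconds_per_batch)


def pipelined_throughput(stages: List[Stage], batch_size: int) -> float:
    """Items per second when all stages overlap perfectly."""
    return batch_size / bottleneck(stages).seconds_per_batch


def serial_latency(stages: List[Stage]) -> float:
    """Time for one batch to traverse every stage in order."""
    return sum(s.seconds_per_batch for s in stages)


# Hypothetical timings for a batch of 8 images.
STAGES = [
    Stage("pre-process", 0.006),
    Stage("host-to-device copy", 0.001),
    Stage("compute", 0.004),
    Stage("device-to-host copy", 0.0005),
    Stage("post-process", 0.002),
]

print("bottleneck:", bottleneck(STAGES).name)
print("throughput:", round(pipelined_throughput(STAGES, batch_size=8)), "images/s")
print("latency:", serial_latency(STAGES), "s")
```

With these assumed numbers the GPU compute stage is not the limit; pre-processing is, which is the situation the speakers warn about.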

B

So we want to make sure that this pipeline is well balanced, both in data movement and in compute. What that means is we're integrating a lot of HPC best practices into data center workloads. As you saw from how these networks perform on CPU versus GPU, they have an incredibly different compute balance than what typical data center workloads have had in the past, and because of that they can also involve an incredible amount of data movement.

B

So balancing those two things out is critical for performance. So what are the bottlenecks? Moving data is always a bottleneck. Somehow we have to get it from our input to our compute, and there are a lot of different places where data can be moved and a lot of different conversions of data from one format to another that we have to account for. The big one is that input to input tensors, or output tensors back to output. Depending on your problem, it's extremely problem-specific.

B

So an image is almost your best-case scenario, whereas for language models the input might be incredibly small, but the output tensor could be megabytes or gigabytes. So you have to think to yourself: if I'm building completely discrete microservices and I convert a small sentence into a tensor of hundreds of megabytes, I don't want to move that over the network again, do I? Because all of a sudden I'm going to become network bound.
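
A back-of-the-envelope sketch of that sizing question follows. The tensor shape, the 100 MB intermediate tensor and the 10 Gb/s link are illustrative assumptions rather than numbers from the talk, and the transfer estimate ignores protocol overhead and latency.

```python
from typing import Sequence


def tensor_bytes(shape: Sequence[int], bytes_per_element: int = 4) -> int:
    """Size of a dense tensor; 4 bytes per element corresponds to FP32."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def transfer_seconds(num_bytes: int, gigabits_per_sec: float) -> float:
    """Idealized wire time, ignoring protocol overhead and latency."""
    return num_bytes * 8 / (gigabits_per_sec * 1e9)


# A 3x224x224 FP32 image tensor (a common ImageNet-style input shape).
image = tensor_bytes((3, 224, 224))

# Hypothetical 100 MB intermediate tensor from a language model.
big_tensor = 100_000_000

print(f"image tensor: {image} bytes")
print(f"image over 10 Gb/s: {transfer_seconds(image, 10) * 1e3:.3f} ms")
print(f"100 MB over 10 Gb/s: {transfer_seconds(big_tensor, 10) * 1e3:.1f} ms")
```

Under these assumptions the large tensor costs about 80 ms per request on the wire alone, which is why the speakers suggest keeping such tensors off the network.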

B

So thinking about how big these memory objects are and where you move them is going to be critical for success. Success here is really defined by making sure that you understand your pipeline and that you choose the right hardware and the right set of software. So we're going to start at compute and work our way back. Compute is the actual evaluation of the deep neural network. These are our options. Ideally, we use the TensorRT product.

B

This is a library that is designed to both optimize and compute a neural network. It gives you the best possible performance and the best possible memory footprint. You have really precise control over all of the memory buffers, as well as the precision that's being used, and it's an incredibly deployable package: it just packages up as a library, and you can ship it off in a C++ or Python application. But it also has the lowest DNN compatibility, so to combat that we integrate TensorRT into some of the frameworks.

B

So TensorFlow has it integrated, and just recently PyTorch has integrated TensorRT operations as well, which covers about 95 to 98 percent of the deep learning community. The pitfall of integrating with the framework is that now you have the extra framework overhead. Frameworks are generally designed for the training process, not necessarily the inference process, and who owns the memory is kind of a slight battle.

B

So it's a little bit less efficient, and generally you don't get quite as much throughput. And finally, the worst-case scenario is that the model doesn't convert to TensorRT at all, and you're just in the realm of using the framework to evaluate the inference model. So the preference goes from green to red.

B

So what TensorRT does as a product is evaluate a neural network and do a whole bunch of symbolic optimizations on the graph, as well as runtime evaluation. But what's really, really neat about it is that the same network evaluated on a small GPU like a T4 or a big GPU like a V100 will be tuned based on the different compute characteristics of the device: which Tensor Cores are available and how much memory bandwidth is available.

B

The TensorRT auto-tuning will choose the proper kernels for the device and make sure that you have the most optimal runtime. So it's really a pretty amazing product, and the result of that is these incredible gains in performance. We see 21x speedups in Deep Speech, 27x in ResNet-50, actually sometimes even better, and then 36x for NMT.

B

Next, if we step back from the compute, we have the pre- and post-processing, and this is where you really have to start thinking about it from a deployment perspective. As I alluded to before, you can do that transformation of the input to input tensors on CPU or GPU. Video decode is a great example of just doing everything on device.

B

H.264 video coming in can get decoded on device, go directly into GPU memory and be inferred directly, meaning that the CPU component of it is very small, which changes your deployment. If a huge amount of CPU compute is required to do that transformation, or the sizes are really big, then the location of it is very important. There are four primary locations where you can do it. You can do it in process: we open-sourced the TensorRT Inference Server.

B

You can build your entire pre- and post-processing pipeline directly into the TensorRT Inference Server. You just download the code, add your stuff in, and recompile it, and now you're in process. That's the fastest possible way to go from pre- and post-processing to compute. The problem is that your scaling is now coupled, which kind of breaks traditional microservices: to get more pre- and post-processing, you have to scale up your compute as well.

B

In a similar way, if you want to keep your logic decoupled and separate, you can do it in pod. You have one container in the pod that does pre- and post-processing, and another that runs the TensorRT Inference Server. That way, you don't have to touch the TensorRT Inference Server. You can build your logic external to it, but you can still use things like System V IPC, basically the shared namespaces of the pod, to get better performance on moving the data between the two. You might still have to make some modifications to the inference server to use shared memory in this example, but we're looking at adding that.

B

Next, your fallback would be in node, and this actually presents a few problems. You can use pod affinities to co-locate your pre- and post-processing containers on the same node as the thing that's actually doing the compute, but you might need some hacks to break down the namespace barrier. Using System V shared memory is an example.
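
The in-pod option can be sketched as a pod manifest with two containers. The sketch below builds it as a Python dictionary and prints it as JSON. The image names and the pod and container names are placeholders, and the memory-backed `emptyDir` mounted at `/dev/shm` is one common way to give both containers a shared-memory area; the talk itself only mentions System V IPC and shared namespaces, so treat the volume as an assumption.

```python
import json


def build_inference_pod(preprocess_image: str, server_image: str) -> dict:
    """Two-container pod: pre/post-processing next to the inference server."""
    # Both containers mount the same memory-backed volume.
    shared_mount = {"name": "shm", "mountPath": "/dev/shm"}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "inference-in-pod"},  # hypothetical name
        "spec": {
            "containers": [
                {
                    "name": "pre-post-processing",
                    "image": preprocess_image,
                    "volumeMounts": [shared_mount],
                },
                {
                    "name": "inference-server",
                    "image": server_image,
                    # GPU resource name exposed by the NVIDIA device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": 1}},
                    "volumeMounts": [shared_mount],
                },
            ],
            "volumes": [{"name": "shm", "emptyDir": {"medium": "Memory"}}],
        },
    }


# Placeholder image references, not real registry paths.
pod = build_inference_pod(
    "example.com/my-preprocess:latest",
    "example.com/my-inference-server:latest",
)
print(json.dumps(pod, indent=2))
```

Because both containers live in one pod, they are scheduled together on the same node and scale together, which is the trade-off against fully independent services that the speaker describes.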

B

There is no way in Kubernetes to say: I would like to use the same IPC namespace as this pod.

B

The only way to do it is to go down to the host, which really kind of breaks the whole containerization strategy, so I think working with the community on better namespace affinities would be pretty great. And finally, there's fully independent, like traditional microservices: pre- and post-processing puts all of its data on a network stack and sends it. That can work in some scenarios, but in other scenarios it would be prohibitively expensive. At the end of the day, you really have to think about data movement.

B

Data movement is absolutely the critical bottleneck. So those are the compute and the pre- and post-processing steps that you have to go through and benchmark, thinking about what it is that you're trying to deploy. Next is serving: this is how you actually serve it. We have this TensorRT Inference Server. We package all of this up not just as an application; it also lives in a container that you can download from NGC, at ngc.nvidia.com.

B

The advantage of this is that we do all the logic for you. We offer tunable concurrency: you can choose to optimize for latency or for throughput or some slider in between, and usually you can find the sweet spot that gets you the best throughput with only a minimal increase in latency. Some of the challenges around how to oversubscribe GPUs were alluded to in previous talks.

B

Because of those, we support multiple models within the same process. This is a big advantage, because it allows us to simplify the memory management on the GPU and to interleave and get good overlap between different models under different loads. You can actually see this example down in our booth: we're running four different models being perfectly interleaved on the same device. We essentially support all AI frameworks, either through conversion to ONNX or with TensorFlow directly.

B

We support TensorFlow, Caffe2 and TensorRT as backends, and ONNX gets us to the rest of the frameworks.

B

Now, if you don't integrate your pre- and post-processing steps into the inference server itself, you need to have some sort of service that does that. You can build that with your favorite microservice library. We have some examples here.

B

Actually, those examples will be published later, but we call them the middleman service and the batching service. We use this nvRPC wrapper around gRPC. You can find that in the inference server itself, and we'll publish some more examples of it in the future. One example is pre- and post-processing, where the data comes in to a service that basically acts as a middleman, and the middleman service then talks to the TensorRT Inference Server on your behalf. The other is the batching service, which collects low-batch requests and batches them before piping them through a load balancer to a backend.

B

So those are some examples.
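
The core of a batching service can be sketched in a few lines. This is not NVIDIA's implementation, which had not been published at the time of the talk; it is a minimal size-based batcher with hypothetical names. A real service would typically also flush on a timeout so that a partial batch does not wait indefinitely.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_requests(requests: Iterable[T], max_batch_size: int) -> Iterator[List[T]]:
    """Group incoming requests into batches of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batch: List[T] = []
    for request in requests:
        batch.append(request)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    # Flush whatever is left once the input is exhausted.
    if batch:
        yield batch


# Five single-image requests grouped into batches of two.
for batch in batch_requests(["img0", "img1", "img2", "img3", "img4"], 2):
    print(batch)
```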

B

Metrics. Metrics are really important. You can always get node-level metrics, but node-level metrics only give you half the story. If you really want to make smart, intelligent decisions on when to scale your deployment, you want to have application-level metrics, and our TensorRT Inference Server provides them.

B

These are the metrics that we provide. We have different ways to look at GPU utilization, at the inference load, and at the latency. All of these are exposed as Prometheus metrics and can just be absorbed and viewed in Grafana. And again, that's an example that's being run in the booth right now.
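
As a rough illustration of consuming such metrics, the sketch below parses a snippet in the Prometheus text exposition format and checks one value against a threshold. The metric names are invented for the example and are not the names the inference server actually exports; the parser also assumes samples carry no trailing timestamp and no spaces inside label values.

```python
from typing import Dict

# Invented sample in Prometheus text exposition format.
SAMPLE = """\
# HELP example_gpu_utilization GPU utilization (0.0 to 1.0)
# TYPE example_gpu_utilization gauge
example_gpu_utilization{gpu="0"} 0.85
# HELP example_inference_requests_total Number of inference requests
# TYPE example_inference_requests_total counter
example_inference_requests_total{model="resnet152"} 1500
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each 'name{labels}' series to its sample value."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


def above_threshold(samples: Dict[str, float], series: str, threshold: float) -> bool:
    """True when the series exists and exceeds the threshold."""
    return samples.get(series, 0.0) > threshold


metrics = parse_metrics(SAMPLE)
print(metrics)
print(above_threshold(metrics, 'example_gpu_utilization{gpu="0"}', 0.8))
```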

B

So with that, you get some examples here with Helm charts. You install the Prometheus software as usual (thank you, Prometheus), and then you get these dashboards. Here's one example, and here's another, and this is what's actually running down in our booth.

A

So the question that you might ask when we run through this whole pipeline is: how does this integrate with Kubernetes? Kubernetes is a really interesting area of focus right now, because containers, Kubernetes and GPUs all synergize really well.

A

The way that we see things is that before Kubernetes and containers, people were building their models on their own machines and going through the process of training on those machines. Once they had their applications, they would hand them to their IT ops or sysadmins.

A

These people would then start a VM or run them on their cluster. That old way of doing things is starting to change. People who are building clusters and their own clouds are thinking about how this integrates with Kubernetes and how it interfaces with GPUs and HPC. The rise of compute has allowed this new way of seeing the cluster as the thing on which you let your users run their GPU workloads and their HPC workloads.

A

And so people start building their data center not just as one single place where you store all your machines, but also as a place where you want to be able to have HPC, to have data flow, and to have your users run one model on one machine, one model on multiple machines, or multiple models on multiple machines.

A

This new way of seeing your cluster has been enabled not only by Kubernetes and containers, but also by the rise of HPC, so high performance computing, and the relevance of ASICs. NVIDIA has been in this space for the past two or three years with the NVIDIA Container Runtime, also known as nvidia-docker. This integrates pretty well with containers.

A

It actually integrates with a lot of runtimes out there: Docker, CRI-O, Singularity, LXC. We integrate at the runtime level, which allows us to have one piece of software that you can use for all the different runtimes, with behavior that is exactly the same on all of them.

A

How this ties in with Kubernetes is that it allows three major use cases. The first one is enabled through Kubernetes with namespaces, quotas, priority and preemption, and a lot of other tools that Kubernetes lets you use around resource attribution. How do I say that this user is going to be able to have this machine, or how many GPUs, or how many machines in general?

A

Do I basically cordon a few nodes and then give him or her SSH access, or do I allow him or her to write a pod spec that describes what his or her job is going to be, and then try to run that on the available nodes? The second one, and this is the use case that Kubernetes has mostly been built around, is running production workloads. How do I do inference in production?

A

As we've seen with Ryan, running production AI is a lot more complicated than your traditional microservice, because it is a pipeline, and the slowest element in your pipeline is going to define how fast you're going to go, how fast you're going to serve your requests. The last use case that Kubernetes allows is cloud bursting. How do I take my on-premise cluster, and when I don't have enough resources, what do I do? Cloud bursting is a really interesting way of seeing things: instead of trying to plan for capacity, I can now instantly burst into the cloud.

A

This is a really interesting use case that Kubernetes allows. So we're going to see how we took the different CNCF projects and used them with NVIDIA GPUs and NVIDIA products to build that AI production pipeline. Here you can see one of the example production deployments that we have.

A

If we run through the user path, you can see that the user is going to communicate with your API endpoint using gRPC. That API endpoint is going to go through some pre-processing steps in a pod and then send the request to your TensorRT Inference Server. gRPC and Envoy are used all along that process, so that these services can communicate with your TensorRT server, or more generally your inference server.

A

Your inference server is going to do the inference work and then send the result back to your service to do some post-processing before it is sent back to the user. Finally, the models that you see in the TensorRT Inference Server are going to be served by a model repository, which might or might not be backed by things like Rook. And overall, everything is running on this.

A

You can see that Prometheus is going to gather the metrics and serve them to the consumers. A consumer might be, for example, your autoscaler, which is going to be able to scale your different inference server pods based on load.
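
The talk does not say which autoscaler or which metric was used. As one concrete possibility, the sketch below implements the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`, including its tolerance band (0.1 by default). The metric values in the usage lines are invented.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
) -> int:
    """Replica count following the documented HPA scaling formula."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Within the tolerance band the autoscaler leaves the count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical metric: average GPU utilization in percent, target 60.
print(desired_replicas(2, current_metric=90, target_metric=60))  # scale up
print(desired_replicas(4, current_metric=62, target_metric=60))  # within tolerance
print(desired_replicas(4, current_metric=30, target_metric=60))  # scale down
```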

B

So when we showed the flowers demo running earlier, you saw exactly this pipeline. The client here is an OpenGL application that is sending gRPC requests to a middleman service. That middleman service is built with nvRPC. For demo purposes, we do a little trick: because we can't guarantee the Wi-Fi is going to be good, the dataset actually lives in that middleman service.

B

So basically, the laptop is sending a request to say "please infer this image". That's handled by the middleman service, which packages that data into another gRPC request that goes into the TensorRT Inference Server. It comes back, we post-process it to provide the label, and we send it back.

B

So let's go back and look at the demo again, and we're going to add one more component. You can see that the network is a little bit jittery here. One of the things that we did was build a control plane around Envoy. We basically have our own Envoy discovery service, so we can control the endpoints.

B

If we all go on airplane mode, maybe this gets better, because I'm on 2.4 GHz Wi-Fi here. But if we jump back into our Kubernetes deployment, I'm just going to scale this up. I'm going to scale my deployment to 16. Oh boy, I'm not attached.

B

Live demos, gotta love it. Okay, it's scaled. So when we go back here, we should start to see them showing up in our web UI. Our web UI is basically just monitoring the Kubernetes API, and we're controlling this directly.

B

Normally that would just go into your load balancer, but because we're telling the story didactically, and we have really bad Wi-Fi, we're not going to try to actually scale this up to 16. Man, it'd be great if we could just get it smooth again. But we can go and add one in. Come on, make the request. There we go. So that basically modified the Envoy discovery service and added the second GPU into the mix, and you can see we can now jump up to about 3,000 images a second, if all goes well on the network.

B

So we can crank this up and play with it on demand. It's kind of a fun demo, except that the Wi-Fi is miserable.

A

And so actually running this on Kubernetes, and running HPC pipelines on Kubernetes, has some pitfalls. The first one that you're going to hit is around resource management. This is a slide that I pulled from the KubeCon EU talk "Inside Kubernetes Resource Management". One of the things that you're going to find out is that you want to be able to set the quality of service class of your pods, and that class is defined by how you set your CPU requests and limits and your memory requests and limits.
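
How requests and limits map to a quality of service class can be sketched as follows. This is a simplified reading of the documented Kubernetes rules (Guaranteed, Burstable, BestEffort) that only looks at CPU and memory and compares quantities as literal strings, so `"500m"` and `"0.5"` would count as different here even though Kubernetes treats them as equal.

```python
from typing import Dict, List

Resources = Dict[str, Dict[str, str]]  # {"requests": {...}, "limits": {...}}


def qos_class(containers: List[Resources]) -> str:
    """Classify a pod from the resources of its containers."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            request = requests.get(resource)
            limit = limits.get(resource)
            if request is not None or limit is not None:
                any_set = True
            if limit is None:
                guaranteed = False  # Guaranteed needs a limit on everything
            elif request is not None and request != limit:
                guaranteed = False  # request must equal limit when both are set
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"limits": {"cpu": "4", "memory": "8Gi"}}]))
print(qos_class([{"requests": {"cpu": "2"}, "limits": {"cpu": "4", "memory": "8Gi"}}]))
print(qos_class([{}]))
```

A request that is omitted defaults to the limit, which is why the first pod above is classified as Guaranteed. With the kubelet's static CPU manager policy, exclusive cores are only given to Guaranteed pods that request whole CPUs.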

A

This basically allows you to tell Kubernetes whether your pod is important or not, and Kubernetes is going to evict the different pods based on this priority scale. It also allows you to tell Kubernetes whether you're using static CPU pinning or not. One of the issues that you currently have with Kubernetes shows up when you're running HPC services, so services or pods that are compute bound.

A

Kubernetes uses CFS quotas, and right now there's a known bug in the Linux kernel where applications that behave correctly will actually get CPU throttled. That's a big issue when you're running CPU-bound applications.

B

To take that a step further, on top of just the Linux scheduler, one of the really important things to do when you're thinking about your deployment is to align your CPUs and GPUs together, so that you don't have CPUs on one socket trying to communicate with a GPU that sits on the other socket. So topology-aware scheduling for these resources is really important and something to keep in mind as well.

A

Unfortunately, with Kubernetes, static CPU pinning is possible, but you don't control which NUMA socket you land on. That means you might get static pinning and the best CPU possible, but transferring your data from your CPU to your GPU is going to be really slow because you're not on the right socket. So Kubernetes has a lot of pitfalls around running HPC applications, but this is an ongoing conversation in the community, and you can expect to see some solutions to this as time goes on.
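
The alignment problem can be illustrated with a small check. The topology below is hypothetical (a two-socket machine with 12 CPUs per socket and a GPU attached to socket 1). On a real Linux host, a PCI device's NUMA node can be read from `/sys/bus/pci/devices/<address>/numa_node`; the sketch uses hard-coded values so that it runs anywhere.

```python
from typing import Dict, Iterable


def numa_aligned(
    pinned_cpus: Iterable[int],
    cpu_to_node: Dict[int, int],
    gpu_node: int,
) -> bool:
    """True when every pinned CPU sits on the same NUMA node as the GPU."""
    return all(cpu_to_node[cpu] == gpu_node for cpu in pinned_cpus)


# Hypothetical topology: CPUs 0-11 on node 0, CPUs 12-23 on node 1.
CPU_TO_NODE = {cpu: (0 if cpu < 12 else 1) for cpu in range(24)}
GPU_NODE = 1  # assumed NUMA node of the GPU

# Pinned to the wrong socket: transfers have to cross the inter-socket link.
print(numa_aligned([0, 1, 2, 3], CPU_TO_NODE, GPU_NODE))

# Pinned next to the GPU.
print(numa_aligned([12, 13, 14, 15], CPU_TO_NODE, GPU_NODE))
```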

B

Yep. With that, we have time for questions. We don't have a mic, so you'll have to ask your question, we'll repeat it, and then...

B

Yeah, the TensorRT Inference Server has a model ingest that looks very similar, because it's based on TensorFlow Serving model stores.

B

Yeah, so the question was about SavedModel formats from TensorFlow, and yes, the TensorRT Inference Server consumes SavedModels as the new preferred way, as well as .pb files or graph defs from Caffe. I think there was one over here.

B

So the question was around multi-tenancy: can you explain what you meant by that a little bit further?

B

So different inference models, performing inference on two different models on the same device: that's actually what the TensorRT Inference Server does. You have model A and model B, and they'll have their own set of buffers. In some cases they can share buffers, but essentially, transactionally, they're completely separate, but they're also interleaved. If you need them discrete, then you just deploy them as separate services. But if you don't mind that they're in the same service, you just get better ROI and better efficiency that way, better throughput.

B

One more question.

B

It's something that we're working on. I would say at this point, no, but we could easily add it.

B

No, so each model has its own thread that's driving it. So essentially it basically comes down to the contention on the lock for some of those resources. It's not that much, but if you wanted to define it, you could go in and set some explicit priorities, and that's something we're looking at.

A

Yes, yes, that's definitely an effort in the community that we're spearheading right now. One of the documents that we're trying to push right now with the community is called the NUMA Manager.

B

One more question.

A

So fractional GPUs is an interesting concept, but the way that you want to see it right now is: can I share my models? Can I have multiple models running on my GPU? Fractional GPU is not something... You cannot see the GPU the same way as a CPU. So...

B

If Renaud and I were two different processes on the same GPU, without something like MPS, either I get exclusive access to the device or he gets exclusive access to the device. So if I'm performing a copy and then running a pipeline, my pipeline gets interrupted while his context takes over. So we don't get that efficiency, where we get that good interleaving between our two processes on the device, and that's why we have the TensorRT Inference Server to combat that for inference.

B

Right now, by co-locating the models in the same process, we get that interleaving and we get that performance, and it can be pretty big in some cases.
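The co-location idea can be pictured with a toy Python sketch: two models live in one process, and each is driven by its own thread with its own request queue, so requests for different models are handled independently and can interleave. This is not the TensorRT Inference Server API; `ModelWorker` and the stand-in "models" (plain Python functions) are hypothetical, and no GPU work happens here.

```python
import queue
import threading
from typing import Any, Callable


class ModelWorker:
    """One model, one driving thread, one private request queue."""

    def __init__(self, name: str, fn: Callable[[Any], Any]) -> None:
        self.name = name
        self.fn = fn  # stand-in for running inference on the device
        self.requests: queue.Queue = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self) -> None:
        while True:
            item = self.requests.get()
            if item is None:  # sentinel: shut the worker down
                break
            payload, reply = item
            reply.put(self.fn(payload))

    def infer(self, payload: Any) -> Any:
        """Submit one request and block until this model answers it."""
        reply: queue.Queue = queue.Queue(maxsize=1)
        self.requests.put((payload, reply))
        return reply.get()

    def stop(self) -> None:
        self.requests.put(None)
        self.thread.join()


if __name__ == "__main__":
    # Two models co-located in the same process.
    model_a = ModelWorker("model_a", lambda x: x * 2)
    model_b = ModelWorker("model_b", lambda x: x + 1)
    print(model_a.infer(3), model_b.infer(3))
    model_a.stop()
    model_b.stop()
```

Because each model has its own queue and thread, a slow request on one model does not block the other model's queue, which loosely mirrors the "transactionally separate but interleaved" behaviour described in the answers above.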

A

So if we take a step back, the way to see GPU sharing is that it's not only a problem of how do I run two processes on the same GPU. As we've highlighted during the talk, running a GPU application is really a pipeline, and every single step of that pipeline might break, and this is exactly the same problem. Where is it going to break? On your PCI throughput? Is it going to break on your Linux threading?

A

Is it going to break on getting the data from the right CPU to the GPU? So multi-tenancy and sharing a GPU is a really complex problem that does not involve only GPUs, and that's the problem that we have.

C

B

Yep, yes, exactly. I think we have time for one more.

A

C

B

Why Envoy instead of... You know, I think when I first built the demo, Envoy was one of the few L7 load balancers out there that handled HTTP/2 with gRPC. Now NGINX does it as well, and other things. So you can use something like Ambassador, which is just a wrapper around Envoy, or you can use Istio, which is another wrapper around Envoy.

C

B

Certainly, although it was a great choice because I thought that the separation of the control plane and the data plane was really nice. That allowed us to actually build that discovery service to control it explicitly, which made it kind of fun and a cool project to use. But the examples that we use, you know, with the CNCF projects, you can choose to replace those with

B

whatever's in your infrastructure. Those are just the ones that we chose to build out this demo and to evaluate how a customer might build this inference pipeline. But of course, if you have your favorites, or your infrastructure mandates that you use certain tools, you can just swap out any of these tools for those too, yeah.

A

Yeah, of course. In the end, these are blocks that you can take and choose to arrange however you think best fits your architecture. In this case, these are decisions that we can make in terms of performance and adaptability, yeah.

C

B

It's pretty fast. Thank you.

A

Okay. Thank you very much.