From YouTube: Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing - Dan Sun

Description

Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon Europe 2021 Virtual from May 4–7, 2021. Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.

Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing - Dan Sun, Bloomberg & David Goodwin, NVIDIA

Large-scale language models, such as BERT and GPT-2, have brought exciting leaps in state-of-the-art accuracy for many NLP tasks. However, BERT requires significant compute during inference, which poses challenges for real-time application performance. KFServing provides a simple model serving interface across common model servers, with a standardized REST/gRPC inference protocol, to serve a single model or multiple co-located models on CPUs or GPUs. KFServing enables hardware acceleration and autoscaling of Bloomberg's own BERT models, which are trained on a corpus of specialized financial news data. In this talk, we will discuss how we use KFServing in a production application to address scalability, latency, and throughput with Knative's Autoscaler and Activator. We will also share performance debugging tips and present GPU benchmark results for TensorFlow and PyTorch BERT models deployed with KFServing.
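As a rough illustration of the standardized REST inference protocol mentioned above, the sketch below builds a request for KFServing's V1 protocol, which exposes models at `/v1/models/<name>:predict` and accepts a JSON body with an `instances` list. The host, model name, and input field are hypothetical placeholders, not taken from the talk:

```python
import json

# Assumed values for illustration only; a real deployment's Knative route
# and model name would come from the InferenceService resource.
HOST = "bert-example.default.example.com"
MODEL_NAME = "bert-example"

def build_predict_request(texts):
    """Build a KFServing V1-protocol predict request.

    Returns the endpoint URL and a JSON body of the form
    {"instances": [...]}, as expected by the V1 REST protocol.
    """
    url = f"http://{HOST}/v1/models/{MODEL_NAME}:predict"
    payload = {"instances": [{"text": t} for t in texts]}
    return url, json.dumps(payload)

url, body = build_predict_request(["AAPL shares rose after earnings."])
print(url)
print(body)
```

Sending this body via HTTP POST to the URL would return a JSON response with a `predictions` list; the actual input schema depends on the model server behind the InferenceService.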

https://sched.co/ekC5