ONNX June 2022 Community Meetup, 13 Jul 2022

From YouTube: Build your high-performance model inference solution with DJL and ONNX Runtime

Description

In many companies, Java is the primary language teams use to build services. To onboard and integrate ONNX models, developers face several technical challenges around resource allocation and performance tuning. In this talk, we will walk you through the inference solution built with DJL, an ML library in Java. We will also share some customer success stories of model hosting with ONNX Runtime and DJL.

Qing is a Software Development Engineer at AWS. He has worked on several challenging products at Amazon, including high-performance ML inference solutions and a high-performance logging system. Qing's team successfully launched the first billion-parameter model in Amazon Ads under very low latency requirements. Qing has in-depth knowledge of infrastructure optimization and deep learning acceleration. Qing is also a PPMC member of Apache MXNet.

A

Thanks very much for joining. My name is Qing Lan, I'm one of the software engineers at Amazon Web Services, and today I would like to introduce the Deep Java Library (DJL) integration with ONNX Runtime. We have a few use cases, both inside and outside of Amazon, that have been successful, proving that ONNX Runtime actually has very good performance.

A

So let's go with the next slide.

A

So, first, let me introduce what Deep Java Library is. It's a machine learning library built on top of Java, but more at the application layer.

A

It's an abstraction over the deep learning libraries, just like Keras. It currently has a set of NumPy-like operators that we have implemented in Java, and inside those NumPy-like operators we leverage deep learning engines such as Apache MXNet, TensorFlow, and PyTorch, using their C and C++ APIs to offer the operator support.

A

On the engine side, since Deep Java Library is an abstraction layer, it offers multiple backends. It can run Apache MXNet, TensorFlow, PyTorch, ONNX Runtime, PaddlePaddle, and a lot more, even including machine learning libraries like XGBoost.

A

We also have a set of models, with support for more than 70 pre-trained models that can be run directly within Java, like the popular Hugging Face models or models from Torch Hub, ranging from basic image classification to object detection, sentiment analysis, and action recognition. The biggest advantage of Deep Java Library itself is that it is service-ready.

A

We have been doing very thorough test runs with Deep Java Library to ensure it has the best performance and also control over the memory. A service running with DJL has now been running successfully for more than half a year with no errors.

A

Okay, so here I would like to share a couple of successful customer stories that leverage ONNX and ONNX Runtime, with DJL used along the way. The first use case came from Amazon Advertising. We have been helping a team that was trying to launch a model with a very heavy infrastructure, and previously the latency in advertising did not meet the actual requirement.

A

Basically, it was about 22 milliseconds, but after switching to ONNX Runtime, the inference time was reduced from 22 milliseconds to less than 15 milliseconds.

A

That's the number at P90. Also, after switching to ONNX Runtime, we can provide thread-safe multi-threaded inference by leveraging multi-core CPUs. On the other side, after doing the conversion from MXNet to ONNX, we ran a correctness test to ensure that the outputs after the conversion are still the same, and we observed a difference of less than 1e-7 on the floating-point numbers.
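A correctness test of the kind described above can be sketched as an element-wise comparison of the two models' outputs against a tolerance. This is a minimal illustration, not the team's actual test: the class name `OutputParityCheck`, its methods, and the sample values are hypothetical, and the 1e-7 threshold is simply the figure quoted in the talk.

```java
final class OutputParityCheck {
    /** Returns the largest absolute element-wise difference between two output vectors. */
    static double maxAbsDiff(float[] expected, float[] actual) {
        if (expected.length != actual.length) {
            throw new IllegalArgumentException(
                    "Output lengths differ: " + expected.length + " vs " + actual.length);
        }
        double max = 0.0;
        for (int i = 0; i < expected.length; i++) {
            // Widen to double so the subtraction itself adds no float rounding.
            max = Math.max(max, Math.abs((double) expected[i] - (double) actual[i]));
        }
        return max;
    }

    /** True when every element differs by less than the tolerance (NaN anywhere fails). */
    static boolean withinTolerance(float[] expected, float[] actual, double tolerance) {
        return maxAbsDiff(expected, actual) < tolerance;
    }

    public static void main(String[] args) {
        // Hypothetical outputs of the original and the converted model for the same input.
        float[] original = {0.125f, 0.5f, 0.375f};
        float[] converted = {0.125f, 0.5f, 0.375f};
        System.out.println("parity: " + withinTolerance(original, converted, 1e-7));
    }
}
```

In practice such a check would be run over many representative inputs, since a single matching sample says little about the conversion as a whole.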

A

Similarly, we have stories of successfully converting from PyTorch to ONNX, which is the case with Hugging Face, and since the CTO is here, I will just let him introduce more. On the other side, we also have a successful TensorFlow-to-ONNX conversion using Deep Java Library.

A

We have the Amazon Search team here. They have old TensorFlow models, and they really wanted a performance boost on the inference side.

A

So, after converting to ONNX, the inference time was reduced from 12 milliseconds to less than 4 milliseconds. Since Deep Java Library supports multiple frameworks, they are also converting a couple of their scikit-learn models to ONNX, which also gives very good performance. Finally, there is something we did not expect, which came from the Chinese market, because a huge number of people there are leveraging PaddlePaddle, and PaddlePaddle is also one of the deep learning engines that Deep Java Library supports.

A

A few companies reached out to us to see if there was any performance gain they could get with PaddlePaddle, because their original engine doesn't support thread-safe multi-threading, so they had no way to accelerate inference by running on the CPU. After converting to ONNX with support from us, they succeeded, for example one of the top banks in China.

A

They successfully brought their OCR model's run time down from one second to less than 400 milliseconds on a single image.

A

However, most people, especially those using ONNX Runtime, always face a very big challenge: because there are no built-in operators, they cannot easily do the pre-processing and post-processing inside Java, especially n-dimensional array operations. So Deep Java Library offers a solution.

A

So we introduced a new concept called the hybrid engine. In fact, it just loads two engines together at the same time, and it also provides a smooth transition for customers, especially when they are trying to switch the engine used to run inference. Basically, all you need is a main inference engine, for which we use ONNX Runtime here, and one processing engine.

A

For that we use PyTorch here, and since we support the other operators through PyTorch, TensorFlow, and MXNet, the user can choose one of these engines to serve as the processing engine. To reduce the data copying time, instead of copying the data from the Java heap into the C++ layer, we introduced a way to use a direct buffer to send the pointers from PyTorch directly to ONNX Runtime.
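The direct-buffer idea can be illustrated with plain `java.nio`. The sketch below is not DJL's internal code: the class `DirectBufferSketch` and its method are hypothetical, and it only shows the Java-side building block, namely that a direct buffer lives outside the Java heap and that two views of it share the same memory, which is what lets native code read the data in place.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class DirectBufferSketch {
    /** Places the values in off-heap memory, laid out in the platform's native byte order. */
    static FloatBuffer toDirect(float[] values) {
        ByteBuffer bytes = ByteBuffer.allocateDirect(values.length * Float.BYTES)
                .order(ByteOrder.nativeOrder());
        FloatBuffer floats = bytes.asFloatBuffer();
        floats.put(values);
        floats.flip(); // position back to 0, limit at the number of floats written
        return floats;
    }

    public static void main(String[] args) {
        // Stand-in for pre-processed input produced by the processing engine.
        FloatBuffer preprocessed = toDirect(new float[] {1f, 2f, 3f});

        // A second view over the same off-heap memory; no element is copied.
        FloatBuffer inferenceView = preprocessed.duplicate();

        preprocessed.put(0, 42f);
        System.out.println("direct: " + inferenceView.isDirect());
        System.out.println("shared: " + (inferenceView.get(0) == 42f));
    }
}
```

A heap-backed `float[]` would typically have to be copied before native code could use it, which is the cost the hybrid engine is described as avoiding.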

A

This avoids the data copying and also provides a very large performance boost on the inference side, if you compare running end-to-end inference with PyTorch against this hybrid mode. And finally, to control memory management, we created something called NDManager. It's a tree-like architecture implemented in Deep Java Library to replace the garbage collection system inside Java and provide more cost-effective memory collection.

A

So, basically, with this NDManager built in, we can keep all of the inference very clean, without any kind of memory leaks. Okay.
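The tree-like ownership idea can be sketched in a few lines of plain Java. This is an illustration of the general pattern only, not DJL's `NDManager` implementation: `ScopedManager`, `FakeNativeBuffer`, and all method names here are hypothetical. Each manager owns the resources attached to it plus its sub-managers, and closing a manager releases its whole subtree deterministically instead of waiting for the garbage collector.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tree-structured owner of native resources (single-threaded sketch). */
final class ScopedManager implements AutoCloseable {
    private final ScopedManager parent;
    private final List<AutoCloseable> resources = new ArrayList<>();
    private boolean closed;

    private ScopedManager(ScopedManager parent) {
        this.parent = parent;
    }

    static ScopedManager newRoot() {
        return new ScopedManager(null);
    }

    /** Creates a child whose lifetime is bounded by this manager. */
    ScopedManager newSubManager() {
        return attach(new ScopedManager(this));
    }

    /** Registers a resource so that it is released when this manager closes. */
    <T extends AutoCloseable> T attach(T resource) {
        if (closed) {
            throw new IllegalStateException("manager is closed");
        }
        resources.add(resource);
        return resource;
    }

    boolean isClosed() {
        return closed;
    }

    @Override
    public void close() {
        if (closed) {
            return;
        }
        closed = true;
        // Release in reverse creation order; sub-managers release their own subtrees.
        for (int i = resources.size() - 1; i >= 0; i--) {
            try {
                resources.get(i).close();
            } catch (Exception e) {
                throw new IllegalStateException("failed to release resource", e);
            }
        }
        resources.clear();
        // Detach from a still-open parent so finished scopes do not accumulate.
        if (parent != null && !parent.closed) {
            parent.resources.remove(this);
        }
    }
}

/** Stand-in for an off-heap tensor; a real one would free native memory here. */
final class FakeNativeBuffer implements AutoCloseable {
    private boolean freed;

    boolean isFreed() {
        return freed;
    }

    @Override
    public void close() {
        freed = true;
    }
}
```

A typical use under these assumptions is one sub-manager per inference request inside a try-with-resources block, so that everything allocated for the request is freed as soon as the request finishes.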

A

So here you can see how simple the experience is going to be. Deep Java Library offers a Criteria class where users can define how their model is going to be loaded. They can provide a model URL. Basically, it can be a file URL. It can be a general HTTP URL.

A

It can even be an S3 link or an HDFS link. You may also need a translator, and the Translator is the abstraction that Deep Java Library offers for doing the pre-processing and post-processing.

A

So previously, when users were leveraging PyTorch, all they did was set PyTorch on the engine option and load the PyTorch model to run inference. If they want the performance boost from running on the ONNX engine, all they need to do is convert their model to ONNX and then, without changing a single line of code, just change that engine to ONNX Runtime, and they get the performance gain.

A

It's very simple for customers to make the transition, and that's why Deep Java Library has a lot of customers leveraging ONNX Runtime for their inference optimization. And the final slide.

A

Yeah, so what's next with ONNX Runtime and Deep Java Library? In the next phase, we are going to add support for ARM servers, because we see a growing demand from people trying to leverage ARM-based devices for the cost advantage and also the performance boost. We are also trying to get into edge devices, especially Android support. We are trying to run ONNX Runtime there as well, and hopefully we will bring the hybrid engine solution there, since we already support PyTorch inference on Android devices.

A

That's it! Thank you.