From YouTube: ONNX20210324 V13 QSforONNXusingINC
You can see the right top diagram showing the LPOT architecture. The top is the FP32 model as the input; the middle is LPOT itself, including the auto-tuner as the user-facing API and the other key components like quantizer, pruner, benchmark, and tuning strategy; and the framework adaptation layer includes the ONNX Runtime adapter and the other framework adapters.
Now, let's look at the simple usage flow and the code example at the left bottom side. Given the FP32 model as the input, you just need to prepare the configure as shown below and launch LPOT with minimal lines of code change. The configure is template-based, so the user just needs to take the template and update some minimal items, for example the quantization approach. Here we use post-training static quantization and the quantization dataset; the launch code is also pretty simple.
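The template-based flow described above can be sketched roughly as follows. This is an illustrative sketch, not the actual LPOT API: the config keys and the `prepare_config` helper are assumptions made for the example.

```python
# Hypothetical sketch of the template-based configure flow described above.
# Config schema and helper names are illustrative assumptions, not LPOT's API.

# A template configure: the user only touches a few items in it.
template = {
    "model": {"framework": "onnxrt"},
    "quantization": {
        "approach": "post_training_dynamic_quant",  # item the user updates
        "calibration": {"dataloader": None},
    },
    "tuning": {
        "accuracy_criterion": {"relative": 0.01},   # keep accuracy loss <= 1%
        "exit_policy": {"timeout": 0},              # 0 = tune until criterion met
    },
}

def prepare_config(template, approach, calib_dataloader):
    """Update only the minimal items in the template (hypothetical helper)."""
    cfg = dict(template)
    cfg["quantization"]["approach"] = approach
    cfg["quantization"]["calibration"]["dataloader"] = calib_dataloader
    return cfg

# Post-training static quantization with a quantization dataset, as in the talk.
cfg = prepare_config(template, "post_training_static_quant",
                     calib_dataloader=["batch0", "batch1"])
print(cfg["quantization"]["approach"])
```

The point of the template is that the user edits two or three fields and leaves the rest of the configure untouched.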
Then you can see the right bottom diagram showing the auto-tuning flow. Given the configure with the accuracy criteria and the timeout, LPOT will generate a quantized model with a quantization configure, driven by the tuning strategy. Once the accuracy meets the criteria, or the other objectives meet their criteria, or the tuning timeout is reached, the tuning flow will stop with the best quantized model, with the trade-off of the accuracy and the other objectives.
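The stop conditions of that flow can be sketched as a simple loop; the strategy, evaluation function, and accuracy numbers below are simplified stand-ins for illustration, not LPOT internals.

```python
import time

def auto_tune(fp32_acc, candidate_configs, quantize_and_eval,
              relative_loss=0.01, timeout_s=3600.0):
    """Try quantization configs in the order chosen by the tuning strategy
    until the accuracy criterion is met or the timeout is reached; return
    the best (accuracy, config) trade-off seen so far."""
    deadline = time.monotonic() + timeout_s
    best = None
    for cfg in candidate_configs:       # order supplied by the tuning strategy
        acc = quantize_and_eval(cfg)    # quantize model per cfg, then evaluate
        if best is None or acc > best[0]:
            best = (acc, cfg)
        if acc >= fp32_acc * (1 - relative_loss):
            break                       # accuracy criterion met: stop tuning
        if time.monotonic() > deadline:
            break                       # timeout reached: stop with best so far
    return best

# Toy run: three candidate configs with known accuracies, FP32 baseline 0.76.
accs = {"cfg_a": 0.70, "cfg_b": 0.757, "cfg_c": 0.76}
best = auto_tune(0.76, ["cfg_a", "cfg_b", "cfg_c"], lambda c: accs[c])
print(best)
```

Here `cfg_b` already satisfies the 1% relative-loss criterion, so the loop stops without trying `cfg_c`, which mirrors the "stop once accuracy meets the criteria" behavior in the diagram.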
Now we want to show the quantization productivity with LPOT, reducing the time from days to minutes, showing the significant productivity improvement compared with a human expert. You see, LPOT is helping in all three key aspects of quantization, including effective model calibration, advanced quantization recipes, and the systematic auto-tuning flow.
Effective model calibration saves the effort to collect the tensor statistics from scratch, advanced quantization recipes help reach higher accuracy and shorten the tuning space, and the systematic auto-tuning flow, as described on the previous page, greatly relieves the manual effort to tune the accuracy per quantization recipe, especially for a big model. For example, for ResNet-50 there are 53 convolutions, so people would need to tune layer by layer across those quantization recipes, which is a very huge space.
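To put a number on that tuning space: even if each of the 53 convolutions only gets a binary choice (quantize it or keep it FP32), the space is already 2^53 configurations, and with more recipe options per layer it grows much faster still. A quick back-of-envelope check:

```python
# Per-layer binary choice (quantize vs. keep FP32) for ResNet-50's 53 convolutions.
layers = 53
space = 2 ** layers
print(space)  # on the order of 9e15 possible configurations
# With, say, 4 recipe options per layer the space becomes 4**53 -- far beyond
# what a human expert can explore by hand, hence the auto-tuning strategy.
```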
A
So,
overall,
we
expect
the
output
will
improve
up
to
90
quantization,
productive
improvement,
and
we
already
received
very
positive
feedback
from
our
customer
and
in
the
real
use
case
below.
Is
the
table
to
show
the
the
model
using
our
port
compared
with
the
ap
cylinder
baseline
and
the
models
cover
the
typical
workloads
like
razer,
50,
vg,
computer
vision
and
the
various
birth
and
the
birth
variant
models?
You can see the accuracy is kept within one percent loss, and the tuning time is less than 40 minutes; for those BERT models it is actually just about one minute. We expect to improve the productivity further with more advanced tuning strategies or quantization recipes, as well as more optimal quantized kernels for ONNX Runtime.
Meanwhile, LPOT supports two quantization approaches: static quantization and dynamic quantization. Static quantization is mainly for the computer vision workloads, for example VGG, and dynamic quantization is mainly for the NLP models, for BERT and the other transformer models.
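The difference between the two approaches can be shown with a toy affine quantizer: static quantization fixes the scale and zero-point offline from calibration statistics, while dynamic quantization computes them from each activation tensor at runtime. This is a generic illustration of the concept, not ONNX Runtime's actual quantized kernels.

```python
def affine_params(lo, hi, bits=8):
    """Scale and zero-point mapping the range [lo, hi] onto unsigned ints."""
    qmax = (1 << bits) - 1
    scale = (hi - lo) / qmax or 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point

def quantize(xs, scale, zp, bits=8):
    """Affine-quantize a list of floats, clamping to the int range."""
    qmax = (1 << bits) - 1
    return [min(max(round(x / scale) + zp, 0), qmax) for x in xs]

# Static: params come from calibration data collected offline.
calib = [-1.0, 0.5, 2.0, -0.3]
s_scale, s_zp = affine_params(min(calib), max(calib))

# Dynamic: params come from the activation tensor observed at runtime.
runtime_tensor = [0.1, -0.4, 1.5]
d_scale, d_zp = affine_params(min(runtime_tensor), max(runtime_tensor))

print(quantize(runtime_tensor, s_scale, s_zp))
print(quantize(runtime_tensor, d_scale, d_zp))
```

Dynamic quantization always uses the tensor's own range, which is why it suits NLP models whose activation ranges vary a lot per input; static quantization avoids the runtime range computation, which suits throughput-oriented vision workloads.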
Now, let's talk about the release collaborations and the plans. LPOT v1.2 supported ONNX Runtime 1.6 with the opset and operator-wise quantization tuning, and the later release v1.3 supports ONNX Runtime 1.7 with the new quantized operators. LPOT v1.4 will integrate the Python optimizer tool introduced by ONNX Runtime 1.7 and support more flexible graph transformation for ONNX models, based on community collaborations.
So in the future, we plan to continue improving LPOT, contribute the quantized models to the ONNX Model Zoo, and enrich our product release distribution channels; for example, we will release a Docker binary and also the nightly-build binary to the community.