From YouTube: INT8 Inference of Quantization-Aware trained models using ONNX-TensorRT

Description

Accelerating deep neural network (DNN) inference is an important step in realizing latency-critical deployment of real-world applications such as image classification, image segmentation, and natural language processing. The need to improve DNN inference latency has sparked interest in running these models in lower precisions, such as FP16 and INT8. In particular, running DNNs in INT8 precision can offer faster inference and a much lower memory footprint than their floating-point counterparts. [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) supports Quantization-Aware Training (QAT) techniques for converting floating-point DNN models to INT8 precision. In this talk, we demonstrate the end-to-end workflow of converting TensorFlow QAT models into ONNX, a standard intermediate representation, and deploying them with TensorRT. We use the TF2ONNX package to convert a quantized TensorFlow model into ONNX. The ONNX format also makes it easier to visualize the graph with Netron, which shows users where the quantized nodes have been placed.
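As a rough illustration of this workflow (not the exact steps shown in the talk), the sketch below converts a quantized TensorFlow SavedModel to ONNX with TF2ONNX and then builds an INT8 TensorRT engine from the resulting file. The paths `qat_saved_model/`, `qat_model.onnx`, and `qat_model.engine` are placeholders, and exact API details may vary across TensorRT versions.

```python
# Step 1 (command line): convert the quantized TensorFlow SavedModel to ONNX.
#   python -m tf2onnx.convert --saved-model qat_saved_model/ \
#          --output qat_model.onnx --opset 13
# The resulting qat_model.onnx can be opened in Netron to inspect where the
# QuantizeLinear/DequantizeLinear (Q/DQ) nodes were placed.

# Step 2: parse the ONNX file and build an INT8 TensorRT engine.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("qat_model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
# The Q/DQ nodes exported from the QAT model already carry quantization scales,
# so no post-training calibrator is required; enabling INT8 lets TensorRT honor
# those explicit ranges.
config.set_flag(trt.BuilderFlag.INT8)

serialized_engine = builder.build_serialized_network(network, config)
with open("qat_model.engine", "wb") as f:
    f.write(serialized_engine)
```

The same build could be done with the `trtexec` command-line tool instead of the Python API; the sketch above just makes the parse-then-build sequence explicit.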

Dheeraj Peri works as a deep learning software engineer at NVIDIA. Before that, he was a graduate student at the Rochester Institute of Technology in New York, working on deep learning-based approaches for content retrieval and handwriting recognition tasks. Dheeraj's research interests include information retrieval, image generation, and adversarial machine learning. He received his bachelor's degree from the Birla Institute of Technology and Science, Pilani, India.