ONNX June 2022 Community Meetup, 13 Jul 2022

From YouTube: QONNX: A proposal for representing arbitrary-precision quantized NNs in ONNX

Description

We present extensions to the Open Neural Network Exchange (ONNX) intermediate representation format to represent arbitrary-precision quantized neural networks. We first introduce support for low precision quantization in existing ONNX-based quantization formats by leveraging integer clipping, resulting in two new backward-compatible variants: the quantized operator format with clipping and quantize-clip-dequantize (QCDQ) format. We then introduce a novel higher-level ONNX format called quantized ONNX (QONNX) that introduces three new operators —Quant, BipolarQuant, and Trunc— in order to represent uniform quantization. By keeping the QONNX IR high-level and flexible, we enable targeting a wider variety of platforms. We also present utilities for working with QONNX, as well as examples of its usage in the FINN and hls4ml toolchains. Finally, we introduce the QONNX model zoo to share low precision quantized neural networks.

A

So, as I was saying, today I'm also going to be talking about quantization, but in particular about low-precision quantization, meaning quantization below 8 bits. If you're not familiar with the literature, low-precision quantization, down to binary values, can be extremely effective in extracting additional performance gains: latency benefits, power and energy benefits.

A

Just to give an example from some work I was recently involved with in the area of RadioML, so machine learning for wireless communication.

A

We were looking at a modulation classification task. With a 4-bit convolutional neural network, we managed to achieve, on one of our RFSoC platforms, 1.7 billion samples per second of throughput at less than three microseconds of latency, thanks to reduced precision. So it works. It's very effective, thanks mostly to quantization-aware training. In this presentation, I'm not going to talk about quantization-aware training; I'll just cover the very fundamentals of uniform quantization, and I'm going to be concerned with how we represent low-precision quantized neural networks in our tools and workflows.

A

So, just for the very basics, as was also mentioned in the previous presentation, you can think about quantization, particularly uniform affine quantization, as a combination of two functions. The first is the quantize one, which maps a floating-point value to an integer one by means of a scale, a zero point, and a choice of precision as well as sign.

A

These determine your clipping boundaries. Then there is the dequantize function, which maps your integer value back to a floating-point representation. The combination of these two functions is typically referred to as fake quantization, meaning a float-to-float mapping through quantization and then dequantization. The idea, as was also presented earlier, is that you can then map this representation to an integer-only

A

acceleration strategy through appropriate transformation and fusion strategies. Currently, if you look at how quantization in ONNX is represented, particularly at 8 bits, you have two predominant styles: the QDQ one, which leverages QuantizeLinear and DequantizeLinear nodes to represent fake quantization, and the so-called QLinear ops, which was actually the original way to represent quantized neural networks in ONNX and is more oriented towards representing already-fused operators, typically with fused output quantization.
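The quantize and dequantize functions described above, and their composition into fake quantization, can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names (`int_bounds`, `quantize`, `dequantize`, `fake_quantize`) are hypothetical and not taken from ONNX or any library, and round-half-to-even is assumed as the rounding mode.

```python
from typing import Tuple


def int_bounds(bit_width: int, signed: bool) -> Tuple[int, int]:
    """Integer clipping boundaries implied by the choice of precision and sign."""
    if signed:
        return -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    return 0, 2 ** bit_width - 1


def quantize(x: float, scale: float, zero_point: int,
             bit_width: int, signed: bool) -> int:
    """Map a floating-point value to an integer (assumed: round half to even)."""
    lo, hi = int_bounds(bit_width, signed)
    q = round(x / scale) + zero_point  # Python's round() rounds half to even
    return max(lo, min(hi, q))         # clip to the representable range


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an integer value back to a floating-point representation."""
    return (q - zero_point) * scale


def fake_quantize(x: float, scale: float, zero_point: int,
                  bit_width: int, signed: bool) -> float:
    """Float-to-float mapping: quantize, then dequantize."""
    q = quantize(x, scale, zero_point, bit_width, signed)
    return dequantize(q, scale, zero_point)


if __name__ == "__main__":
    print(fake_quantize(1.3, scale=0.5, zero_point=0, bit_width=8, signed=True))
```

An integer-only backend would keep the result of `quantize` and fold the scale into neighbouring operators instead of calling `dequantize` at run time.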

A

It comes with limitations, though. The main issue when it comes to low-precision quantization is that there isn't any standardized way to represent data quantized below 8 bits, nor at other precisions like, I don't know, 12 bits or 16 bits, which can be relevant, for example, in LSTMs. So our first suggestion for extending the representational power of ONNX for low-precision

A

neural networks is to leverage clipping. The idea of this strategy is basically to maintain retro-compatibility with all the existing libraries and tools around quantization, but to add an extra clipping function, which is already supported over integer boundaries, between the QuantizeLinear node and the DequantizeLinear node, to represent the precision that we want to induce at that boundary.

A

So, for example, in the case I'm showing here on the left, to represent a 4-bit quantization, I'm clipping between -8 and 7. The idea is that, with existing tools, you can still get the correct result, and potentially even 8-bit acceleration. Newer, smarter back-ends can then potentially, by analyzing the integer boundaries, take advantage of the extra acceleration opportunity offered by the reduced precision, and so go down to actual 4-bit acceleration.
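The quantize-clip-dequantize pattern from this example can be mimicked in plain Python as follows. This is a sketch under assumptions rather than ONNX code: the helper names are hypothetical, a zero point of 0 and a scale of 0.5 are chosen only for illustration, and the 8-bit quantize step assumes round-half-to-even with saturation to the signed 8-bit range.

```python
def quantize_linear_int8(x: float, scale: float, zero_point: int = 0) -> int:
    """8-bit quantize step: round, shift by zero point, saturate to int8."""
    q = round(x / scale) + zero_point  # Python's round() rounds half to even
    return max(-128, min(127, q))


def clip(q: int, lo: int, hi: int) -> int:
    """Integer clipping inserted between the quantize and dequantize steps."""
    return max(lo, min(hi, q))


def dequantize_linear(q: int, scale: float, zero_point: int = 0) -> float:
    """Map the integer back to floating point."""
    return (q - zero_point) * scale


def qcdq(x: float, scale: float, zero_point: int = 0,
         lo: int = -8, hi: int = 7) -> float:
    """Quantize to 8 bits, clip to the 4-bit signed range, dequantize."""
    q8 = quantize_linear_int8(x, scale, zero_point)
    q4 = clip(q8, lo, hi)  # every surviving value fits in 4 bits
    return dequantize_linear(q4, scale, zero_point)


values = [-10.0, -1.3, 0.0, 1.3, 10.0]
outputs = [qcdq(v, scale=0.5) for v in values]

if __name__ == "__main__":
    print(outputs)
```

Every intermediate value is still a valid 8-bit integer, so a tool that only understands 8-bit quantization computes the same result, while a backend that inspects the clip bounds can infer that 4 bits are enough.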

A

One of the issues of this approach, though, is that it extends only as far as QuantizeLinear allows you to go as an operator, and currently QuantizeLinear is limited to only an 8-bit output. It also comes with some other limitations. For example, something that we are sometimes concerned with in our line of work is adopting different types of rounding, which is something that you cannot easily represent with this style of representation. And so something that we have been working on for a while

A

now is this quantization-oriented ONNX dialect that we like to call QONNX, which basically tries to simplify the quantization representation by merging this sequence of operations for fake quantization into just one node. At the same time, it extends it to a wider set of scenarios by taking, for example, a broadcastable bit width as input as well.

A

It also gives you options around different types of rounding, as well as options for binary quantization and so on. This format is something that we have been leveraging a lot, also as part of our efforts for deployment on FPGAs in particular.
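To make the single-node idea concrete, here is a minimal plain-Python sketch of what such a node might compute: fake quantization in one call, with the bit width passed as an input and a selectable rounding mode, plus a binary (bipolar) variant. The function names, parameter names, and rounding-mode strings are illustrative assumptions, not the QONNX operator definitions.

```python
import math

# Assumed rounding modes for illustration only.
ROUNDING = {
    "ROUND": round,      # Python's round(): round half to even
    "FLOOR": math.floor,
    "CEIL": math.ceil,
}


def quant(x: float, scale: float, zero_point: float, bit_width: int,
          signed: bool = True, rounding_mode: str = "ROUND") -> float:
    """Single-node fake quantization: scale, round, clip, rescale."""
    if signed:
        lo, hi = -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    else:
        lo, hi = 0, 2 ** bit_width - 1
    q = ROUNDING[rounding_mode](x / scale + zero_point)
    q = max(lo, min(hi, q))
    return (q - zero_point) * scale


def bipolar_quant(x: float, scale: float) -> float:
    """Binary quantization to the two values -scale and +scale."""
    return scale if x >= 0 else -scale


# Because the bit width is an input rather than a fixed type, it can
# differ per channel: here two channels use 3 and 4 bits respectively.
channels = [10.0, 10.0]
bit_widths = [3, 4]
per_channel = [quant(x, 0.5, 0, bw) for x, bw in zip(channels, bit_widths)]

if __name__ == "__main__":
    print(per_channel)
    print(quant(1.3, 0.5, 0, 4, rounding_mode="FLOOR"))
    print(bipolar_quant(-0.2, 0.5))
```

Compared with a quantize-clip-dequantize chain, the precision here is a parameter of the node itself, so it is not capped at 8 bits by an intermediate integer type.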

A

This is part of a common effort called Fast Machine Learning, and so we have available a variety of tools for dealing with QONNX that integrate with ONNX Runtime, as well as various pre-trained low-precision models that you can find in our model zoo. QONNX is already integrated as part of the Brevitas Python quantization library, which is something I work on and that we leverage a lot internally and with customers, and QCDQ is also going to be integrated in the next release.

A

We use QONNX a lot as part of two libraries for dataflow inference of low-precision neural networks on FPGAs, the FINN toolchain and the hls4ml toolchain, which have very good adoption in the FPGA community. The format is also being adopted both by the HAWQ quantization library and by QKeras, by means of a converter that we are currently implementing.

A

We obviously would love to work with the broader community on this topic, and particularly on QONNX, so feel free to reach out if you are interested in learning more. Thanks for your time.