From YouTube: ONNX20210324 V06 ONNXRuntimeforMobileScenarios
A: I'm happy to announce that tf2onnx now supports creating ONNX models directly from TFLite. Download the tool via GitHub or pip, and use the --tflite flag to perform a conversion.
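For example, a minimal invocation of the tf2onnx CLI as described above (file names are placeholders):

```
python -m tf2onnx.convert --tflite model.tflite --output model.onnx
```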
Direct conversion is particularly useful when a model is only published in the TFLite format, or when you want to utilize optimizations that are only present in the TFLite version of a model, such as quantization. Also, TFLite models are generally simpler than their TensorFlow counterparts, so tf2onnx may be able to convert the TFLite version of a model even if conversion from TensorFlow fails.
The conversion process itself is relatively straightforward. In the rewriter phase, we look for sets of common TFLite graph patterns that can be efficiently merged into individual ONNX ops. Next, in the handler phase, we convert the remaining TFLite ops to ONNX. Since TensorFlow Lite ops are often similar to their TensorFlow counterparts, we reuse our existing TensorFlow-to-ONNX logic where possible. We may need to insert Cast, Reshape, Transpose, and similar ops to account for differences between the TensorFlow Lite and ONNX op specifications.
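As an illustration of the handler idea, here is a hedged sketch (this is not tf2onnx's actual internal API; the function and tensor names are invented) of mapping a TFLite FULLY_CONNECTED op to ONNX with an inserted Transpose bridging the weight-layout difference:

```python
# Illustrative only: TFLite FULLY_CONNECTED stores weights as
# (out_units, in_units), while ONNX MatMul expects (in_units, out_units),
# so a Transpose node is inserted to account for the spec difference.
from onnx import helper

def convert_fully_connected(inp: str, weight: str, out: str):
    transpose = helper.make_node(
        "Transpose", [weight], [weight + "_T"], perm=[1, 0]
    )
    matmul = helper.make_node("MatMul", [inp, weight + "_T"], [out])
    return [transpose, matmul]
```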
One aspect of TFLite that requires a bit more care is quantization. In TFLite, quantization data, such as the scale and zero-point values, is associated directly with the model tensors, and if an op has quantized inputs and outputs, TFLite will automatically use the quantized version of that op. In ONNX, we can do something similar by converting each quantized tensor into a pair of quantize/dequantize (QuantizeLinear/DequantizeLinear) ops; ONNX Runtime will automatically substitute in the quantized version of an op if all of its inputs and outputs are quantized.
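A small sketch of that quantize/dequantize pairing using the onnx helper API (the tensor names and quantization parameters here are made up for illustration):

```python
# Each quantized tensor carries its TFLite scale and zero-point into a
# DequantizeLinear/QuantizeLinear pair (values are illustrative).
import numpy as np
from onnx import helper, numpy_helper

x_scale = numpy_helper.from_array(np.array(0.0235, dtype=np.float32), "x_scale")
x_zp = numpy_helper.from_array(np.array(12, dtype=np.uint8), "x_zp")

dequant = helper.make_node(
    "DequantizeLinear", ["x_quant", "x_scale", "x_zp"], ["x_float"]
)
quant = helper.make_node(
    "QuantizeLinear", ["y_float", "x_scale", "x_zp"], ["y_quant"]
)
```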
If you want to remove quantization from a TFLite model, you can use the --dequantize flag during conversion.
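For example (same placeholder file names as before; the flag name is as given in the talk):

```
python -m tf2onnx.convert --tflite quantized.tflite --dequantize --output float.onnx
```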
Quantization clips values outside the expressible range of the quantized data types, and TFLite often relies on this clipping instead of explicit ReLU and ReLU6 ops. tf2onnx will use the range of the quantized tensors to automatically detect and re-insert the removed ReLU and ReLU6 ops.
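A hedged sketch of that range-based detection idea (not tf2onnx's actual code; the function is invented for illustration): a uint8 tensor with scale s and zero-point z can only represent values in [(0 - z) * s, (255 - z) * s], so a range clipped at 0, or at 0 and 6, implies a removed ReLU or ReLU6.

```python
from typing import Optional

def detect_removed_activation(scale: float, zero_point: int) -> Optional[str]:
    lo = (0 - zero_point) * scale
    hi = (255 - zero_point) * scale
    if lo == 0.0 and abs(hi - 6.0) < 1e-2:
        return "Relu6"  # range [0, 6] suggests ReLU6 was folded into quantization
    if lo == 0.0:
        return "Relu"   # range starting at 0 suggests ReLU was folded in
    return None
```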
B: In order to minimize binary size, we include in the build only the operator kernels required to satisfy the models that you wish to deploy. Additionally, you can reduce the types supported by these operator kernels for further reductions in binary size. A custom format is also used for the model file.
To specify the operator kernels to include in a build, a configuration file is used. The model conversion script can automatically generate this configuration file from the models you convert or, alternatively, it can be manually created and edited. As you can see from this example configuration, the syntax is quite simple, with the domain, the opset, and the operator names.
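A hedged example of such a configuration file (each line is domain;opset;comma-separated operator names; the operator list here is invented for illustration):

```
ai.onnx;12;Add,Clip,Conv,GlobalAveragePool,MatMul,Relu,Reshape,Softmax
```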
You can also limit the types that the operator kernels support. Again, the model conversion script can automatically detect these types and generate a configuration file with them or, alternatively, you can specify a global list of types to support. When using model-based type reduction, you will generally see a reduction in the kernel binary sizes of between 25% and 33%.
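For reference, a hedged sketch of the corresponding minimal-build invocation (flag names as I understand ONNX Runtime's build options; paths are placeholders):

```
./build.sh --config MinSizeRel --minimal_build \
    --include_ops_by_config required_operators.config \
    --enable_reduced_operator_type_support
```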
At a high level, this is the ONNX Runtime Mobile usage: you take your ONNX models and put them in a directory to run the conversion script against.
They will be optimized, and an optimized ORT-format model will be produced, along with a configuration file that contains the operators that are required and, optionally, the types that are required. This configuration file is used to build the ONNX Runtime package, which is then deployed to enable inferencing on device.
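A hedged sketch of that flow using ONNX Runtime's model conversion script (module path per the ONNX Runtime Mobile documentation of that era; the directory name is a placeholder):

```
python -m onnxruntime.tools.convert_onnx_models_to_ort /path/to/models
```

The resulting ORT-format model can then be loaded for inference, for example:

```python
# Load the converted ORT-format model (path is a placeholder).
import onnxruntime as ort

session = ort.InferenceSession("model.ort")
```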
As you can see here, the binary size for a build with the operators required to support MobileNet is well under one megabyte. If we enable the reduced type support, we get a 31% reduction in the size of the kernels, and that package would have a size of 325 kilobytes when compressed into the Android archive (AAR) that you would use to deploy your app.
NNAPI usage is possible on Android: based on the device capabilities and whether NNAPI is available, the model execution will dynamically adjust to use NNAPI where possible.