From YouTube: What's New in ONNX Runtime
Description
This talk shares highlights of the ONNX Runtime 1.10-1.12 releases, including notable performance improvements, new features, and expanded platform support such as mobile and web.
Ryan Hill has been with the AI Frameworks team for the past 4 years, where he has mostly worked on operator kernels, C APIs, and dynamically loading execution providers. Prior to this he worked on the Office PowerPoint team, where his most widely seen work is many of the slideshow slide transitions. For fun he likes trying to use the latest C++ features and hitting internal compiler errors.
So what is ONNX Runtime? I figure many of you will already be familiar, but for those who are not, it is a runtime for ONNX models. It is cross-platform, can target multiple CPU architectures on all major platforms, and has language bindings for many popular programming languages. Currently we're doing releases roughly quarterly.
One new feature is that we added APIs to allow op kernels to be called directly from outside of a model Run call, so that they can be used like a math library. We had users adding custom ops that extended an existing op: they would copy the internal code for the op and then add a relatively small change around it. Now they can just call the ops directly.
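To make that concrete, here is a minimal sketch in Python of using a single ONNX op like a math-library call. The direct kernel-invocation APIs described here live in the native layer, so this sketch instead shows the old workaround those APIs remove, wrapping one op in a throwaway model; all names in it are illustrative.

    # Minimal sketch: evaluate one ONNX op (Erf) like a math-library call by
    # wrapping it in a single-node model. Assumes the `onnx` and `onnxruntime`
    # Python packages; this is the old workaround, not the new direct APIs.
    import numpy as np
    from onnx import TensorProto, helper
    import onnxruntime as ort

    node = helper.make_node("Erf", inputs=["x"], outputs=["y"])
    graph = helper.make_graph(
        [node], "single_op",
        [helper.make_tensor_value_info("x", TensorProto.FLOAT, [None])],
        [helper.make_tensor_value_info("y", TensorProto.FLOAT, [None])],
    )
    model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])

    sess = ort.InferenceSession(model.SerializeToString(),
                                providers=["CPUExecutionProvider"])
    print(sess.run(None, {"x": np.array([0.0, 0.5, 1.0], dtype=np.float32)})[0])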
In 1.10, we added a transpose optimizer that pushes transpose ops through the graph and cancels them out, significantly improving performance for models requiring layout transformation. While investigating some performance numbers, we noticed we could optimize away some heap allocations by applying a small-size optimization in the op kernel code that handles shapes, and in related code that was using standard vector classes sized for shapes.
To give an idea of the improvement, one test showed a drop from 70 heap allocations down to just six on each run, and another team saw their performance improve by around 100 microseconds, going from 479 down to 360. Quantization has also seen some nice improvements: you can see here a bunch of common CNN-based models running up to 50 percent faster in the latest version.
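As a pointer for anyone trying this, here is a minimal sketch of dynamic quantization with ONNX Runtime's Python quantization tooling; both file paths are placeholders.

    # Minimal sketch: dynamically quantize a float32 model's weights to int8.
    # Assumes the `onnxruntime` Python package; both paths are placeholders.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input="model.onnx",         # float32 source model
        model_output="model.quant.onnx",  # quantized output model
        weight_type=QuantType.QInt8,      # store weights as signed 8-bit
    )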
Execution providers are how we enable ONNX Runtime to perform its best on today's various hardware possibilities. Some providers are just ONNX Runtime code implementing the op kernels with a hardware API like CUDA, while others use a software library that already implements the ops optimally for particular hardware, like the TensorRT library, which uses CUDA, or OpenVINO, which targets various Intel hardware.
So why not just use a library like TensorRT directly? Well, for running models, ONNX Runtime offers a single API that gives you the flexibility to run on almost any target hardware optimally, with very low overhead. And ONNX Runtime has a complete CPU implementation, so if something isn't supported in a provider, it will fall back to the CPU version.
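From the Python API, that priority-plus-fallback behavior looks roughly like the sketch below; the provider names are real ONNX Runtime identifiers, but "model.onnx" is a placeholder, and filtering against the available providers is just a defensive choice.

    # Minimal sketch: request providers in priority order; any op a provider
    # can't handle falls back down the list toward the complete CPU version.
    import onnxruntime as ort

    available = ort.get_available_providers()  # providers in this build
    preferred = [
        "TensorrtExecutionProvider",  # TensorRT library (uses CUDA)
        "CUDAExecutionProvider",      # ONNX Runtime's own CUDA kernels
        "CPUExecutionProvider",       # complete fallback implementation
    ]
    sess = ort.InferenceSession(
        "model.onnx",  # placeholder path
        providers=[p for p in preferred if p in available],
    )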
In developing these providers, our engineers work together with the engineers from these outside companies to ensure the best results and to build an ongoing relationship where everyone benefits. As these hardware APIs and libraries continue to be updated, we continue to update our providers along with them, to ensure that we maintain top performance and support for the latest hardware.
We saw the need for a performant, low-footprint model inferencing solution on mobile devices, and released the ONNX Runtime Mobile packages about a year ago. Since then, we've continued to invest in these platforms to improve usability for mobile developers. For example, we can now do NHWC conversion at runtime. This is not mobile specific, but we first ran into it on mobile; it's needed when you have a kernel implementation that prefers a specific layout, for example when running on ARM or using NNAPI.
The converter is aware of the layout-sensitive operators and can internally replace nodes that use them with an NHWC version. It does this by wrapping the appropriate nodes with transpose operators, and then, thanks to our transpose optimizer, it removes any transposes that effectively cancel each other out.
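You can watch the optimizer do this cancellation yourself. The following is a minimal sketch: it builds a graph with two inverse Transpose nodes around a Relu, asks ONNX Runtime to write out the graph it actually runs, and checks which nodes survive; every name in it is illustrative.

    # Minimal sketch: two inverse Transposes around a Relu should cancel.
    # Assumes the `onnx` and `onnxruntime` Python packages.
    import onnx
    from onnx import TensorProto, helper
    import onnxruntime as ort

    to_nhwc = helper.make_node("Transpose", ["x"], ["t1"], perm=[0, 2, 3, 1])
    to_nchw = helper.make_node("Transpose", ["t1"], ["t2"], perm=[0, 3, 1, 2])
    relu = helper.make_node("Relu", ["t2"], ["y"])

    graph = helper.make_graph(
        [to_nhwc, to_nchw, relu], "cancel_demo",
        [helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 3, 8, 8])],
        [helper.make_tensor_value_info("y", TensorProto.FLOAT, [1, 3, 8, 8])],
    )
    model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])

    so = ort.SessionOptions()
    so.optimized_model_filepath = "optimized.onnx"  # dump the final graph
    ort.InferenceSession(model.SerializeToString(), so,
                         providers=["CPUExecutionProvider"])

    print([n.op_type for n in onnx.load("optimized.onnx").graph.node])
    # Expect the Transposes to be gone, leaving just ["Relu"]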
We tested this using a production model that exercises our new XNNPACK support, which uses NHWC. The input was also NHWC, since the model came from TensorFlow, and all of the added transpose ops canceled out, including the initial input transpose, which is ideally what happens.

For C# users: they can use ONNX Runtime in cross-platform apps, including apps targeting Android and iOS. We currently support Xamarin and are adding .NET 6 MAUI support in the next release, fingers crossed. This will be a very interesting thing for C# developers on mobile platforms.
We also added Android and iOS packages with the full ONNX Runtime builds. This will make it simpler for users getting started with ONNX Runtime in mobile scenarios: they can use any ONNX model, and all of the opsets, operators, and types that ONNX Runtime supports are included. It has a larger binary size, but this probably isn't an issue for most users, as it's still relatively small. To give an idea, a minimal build that only includes the necessary ops might be around 2.5 megabytes, while the full build is around 8 megabytes.
ONNX Runtime Web is one of our newest offerings, as we've seen growing interest in running inference directly in the browser. We used to have a side project called ONNX.js, but this wasn't ideal: we needed to maintain two separate versions of the code, a JavaScript one and the primary C++ one, and we needed to ensure that their behavior was identical.
Now we just compile the main C++ code into WebAssembly, so ONNX Runtime Web is backed by the same core code base. As a bonus, it's faster, it uses less memory, and the resulting binary is smaller; it's basically like having another target architecture to compile for. Another big achievement for all of our JavaScript users is that we introduced a JavaScript library called onnxruntime-common, which has multiple backend implementations (web, Node.js, and React Native) behind one single common API. This allows the same JavaScript code to run on all the main web platforms.
Challenges with data pre- and post-processing have been a recurring hurdle, and we also recognized that custom operators used with ONNX Runtime that weren't officially in the ONNX spec could be shared. This led us to create ONNX Runtime Extensions, a library of shareable custom ops that can be built and run alongside the core ONNX Runtime operators. These are currently focused on model pre- and post-processing work, so the user no longer has to implement it in an outside language.
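For Python users, wiring the extensions library into a session is a one-liner on the session options. A minimal sketch, assuming the onnxruntime-extensions package is installed and using a placeholder model path:

    # Minimal sketch: register ONNX Runtime Extensions' custom ops, then load
    # a model that uses them. The model path is a placeholder.
    import onnxruntime as ort
    from onnxruntime_extensions import get_library_path

    so = ort.SessionOptions()
    so.register_custom_ops_library(get_library_path())  # load shared custom ops

    sess = ort.InferenceSession("model_with_custom_ops.onnx", so,
                                providers=["CPUExecutionProvider"])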