From YouTube: 009 ONNX 20211021 Knight ONNX TVM for dynamic shapes, control flow, quantization compiler OctoML
Description
LF AI & Data Day - ONNX Community Meeting, October 21, 2021
TVM: Dynamic Shapes, Control Flow, and Quantization Compiler
Speakers: Jason Knight (OctoML) and Andrew Luo (OctoML)
Jason Knight:
Hi there, I'm Jason Knight, co-founder and Chief Product Officer here at OctoML, and I'm here with Andrew Luo, one of our amazing ML systems engineers. We'd like to share a little bit about our experience with ONNX while building a platform on top of the Apache TVM open source project, along with some of our learnings for the ONNX community, and show some cool things we're working on that might be useful as the community thinks through future improvements.
So, without further ado: who is OctoML, and what are we doing? We're building the machine learning deployment platform: an automated platform that lets any ML end user take their model, in ONNX or other formats, put it into our system, and optimize, benchmark, and package it for a variety of different platforms, both in the cloud and on the edge, across many different devices. We're using technologies like ONNX and TVM to enable maximum performance and performance portability while keeping the experience push-button.
But the focus of this talk is the open source side, TVM and ONNX and their interaction, so let's focus back on that. If you haven't heard of TVM: what is TVM? It's an open source project started about five years ago, a toolkit designed to enable anyone to achieve high-performance machine learning on any device, in the cloud or on the edge, and to support future devices as well. So what is the project composed of? Well, at the top,
we can encode patterns like dense and sparse linear algebra and a variety of complex control flow schemes and regimes, for operators both of today and of the future. Then, one of the most important parts of TVM is its machine-learning-based set of optimizations, which take what would otherwise be a naive lowering and instead produce very high-performance code on a variety of different platforms.
A
You
know
from
cuda
and
opencl
to
wasm
and
web
gpu
spear
v,
direct
c
l,
vmir
et
cetera,
and
it's
it's
auto
tvm,
auto
scheduling
and
auto
tir
that
enable
high
performance
code
to
be
generated
without
strong
manual
intervention
or
human
expertise
involved,
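As an illustrative sketch (not from the talk), here is how one tensor-expression compute definition can be built for different backends through TVM's Python API; only the target string changes. The CUDA line is commented out because a GPU target additionally needs a GPU schedule (thread bindings) and a CUDA-enabled TVM build.

```python
import tvm
from tvm import te

# One compute definition: B[i] = A[i] * 2
n = te.var("n")
A = te.placeholder((n,), name="A", dtype="float32")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")
s = te.create_schedule(B.op)

# Build the same definition for a CPU backend...
cpu_mod = tvm.build(s, [A, B], target="llvm")
# ...or, with a GPU schedule and a CUDA-enabled build, for a GPU backend:
# gpu_mod = tvm.build(s, [A, B], target="cuda")
```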
And so, with that, I'm going to pass it over to Andrew to talk about some of our experience using ONNX in production and where we're going from here. Andrew, take it away.
Andrew Luo:
Thanks, Jason. As Jason said, ONNX remains our preferred way of importing models into TVM. It's probably the most well-developed framework front end for use in TVM, and it has helped OctoML launch our platform quickly while guaranteeing high model coverage and letting us work with a variety of different customers easily.
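A minimal sketch of that import path (the file name, input name, and shape here are hypothetical):

```python
import onnx
import tvm
from tvm import relay

# Load an ONNX model and import it into TVM's Relay IR.
onnx_model = onnx.load("model.onnx")          # hypothetical file name
shape_dict = {"input": (1, 3, 224, 224)}      # hypothetical input name/shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Compile for a CPU target.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)
```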
One challenge, though, is dynamism. By dynamism I mean shapes and values of tensors which can only really be inferred at runtime: you can't look at them at compile time and generate the code properly ahead of time. One thing we've noticed is that as ONNX gets older and as its opsets get older, more and more operators are gradually becoming dynamic.
For example, ReduceSum takes the sum of a tensor across the provided axes. In older opsets, axes was an attribute, a list of integers you could look at at compile time to generate the proper code easily. In opset 13, axes is actually an input tensor, which is a little bit harder to handle. We've also seen that this increasing dynamism across operator opsets is being done somewhat inconsistently at this point.
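To make the difference concrete, here is a sketch using ONNX's `onnx.helper` API: in opset 11 the axes are a static node attribute, while in opset 13 they arrive as a second input tensor (supplied here as a constant initializer, but in general they could be computed at runtime).

```python
from onnx import helper, TensorProto

# Opset 11: axes is an attribute, visible at compile time.
node_v11 = helper.make_node(
    "ReduceSum", inputs=["data"], outputs=["reduced"],
    axes=[1], keepdims=1,
)

# Opset 13: axes is an input tensor; only a constant initializer
# (like this one) can be inspected before runtime.
axes_init = helper.make_tensor("axes", TensorProto.INT64, dims=[1], vals=[1])
node_v13 = helper.make_node(
    "ReduceSum", inputs=["data", "axes"], outputs=["reduced"],
    keepdims=1,
)
```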
If we go to the next slide, I'll talk about why this is going to be a problem in the long run. The main point I want to get across is that ONNX really is, fundamentally, going to be a dynamic framework, because ONNX Runtime, the first-class citizen that executes ONNX models, takes a much more interpretive approach to executing these computational graphs.
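For contrast, ONNX Runtime's session API happily accepts inputs whose shapes are only pinned down at call time; a minimal sketch, assuming a model exported with a symbolic batch dimension (file and input names are hypothetical):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")     # hypothetical file name
# The batch dimension here need not be known when the model is loaded;
# the runtime resolves shapes as it interprets the graph.
x = np.random.rand(7, 3, 224, 224).astype(np.float32)
outputs = sess.run(None, {"input": x})        # hypothetical input name
```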
Luckily, even though I've listed all of these problems, a lot of the current models we see that use quote-unquote dynamic opsets from the ONNX framework aren't actually dynamic, because things like the axes input I described previously aren't true black-box tensors that you can only examine at runtime to get their values: they're actually constants, which we can fold away by running some additional analyses.
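In TVM's Relay, that analysis is available as standard passes: constant folding plus a dynamic-to-static rewrite that replaces dynamic ops with static equivalents once their shape-defining inputs turn out to be constants. A sketch, assuming `mod` is a Relay module imported from ONNX as above:

```python
import tvm
from tvm import relay

# Fold constant subexpressions, then rewrite dynamic ops whose
# arguments (e.g. an axes or shape tensor) folded to constants.
seq = tvm.transform.Sequential([
    relay.transform.FoldConstant(),
    relay.transform.DynamicToStatic(),
])
mod = seq(mod)
```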
Right now, this is not a super major problem, but it does show potential cracks for the future. If we go to the next slide, we'll talk about some of the things we're trying to do to fix this. At TVM and OctoML, we recognize some of the deficits around dynamism in TVM, so we are, of course, coming up with projects to fix this and to make various other improvements to the compiler infrastructure to better support these new, more modern workloads.
We call this project Relax, and it includes a lot of exciting things that we plan to extend into TVM. For example, we're going to support symbolic shapes, and be able to propagate those symbolic shapes through computation, deduction, and the establishment of invariants throughout these computational graphs. Looking at the example on the slide, we have a tensor of size (n, m) whose shape can now be inferred at compile time, whereas in the past we couldn't do this.
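The flavor of that deduction can be illustrated with a tiny, purely hypothetical Python sketch (this is not the Relax API): given symbolic input shapes, a compiler can both derive output shapes and check invariants like matching inner dimensions without knowing any concrete values.

```python
# Hypothetical illustration of symbolic shape propagation (not TVM/Relax API).
def matmul_shape(lhs, rhs):
    n, k = lhs
    k2, m = rhs
    # An invariant the compiler can establish symbolically, without
    # knowing the concrete values of n, k, or m.
    assert k == k2, "inner dimensions must match"
    return (n, m)

print(matmul_shape(("n", "m"), ("m", "k")))  # -> ('n', 'k')
```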
Finally, we are looking at better establishing communication between lower-level IRs and higher-level IRs. This might be useful, for example, in operator fusion, where at a lower-level IR you might very easily see that two loops should be fused together, and that in turn can inform graph-level optimizations further up.
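At TVM's loop-level IR, this kind of fusion is a one-line scheduling decision; a minimal tensor-expression sketch:

```python
import tvm
from tvm import te

n = te.var("n")
A = te.placeholder((n,), name="A", dtype="float32")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
C = te.compute((n,), lambda i: B[i] * 2.0, name="C")

s = te.create_schedule(C.op)
s[B].compute_inline()  # fuse the two elementwise loops into a single loop
print(tvm.lower(s, [A, C], simple_mode=True))
```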
As a sneak peek, it turns out that if you implement a subset of the changes I mentioned before, in a somewhat hacky fashion, inside TVM, you can actually use TVM as a compiler for training graphs. This is really exciting because, of course, a central part of TVM is the auto-tuning portion, where we can find the fastest kernel for a given task.
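A sketch of that auto-tuning flow with TVM's auto-scheduler, assuming `mod`, `params`, and `target` from an earlier import step:

```python
import tvm
from tvm import auto_scheduler, relay

# Extract tunable tasks from the model and search for fast schedules.
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tuner.tune(auto_scheduler.TuningOptions(
    num_measure_trials=200,  # small budget, for illustration only
    measure_callbacks=[auto_scheduler.RecordToFile("tuning.json")],
))

# Rebuild the model using the best schedules found during tuning.
with auto_scheduler.ApplyHistoryBest("tuning.json"):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)
```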
B
So
what
we
see
in
this
graph
here
is
that
when
we
use
tvm
in
order
to
execute
a
training
graph,
taking
a
sort
of
a
best
of
both
worlds
approach,
where
we
take
the
either
the
fastest
dvm
kernel
or
the
fastest
native
executor
for
whatever
gpu
we're
using
we'd,
see
that
we
can
actually
see
major
improvements
in
performance.
So
you
know
on
the
left.
We
see
burt
with
one
set
of
parameters
and
we
get
a
40
improvement
on
the
right.
We
see
like
a
27
improvement.
B
Jason Knight:
Thanks, Andrew. Hopefully you enjoyed that; I know we compressed a lot into a little time. If you're interested in learning more, definitely check out TVMCon, our fourth annual conference, coming up just around the corner in December. We'll be opening registration soon, so check out our website and sign up to learn more, or reach out to me directly if you have any further questions. We'd love to chat. Thanks so much for your time.