Recording

Okay, so my name is Martin Croome, from GreenWaves Technologies, and I'm Vice President of Marketing with GreenWaves.

We've been doing some work recently on ONNX support in our toolchain, and I wanted to highlight our experiences, good and bad, with doing that.

Our first product is actually shipping; it was production qualified at the beginning of last year, so we have customers using it seriously and putting it into products. Part of the way that we control energy on GAP8 is to use a strategy that moves data across the chip in a way that is predetermined at compile time.

So we don't actually use data caches inside the chip, and the reason for that is that data caches tend to be extremely inefficient on streaming workloads. You probably get about a 30% cache hit ratio, which means that 70% of your loads are being thrown away, which is a lot of energy being used up for no particularly good reason.

So what we do is use software tools, in particular a tool that we call the AutoTiler, which is essentially a memory planning tool. It takes a model of the operations that need to be done and searches for an optimal memory movement across the memory hierarchy, whether that is memory external to the chip or the L2 and L1 memory inside the chip.

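To make the idea concrete, here is a minimal sketch of the kind of search a memory planner like the AutoTiler might perform. The L1 budget, the row-wise tiling, and the transfer-count cost model are illustrative assumptions, not GreenWaves' actual algorithm.

```python
# Illustrative memory-planning search in the spirit of the AutoTiler.
# The L1 budget, tensor shape, and cost model are hypothetical.

L1_BUDGET = 64 * 1024  # bytes of cluster L1 assumed available for tiles

def plan_tiling(height, width, channels, bytes_per_elem=1):
    """Pick the largest row-tile whose input and output copies fit in L1,
    minimising the number of DMA transfers from L2/external memory."""
    best = None
    for tile_rows in range(height, 0, -1):
        tile_bytes = 2 * tile_rows * width * channels * bytes_per_elem  # in + out
        if tile_bytes > L1_BUDGET:
            continue
        num_transfers = -(-height // tile_rows)  # ceiling division
        if best is None or num_transfers < best[1]:
            best = (tile_rows, num_transfers)
    return best

print(plan_tiling(height=128, width=128, channels=8))  # -> (32, 4)
```
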
We also have a compute cluster, a multi-core compute cluster, in the GAP8 product: eight cores with a shared memory architecture. So really the most important thing for us is to bring data in, hide all of that data movement behind computation on the cluster, and keep the cluster busy.

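The classic way to hide data movement behind computation is double buffering: fetch the next tile while computing on the current one. The sketch below shows the control structure only; in plain Python the "DMA" stand-in is synchronous, whereas on a chip like GAP8 the copy would run in parallel with the compute.

```python
# Toy double-buffering loop: while the cluster computes on one tile,
# the next tile is conceptually already in flight.

def dma_fetch(tiles, i):
    """Stand-in for an asynchronous DMA read of tile i into L1."""
    return tiles[i] if i < len(tiles) else None

def process(tiles):
    results = []
    current = dma_fetch(tiles, 0)              # prime the first buffer
    for i in range(len(tiles)):
        nxt = dma_fetch(tiles, i + 1)          # start fetching the next tile
        results.append([x * 2 for x in current])  # compute on the current tile
        current = nxt                          # swap buffers
    return results

print(process([[1, 2], [3, 4], [5, 6]]))  # -> [[2, 4], [6, 8], [10, 12]]
```
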
Basically, we have a complete flow, and the AutoTiler is part of that flow. One of the things the AutoTiler takes in is quite a high-level model of the kernels that need to be executed, and we have implemented specific kernels for each operation.

So we heavily use fused kernel operations, which are handcrafted to get the maximum energy efficiency out of the platform.

We then have a tool called NNTool, which can suck in a TFLite, and now an ONNX, graph and produce that model. Essentially it is a lowering tool: it lowers the TFLite or ONNX representation onto the kernels that we have implemented inside it.

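As a rough illustration of that lowering step, the sketch below walks an ONNX graph with the public `onnx` Python package and maps each node onto a backend kernel. The kernel table and its names are hypothetical placeholders, not NNTool's actual internals.

```python
# Minimal sketch of lowering an ONNX graph onto backend kernels.
import onnx

KERNEL_TABLE = {
    "Conv": "gap_conv_kernel",     # hypothetical backend kernel names
    "Relu": "gap_relu_kernel",
    "MaxPool": "gap_pool_kernel",
}

def lower(model_path):
    model = onnx.load(model_path)
    plan = []
    for node in model.graph.node:
        kernel = KERNEL_TABLE.get(node.op_type)
        if kernel is None:
            raise NotImplementedError(f"no kernel for {node.op_type}")
        plan.append((node.name, kernel))
    return plan
```
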
I wanted to give that background so you understand some of the comments I'm going to make afterwards. GAP8, which is our first product, only operates on fixed point, so quantization is extremely important to us.

NNTool can handle those quantization steps, with various different strategies inside it, or it can suck in quantization information, currently only from TensorFlow Lite, and then use those tensor statistics, and some indications of the quantization which TensorFlow has applied, to apply a quantization that is compatible with our kernels.

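As an illustration of that kind of step, here is a minimal sketch of deriving a fixed-point format from imported min/max tensor statistics. The symmetric power-of-two (Q-format) scheme shown is an assumption for illustration, not GreenWaves' exact algorithm.

```python
# Derive a signed fixed-point format from min/max tensor statistics.
# The symmetric power-of-two (Q-format) scheme here is illustrative.
import math

def q_format_from_stats(t_min, t_max, bits=8):
    """Choose the integer/fractional bit split for a signed Q-format value."""
    magnitude = max(abs(t_min), abs(t_max), 1e-12)
    int_bits = max(0, math.ceil(math.log2(magnitude)))
    frac_bits = bits - 1 - int_bits  # one bit reserved for the sign
    return int_bits, frac_bits

print(q_format_from_stats(-3.2, 5.7))  # -> (3, 4): Q3.4 in 8 bits
```
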
So, the experience with ONNX. What has gone very well, and what I really appreciate with ONNX in particular versus TensorFlow Lite, is a really understandable operator set and structure.

There is fairly little duplication of operators doing more or less the same thing in a different way. There is great documentation: the operators are really, really well documented, and that is really appreciated by anyone doing development work with them. There is also a great operator versioning system.

So when you update your operators, the versioning system is really appreciated; it allows us to import with confidence across multiple different versions, which is a difficult thing to handle.

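That versioning information is carried in the model's opset imports; a small example of reading it on import with the `onnx` package (the model path is illustrative):

```python
# Read the opset versions a model was exported against, so an importer
# can decide, per domain, which operator semantics to apply.
import onnx

model = onnx.load("model.onnx")  # path is illustrative
for opset in model.opset_import:
    domain = opset.domain or "ai.onnx"  # empty string means the default domain
    print(f"domain={domain} version={opset.version}")

onnx.checker.check_model(model)  # validates the nodes against those opsets
```
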
There are two areas where I really wanted to say I think there could be improvement. The first is quantization, and the second is something I call fusion friendliness, which I'll come to in a following slide. But let's deal with quantization first.

So at the moment it seems like in ONNX there is a mix of some fake quantization operators and scale quantization operators, and then there are a few more complex cases, like convolution, I think, and perhaps a linear layer, where there is actually a specific quantized implementation of that kernel. And I'm really wondering: what is the goal here?

A
Is
it
to
express
a
quantized
graph
on
X
graph
directly
and
then
and
then
run
that,
or
is
it
to
provide
the
necessary
information
for
a
back
end
to
provide
its
own
quantization
scheme
for
the
graph
or
essentially
lower
onto
its
own
quantization
scheme,
whatever
that
might
be,
I
mean
if
it's
expressing
a
quantized
graph
directly
so
that
you
can
run
it
in
its
form
in
on
an
X?
A
That
seems
like
a
really
open-ended
subject,
because
you're
going
to
provide
quantized
operators
for
every
single
scheme
with
every
single
different
quantization
technique,
you
know,
are
you
going
to
start
support?
Sub
byte
quantization
variable
bit
width
quantization?
How are you going to support tensor compression? There are loads of things which map closely to the hardware and have a really direct effect in terms of performance, and I would prefer that you concentrate on the latter, or at least that it is something you consider strongly. And if the latter is something you consider strongly, then more information is needed than is currently there.

We obviously have parameter statistics, because we have the parameters. The activation statistics we don't have. TensorFlow Lite currently, for all the tensors that are brought in in a quantized graph, gives you at least minimum and maximum information on every single tensor. It would be really nice to also get standard deviation and mean information; with that we would be very happy and would be able to do quite a bit, mapping onto our existing quantized operators.

It would be nice to have some more statistics, particularly min/max, standard deviation and mean by channel, and also potentially some outlier statistics, in terms of weak and strong outliers. Those would help us do a better job on the quantization. So my suggestion is that you add statistics metadata to every tensor; you could do it just for the non-constants.

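To show what that suggestion could look like in practice: ONNX has no dedicated per-tensor statistics field today, so the sketch below invents a `stats:<tensor_name>` key convention in the model-level `metadata_props`, purely to illustrate what an exporter could record. The tensor name and values are made up.

```python
# Sketch of the proposal: attach calibration statistics per activation
# tensor. The "stats:<name>" key convention is invented for illustration;
# ONNX itself defines no such field.
import json
import onnx

def attach_stats(model, stats):
    """stats: {tensor_name: {"min": .., "max": .., "mean": .., "std": ..}}"""
    for name, s in stats.items():
        entry = model.metadata_props.add()
        entry.key = f"stats:{name}"
        entry.value = json.dumps(s)
    return model

model = onnx.load("model.onnx")  # path is illustrative
attach_stats(model, {"conv1_out": {"min": -1.2, "max": 3.4,
                                   "mean": 0.1, "std": 0.7}})
onnx.save(model, "model_with_stats.onnx")
```
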
The second area I'd like to cover is what I call fusion friendliness. There is one thing which is a big problem for us: we have optimized fused kernels, and it is very difficult to get to highly optimized fused kernels from elementary operators.

So you do have some fused operators. For example, you have a GRU operator, which is great, because TensorFlow Lite doesn't have it. But there seems to be a move towards functions containing subgraphs, that is, towards a lot of the fused operators being composed of subgraphs of elementary operators. That's fine, as long as we know what they are. And I think the solution would be to force, or in some way make it the nicest thing to do. I said "force"; maybe that's a bit strong a word, but really to encourage the exporter writers to wrap any native high-level operators on the platform they are exporting from in a function, with a function namespace and function name that somehow indicates to us where it came from: what was it before it got turned into this subgraph? This would allow us to choose to say: okay, we have a good fused version of that operator, and we can map straight onto it.

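Here is a minimal sketch of the import-side benefit being described: if exporters wrap high-level ops in named, namespaced functions, a backend can either map a function call straight onto a fused kernel or fall back to inlining its subgraph body. The domain and kernel names are hypothetical, and reading `model.functions` assumes a recent ONNX IR version.

```python
# Sketch: resolve function-wrapped nodes either to a fused backend kernel
# or to their elementary-operator body. Names are hypothetical.
import onnx

FUSED_KERNELS = {("org.tensorflow", "GRU"): "gap_gru_fused"}

def resolve(model):
    functions = {(f.domain, f.name): f for f in model.functions}
    for node in model.graph.node:
        key = (node.domain, node.op_type)
        if key in FUSED_KERNELS:
            print(f"{node.op_type}: using fused kernel {FUSED_KERNELS[key]}")
        elif key in functions:
            body = functions[key]
            print(f"{node.op_type}: inlining {len(body.node)} elementary ops")
```
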
A
So
those
are
the
the
two
points
that
I
really
wanted
to
make
and
with
both
of
those
points
implemented.
I
think
on
X
would
be
definitely
the
best
solution
for
for
graph
export
available
at
the
moment.
Thank
you.