From YouTube: ONNX Roadmap Discussion #1 20210908
Description
1. Takuya Nakaike (IBM) – New operators for data processing to cover the ML pipeline (e.g. StringConcatenator, StringSplitter, Date)
2. Adam Pocock (Oracle Labs) – C API for the C++ components of ONNX (to assist in wrapping the model-checker functionality)
3. Adam Pocock (Oracle Labs) – Better support for emitting ONNX models from languages beyond Python
A: Okay, so as you know, we have 30 minutes — a presentation of about 10 minutes each. Please leave some time for questions and discussion, and you should know that it's also recorded — which I think I haven't started yet, but I'll do it for sure as soon as I finish.
A: I appreciate it, and the recording will be posted on YouTube, for any colleagues or friends who are also interested in this topic and could not attend. Today we have three presentations: one by Takuya from IBM, and two presentations by Adam, which will be somewhat merged together. Also be aware that the next roadmap discussion is on September 17, at the European-friendly time of 10 am PST. All right — so let me now ask Takuya to present.
C: The problem is that there is no functionality in existing conversion frameworks to represent typical patterns of data pre-processing. Especially, I found that current pipeline frameworks cannot represent new-feature calculation from multiple features, and also that ONNX lacks operators to represent difficult patterns of data processing.
C: The next slide shows the overview of our machine-learning pipeline framework, called the dataframe pipeline. This framework is already open-sourced on GitHub, so if you are interested, please look at it. In this framework you can define a machine-learning pipeline by using our dataframe-pipeline API, and you can define the pre-processing steps like this.
C: For each transformer, you need to specify the input columns, the output columns, and so on. This framework runs on Python at the training phase, like this — I mean that this pipeline framework works on Python for training. After that, a model such as an XGBoost model is trained using the data that is the output of our dataframe pipeline, and our framework can consume the already-converted
C: ONNX machine-learning model and pipeline model, and can then output — export — an ONNX file which includes the pre-processing operators and the model operators. As you already know, this ONNX format can be consumed by ONNX runtimes, such as ONNX Runtime provided by Microsoft.
C: So this is our framework. In this prototype we implemented 11 dataframe transformers in Python and mapped them to ONNX operators.
C: Another difficulty is aggregation operations, such as the frequency encoder and aggregate operators. These operators take one or two columns, and some values obtained at the training phase are used at the prediction (inference) phase.
C: This example shows how we convert the frequency encoder into the ONNX operator LabelEncoder. This operator counts the frequency of the values in a column, and we create a mapping table like this. In this case, "a" appears three times, so "a" is mapped to three, and "b" appears two times, so "b" is mapped to two. Generating this kind of dictionary in Python, we convert this mapping data and embed it into the LabelEncoder like this.
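The conversion just described — count value frequencies at training time, then bake the resulting dictionary into a LabelEncoder node's attributes — can be sketched in a few lines of Python. The helper names below are illustrative, not from the actual dataframe-pipeline code; only the attribute names follow the `ai.onnx.ml` LabelEncoder convention (`keys_strings`/`values_int64s`).

```python
from collections import Counter

def frequency_mapping(column):
    """Training phase: count how often each value appears in a column."""
    return dict(Counter(column))

def label_encoder_attrs(mapping):
    """Conversion phase (hypothetical helper): flatten the dictionary into
    the parallel key/value lists that a LabelEncoder node carries as
    attributes, so no aggregation has to run inside the ONNX graph."""
    keys = sorted(mapping)
    return {"keys_strings": keys,
            "values_int64s": [mapping[k] for k in keys]}

attrs = label_encoder_attrs(frequency_mapping(["a", "b", "a", "b", "a"]))
# attrs["keys_strings"] == ["a", "b"], attrs["values_int64s"] == [3, 2]
```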
C: In other words, our approach does not perform the aggregation operation in ONNX. Instead, we generate the ONNX operator with the aggregated values embedded as attributes.
C: Also, the last three operators are new operators which we prototyped: the StringConcatenator, the StringSplitter, and the Date transformer.
C: This is the final slide — a preliminary experimental result for performance when we convert the dataframe pipeline from Python to ONNX. This is the speedup relative to the Python implementation, and the overall bar is the performance when we run all of the data pre-processing and the machine-learning model on ONNX.
C: As you can see, when we run the pre-processing on ONNX we were able to get a great speedup, such as a 300x performance improvement for categorical encoding, and so on. We also compared the prediction accuracy, like this, and there was not much difference between the original code and the accuracy of our dataframe pipeline using ONNX operators.
B: The set of operators that you showed in the table in the previous slide — the dataframe transformers — is that the complete list of transformers that we would need, or is this only based on some models that you looked at?
C: Currently we introduced only three new ONNX operators. These are the transformers already implemented in our dataframe-pipeline framework: the upper eight transformers can be mapped to existing ONNX operators, and the last three needed the introduction of new ONNX operators, yeah.
C: At this time we did not use any other calendar. I don't remember in detail, but I used an existing C date-parser library, so the implementation is not so tricky, I think.
C: Maybe it may need some such functionality, but at this time we did not pass any timezone data, so that is how it is used. I see — yeah, I remember this: we need to pass some base time to calculate the year or month and so on.
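A minimal sketch of the point being made: a Date transformer has to fix its parsing rules, base time, and timezone at conversion time, because the exported graph cannot consult an ambient locale. The function and feature names below are purely illustrative, using Python's stdlib parser rather than the C library mentioned.

```python
from datetime import datetime, timezone

# Assumed base time baked in at conversion time (here, the Unix epoch).
BASE = datetime(1970, 1, 1, tzinfo=timezone.utc)

def date_features(ts: str):
    """Parse an ISO-8601 timestamp and derive the kinds of features a
    Date transformer might emit; the offset from BASE shows why a base
    time must be embedded in the exported graph."""
    dt = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    return {"year": dt.year,
            "month": dt.month,
            "days_since_base": (dt - BASE).days}

features = date_features("2021-09-08T10:00:00")
```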
A
So
it
sounds
like
this
is
some
work
that
would
be
needed
to
be
done,
while
you
know
doing
the
formal
proposal
of
the
date
to
make
sure
that
it
covers
yeah.
Many
of
the
uses
that
the
community
needs
yeah
one
will
say
one
more
question
from
the
community.
D: Zoom was being temporarily grumpy, so I do not have as much detail as the previous presenter; I just have a couple of relatively quick suggestions. I'm Adam Pocock; I'm in the machine learning research group in Oracle Labs, and we've been working with ONNX more recently from Java, as you might expect at Oracle. These are some suggestions based on the difficulties we've had working with ONNX — emitting ONNX models and interacting with them — on the Java platform, rather than from Python.
D: The first one: the ONNX core project has a lot of functionality in there — like the optimizer package and various other bits — lots of C++ code which just has Python endpoints. It's all wrapped in Python, and it's not clear if the C++ is a valid target for binding. There are lots of things in the ONNX project, and all of them are very useful.
D: I'm focusing on the utilities because they're the bit I care about at the moment, but ONNX does appear to be spreading out across the ML ecosystem.
D: We're going to need to interact with it in languages other than Python. Principally, we would like the model checker to be visible in languages that are not Python, so that I don't have to shell out to Python as part of my unit tests, or ensure there's a valid Python environment on my system when I'm developing something or when I'm trying to deploy it. And we had a use case with ONNX Runtime — which some of you might know about — where the model checker would reject models, but ORT would occasionally segfault when it consumed them, due to various issues in the way it was parsing them.
D: So this model-checking functionality seems to me to be core functionality that would be very useful to expose across other languages. There are also some things about modifying operations, like upgrading between opsets — some of those are offline operations you might be okay using Python for — but the model checking seems very useful, and whilst I would like to get it in Java, I do not expect everyone to write a Java API.
D: That's too much. But a C API is something that most languages can easily bind to without a lot of user code. Java has — or is getting — an automatic system for binding to C APIs and running with them; there are things like SWIG, and lots of other languages have FFIs that let you automatically bind to libraries.
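As an illustration of why a C entry point is the easy target: most languages can call a C function given only its symbol name and signature. The Python `ctypes` sketch below binds `strlen` from libc; Java's foreign-function interface and C#'s P/Invoke work the same way, which is what a hypothetical `onnx_check_model` C function would enable. (That function name is an assumption for illustration — no such entry point exists in the ONNX project today.)

```python
import ctypes
import ctypes.util

# Load the C standard library and describe strlen's C signature;
# a symbol name plus argument/return types is all an FFI needs --
# no C++ name mangling or ABI concerns.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"onnx")  # 4
```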
D
If
you
have
a
c
header,
they
don't
necessarily
cope
as
well
with
c
plus
plus
apis
for
various
reasons,
and
also
it's
not
clear
if
the
c
plus
plus
api
of
any
of
these
tools
is
considered
to
be
an
appropriate
target
for
binding,
which
I
think
is
is
part
of
the
real
problem
that
I
myself
have
with
this.
It's
part
part
of
the
real
thing.
D: I think the situation is similar for ML.NET — I don't think they have access to the checker. I am not as familiar with ML.NET as some of the people on this call will be, so I don't want to speak for them too much, but in general I don't think it's that easy to get hold of this functionality from other platforms. As I said, C APIs are better at interop than C++ APIs, and you can also
D
Bind
them
if
that's
right,
writing
that
will
require
some
effort
and
some
design
effort
and
some
thought
and
construction
and
maintenance
burden
right,
which
might
be
too
much
if,
if
that
is
too
much,
then
is
it
possible
to
sort
of
denote
which
of
the
c
plus
apis
would
be
stable?
Entry
points
will
be
things
that
we
can
easily
access
and
we're
okay
to
bind
from
other
languages,
because
they're
not
going
to
change
out
from
under
us.
D
So
this
is
it's
particularly
a
point,
because
it's
very
common
to
bind
super
plus
apis
in
python,
using
pyrap
or
pi
bind
or
something
so,
for
example,
tensorflow
does
this
and
linux
runtime
does
this
and
they
get
extra
functionality
because
they
bind
directly
to
the
api
and
get
access
to
all
the
c
plus
internals,
rather
than
going
through
a
single
header
file
that
provides
performs
a
sort
of
a
barrier
that
is
the
coded
two
interface,
so
it
is
sort
of
a
question
of
defining.
What
is
the
useful
coded
two
interfaces?
D
Is
it
just
like
the
whole
thing
and
there's
no
subdivision
between
python
and
c
plus
plus?
It
would
be
useful
if
there
was
a
subdivision.
It
would
be
even
more
useful
if
that
subdivision
was
via
a
c
api
that
everybody
else
could
use
as
well.
But
you
know
each
of
these
have
different
development
costs
and
may
well
not
be
of
interest
to
the
community.
D
So
that's
all
really.
I
have
to
say
about
this
specific
point.
I
have
another
slide,
which
is
about
sort
of
dealing
with
onyx
from
from
other
languages
of
python
as
well.
I
can
roll
directly
into
that,
or
we
can
take
questions
about
this
bit
specifically
if
people
are
interested.
D: I'm already maintaining three Java open-source machine learning projects, so I would be happy to participate — I don't think I could write the whole thing myself. To some extent, I am a Java programmer and I can write some C; I am not a C++ programmer, and I'm especially not familiar with modern C++, so there are some aspects where I just don't have the background, and I do not have the time to get up to speed on that background.
A
D
So
so
I
can,
I
can
talk
to
to
a
wider
people
in
the
company
and
see
if
people
are
interested
right.
It
doesn't
seem
worth
trying
to
galvanize
a
large
effort
on
this
like
unless
there's
any
interest,
and
we
didn't
want
to
try
and
fork
anything
because
that's
not
what
benefits.
D: There we go — so this is about exporting ONNX models, emitting them from languages other than Python. I talk about ML.NET a little bit here; I have not been in direct contact with the developers on this topic — I've just looked through their code — so I do not wish to speak for Microsoft; anyone correct me. But ML.NET and Tribuo, our library, are the two projects I know of that export ONNX models from languages other than Python: C# in the case of ML.NET, and Java
D
In
the
case
of
trivia,
both
of
them
have
a
pile
of
onyx
related
helper
functions.
D
You
can
see
microsoft's
package
for
it
there
and
you
can
see
our
package
underneath
both
our
package,
in
particular,
is
currently
under
active
development
and
is
expanding
to
cover
the
set
of
models
that
we
support,
we're
in
the
process
of
doing
this
is
just
the
way
the
roadmap
overlapped
with
our
development
cycle
and
all
this
functionality
or
much
of
this
functionality,
as
far
as
I
can
tell,
is
also
in
this
onyx
converter,
common
project
used
in
on
xml
tools
and
a
few
other
places.
D: That means there are three different implementations of basically identical functionality. I'm not clear on exactly what the ownership is — I assume it's under the ONNX project.
D: Our Java one, as I said, is still very much under construction, but it seems relatively wasteful to have three projects that all let you interact with ONNX models, generate them appropriately, and try to ensure that their invariants are verified — so that you produce nodes with the appropriate attributes and you don't construct malformed graphs.
D: As I said, I would like the model checker to validate that I'm not producing malformed graphs, but we would also like our code to prevent us from producing malformed graphs in the first place, by ensuring that those are type errors or other kinds of errors.
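The idea of catching a malformed node at construction time rather than at export time can be sketched roughly as follows. The required-attribute table here is illustrative, not the real ONNX schema (though `Cast` does require a `to` attribute), and the builder is a made-up name, not part of any of the projects discussed.

```python
# Hypothetical sketch: make "missing required attribute" an error when
# the node is built, instead of a malformed graph discovered later.
REQUIRED_ATTRS = {
    "Cast": {"to"},                 # Cast must say what type it casts to
    "Concat": {"axis"},             # Concat must say which axis to join on
}

def make_node(op_type, attrs):
    missing = REQUIRED_ATTRS.get(op_type, set()) - attrs.keys()
    if missing:
        raise ValueError(f"{op_type} missing attributes: {sorted(missing)}")
    return {"op_type": op_type, "attrs": attrs}

node = make_node("Cast", {"to": 1})   # fine
# make_node("Cast", {})               # would raise ValueError
```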
D
So
it
seems
very
strange
to
me
that
there
are
three
different
implementations,
none
of
which
share
any
code,
and
so
each
one
of
them
could
have
bugs
in
how
it
exports
onyx
models
when
we
fix
any
one
of
those
bugs
nobody
else
benefits
apart
from
the
project
that
depends
upon
that,
and
that
seems
to
be
a
waste
of
effort.
D: In my opinion — admittedly, the Python one, I think, is far and away the most used, so it sees the most development effort and probably has the fewest bugs. Ours is still under development, and I'm not clear on exactly the status of the ML.NET one. But a lot of this stuff — the onnxconverter-common project, for instance — uses a bunch of things from the ONNX project, the main one.
D: If we had a sort of common API that everyone was willing to use, that we could bind in C, then we could all use that same API. That API could end up being part of the spec — the approved way of generating nodes and graphs — and that means there's only one place to validate them.
D: That seems like it would be beneficial. There would definitely be work in migrating all the different use cases on top of this common API, but then we'd all be working on a common API, which would hopefully let us share strength across that and make it easier, and it would help other projects which are trying to write ONNX models outside of Java or C#, or even outside of ML.NET and Tribuo.
D: Our package is certainly reasonably tightly bound into how Tribuo views the world; ML.NET's is also relatively tightly bound to how ML.NET views the world; and the onnxconverter-common one is kind of bound into how scikit-learn views the world, at least a little bit. So it might be better to have an implementation that we could share across all of them, and then other machine-learning libraries on other platforms, or other packages with other paradigms, would have a common language with which to emit ONNX models.
B: Hey Adam — from what you said, am I hearing right that you think onnxconverter-common might have the core set of functionality that you'd want?
D: I think it probably does. I am not especially familiar with it — I only sort of hit it every so often. As part of the work we're doing to add ONNX export functionality, I'm basically looking at how ONNX models are exported via onnxmltools or ML.NET, because the documentation in the ONNX project is not quite specific enough for me to quite understand how it's used.
D: Particularly, I've been looking at the decision-tree stuff recently, and I feel like that could probably do with some wording clarifications, because I really had to go and look at the Python code to try and figure out what exactly it was doing and how everything was bound together — possibly because decision trees are such an alien concept to the sort of tensor-based view of the world that ONNX generally has. But I think a lot of that functionality does exist in there.
D: We all need that, because ONNX requires that names are unique, and there's sort of topology stuff in there about managing different graphs if you're patching graphs together.
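The unique-naming invariant just mentioned is exactly the kind of small utility each emitter ends up re-implementing. A minimal sketch — the class and method names are made up here, not taken from any of the projects discussed:

```python
class NameScope:
    """Hands out names that are guaranteed unique within one graph,
    the invariant ONNX imposes on node and value names."""

    def __init__(self):
        self._used = set()

    def fresh(self, base: str) -> str:
        """Return base unchanged if unused, else base_1, base_2, ..."""
        name, i = base, 0
        while name in self._used:
            i += 1
            name = f"{base}_{i}"
        self._used.add(name)
        return name

scope = NameScope()
first = scope.fresh("MatMul")    # "MatMul"
second = scope.fresh("MatMul")   # "MatMul_1"
```

When graphs are patched together — as in the ensemble case discussed next — each subgraph's names get re-issued through one shared scope, which is the "topology stuff" the speaker is referring to.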
D: That's something that, again, we all need — especially if you're dealing with ensemble models, particularly arbitrary ensemble models, where you just have a vote on top of a variety of different classifiers.
D: So I think a lot of it is there; it's just a matter of getting it exposed in the right way to other places, so it's easily consumable from other projects. I mean, ML.NET is going to have its converter, and we will have our Java ONNX converter, because any effort here will take longer than our release cycle for when we want to have ONNX support. But I think if we all start to work together, it might be beneficial.
D: Yeah — but I misspoke earlier; there's some other stuff that ML.NET uses from ONNX, which is in C and has some sort of wrappers around the protobuf generation. There's also a helper.py in the ONNX project which sort of aids with the generation of the protobufs, and which overlaps with the things in the other languages, and then there's onnxconverter-common, which has a sort of Python-y view of it. But all three of these packages have something that does scoping and something that does naming.
B: If we want it to be kind of a common tool that is used by different folks, I think it probably belongs under the Architecture/Infrastructure SIG — just like, you know, the model checker and some of the other commonly used tools.
A
So
I
don't
have,
I
know
I
know
there
are
folks
that
do
on
java
and
and
are
interested
in
that
so
not
here
in
the
corner,
but
maybe
I
can
put
them
in
touch
with
you.