From YouTube: ONNX Training working group meeting on July 23, 2019
Description
Recording of ONNX training working group meeting on Webex from July 23, 2019.
For auto-differentiation, you define an auto-differentiation function for a set of operators, and the output of that auto-differentiation should itself be a composition of operators from the same set, so that you can apply auto-differentiation recursively. But for some operators it will be hard to derive. I mean, it's doable, because we can write the C++ code to implement it.
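The closure property described here, where each operator's gradient rule emits expressions built from the same operator set so differentiation can recurse (for example, for gradients of gradients), can be sketched in plain Python. The tiny expression type and op names ("Add", "Mul", "Const", "Var") are illustrative only, not part of any ONNX proposal.

```python
# Minimal sketch: each op's gradient rule emits expressions built from the
# SAME op set, so autodiff can be applied recursively.

def grad(expr, wrt):
    """Symbolic derivative of expr with respect to variable name `wrt`."""
    op = expr[0]
    if op == "Var":
        return ("Const", 1.0) if expr[1] == wrt else ("Const", 0.0)
    if op == "Const":
        return ("Const", 0.0)
    if op == "Add":                      # d(a+b) = da + db
        return ("Add", grad(expr[1], wrt), grad(expr[2], wrt))
    if op == "Mul":                      # d(a*b) = da*b + a*db  (product rule)
        a, b = expr[1], expr[2]
        return ("Add", ("Mul", grad(a, wrt), b), ("Mul", a, grad(b, wrt)))
    raise ValueError(f"no gradient rule for {op}")

def evaluate(expr, env):
    op = expr[0]
    if op == "Var":
        return env[expr[1]]
    if op == "Const":
        return expr[1]
    if op == "Add":
        return evaluate(expr[1], env) + evaluate(expr[2], env)
    if op == "Mul":
        return evaluate(expr[1], env) * evaluate(expr[2], env)

# f(x) = x * x; f'(x) = 2x; f''(x) = 2. The second grad() call works only
# because grad() returns expressions in the same op set.
x = ("Var", "x")
f = ("Mul", x, x)
df = grad(f, "x")
ddf = grad(df, "x")
print(evaluate(df, {"x": 3.0}))   # 6.0
print(evaluate(ddf, {"x": 3.0}))  # 2.0
```

This is exactly why hand-written C++ gradients for the hard operators break the recursion: their derivatives are no longer expressed in the operator set that `grad` understands.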
But do we... I mean, I'm okay with having control-flow operators, but I feel it will be easier if we just put everything into a single Gradient operator.
Remember, I still have that simple linear model using the Adagrad optimizer. I have that model. Actually, if you want, I can show you that I can use it in TensorFlow to continue training, or for transfer learning, right, so I think we are good on that. Having this constraint, and the use of gradients of gradients, that's related to GANs. That's why I'm asking: do you have a real GAN model? We want to try this out in order to prove our design is good enough. That is, for some sort of GAN.
Okay, so that's how... I think PyTorch itself... oh yeah, I told you, okay, so PyTorch does have the gradient operator. This is its gradient operator, and you can see the loss penalties. First it computes this guy; okay, here you get your gradient, and you compute the penalty, the penalties, here.
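For context on "first get your gradient, then compute the penalty": a gradient penalty of the WGAN-GP kind takes an already-computed gradient and turns its norm into a loss term. This is a generic sketch of only that second step; the function name and the default lambda value are made up for illustration.

```python
import math

def gradient_penalty(grad_vec, lam=10.0):
    """Penalize the gradient's L2 norm for deviating from 1 (WGAN-GP style).
    `grad_vec` is a gradient that some autodiff step has already produced."""
    norm = math.sqrt(sum(g * g for g in grad_vec))
    return lam * (norm - 1.0) ** 2

# A gradient of norm exactly 1 incurs no penalty; larger norms do.
print(gradient_penalty([1.0, 0.0]))   # 0.0   (norm == 1)
print(gradient_penalty([3.0, 4.0]))   # 160.0 (norm == 5 -> 10*(5-1)^2)
```

This is why GANs are the stress test for the proposal: the penalty is itself differentiated during training, which requires gradients of gradients.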
So my position, to me, the lower bound would be this: under the current proposal, even with the constraint given, we should be able to support everything we can do using the previous proposal. That is, if the current proposal also satisfies that need, I will be happy to go with my current proposal, with that one extra constraint.
Those are always the things I like to see. That's why I took this proposal and created some kind of, you know, tooling to make it run in TensorFlow, but we don't have the other direction yet. That means: from TensorFlow, how can we generate this sort of gradients and the other things we need? For that, I have reached out to the tensorflow-onnx community, which is led by [inaudible]. He said his team will take a look. So, to me, that would make this a complete, like, version.
I would say we already have two pull requests. Like, I did... I just, you know, put it on my computer: I could generate this training info and the function and so on, and then I can make use of it. Similarly, you know, for the other team, they should be able to take this down and try to create that training info, and gradient operators and so on, from their codebases. That's the thing I'm hoping, you know, would happen sooner rather than later.
I save that, say, in the ONNX format, including that training info. So this is a very simple linear model, okay? I run that in PyTorch first, right, with Adagrad. Again, I print the results here. Later on I run the same model in TensorFlow, okay, I print the results here, okay, around 50 times, and then I save that in the ONNX format. Okay.
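To see what the demo has to round-trip for the PyTorch and TensorFlow runs to print the same results, here is the Adagrad update rule in plain Python: both the learning rate and the per-parameter accumulator state must survive the PyTorch → ONNX → TensorFlow hand-off. The function name and values are illustrative, not from the demo's code.

```python
# Plain-Python sketch of the Adagrad update used by the demo's linear model.

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One Adagrad update: accumulate squared gradients, scale the step."""
    new_params, new_accum = [], []
    for p, g, a in zip(params, grads, accum):
        a = a + g * g                        # running sum of squared gradients
        p = p - lr * g / ((a ** 0.5) + eps)  # per-parameter adaptive step
        new_params.append(p)
        new_accum.append(a)
    return new_params, new_accum

params, accum = [1.0], [0.0]
for _ in range(3):                           # pretend gradient of 2.0 each step
    params, accum = adagrad_step(params, [2.0], accum)
print(params, accum)
```

Because the accumulator grows monotonically, a converter that drops it and restarts from zero would take larger steps than the original framework and diverge from the printed results.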
Okay, once I have this, I go back to my TensorFlow side, and that's the TF... I have a program here.
So I had two different models in here, because with the introduction of the function node to capture the graph, my current converter, you know, broke, right, because I don't have a handler for that. So I can read the information in, but I cannot produce it. So, to explain that, I have a view here. So that's the current, maybe maximized, view, right: I have the inference function, and it doesn't have any real operators, right, versus the way I have here, where you can see I have the real, you know, transpose and multiplication operators, okay. In order to have the training going, I need this graph. Of course, with this graph, I need to handle this function and go down to the next level, right, but that's why, for now, I just have the two models. But you can imagine, right, that the inference model on the right-hand side is exactly what we have inside of this function.
That's a placeholder, okay. So later on I just use the Adagrad optimizer in TensorFlow and use that learning rate, you know, from the model, okay. Later on I print it out before the additional training, to make sure it's the same as before, and then another one after, right. So if I run this, all right, you will see the initial result without additional training... oh, I'll have it, because...
All right, I actually saved the model in a place where I can again use TensorBoard to look at the graph, okay. So that also proves, you know, we can continue to use this model in TensorFlow, right. So that's sort of my very simple prototype: to take this linear model with Adagrad from PyTorch, to ONNX, to TensorFlow.
Okay, that's why I was sort of asking: if you want to do GANs, do you have a model in PyTorch, right? Maybe we can convert that to ONNX, and I can see if, you know, we can do similar things in TensorFlow, right. So, based on my code, I have this sort of summary, right. The first thing is: right now we have a node with, you know, a type like this, and certainly that's something we need to handle. Not we, but all the converters need to handle that from now on.
Yeah, there might be some optional inputs for the iteration thing, okay. So for this particular node, of course, we need to know the loss function, we need to know the optimizer, the gradient... right now I only use the name, because, actually, I can show you in the code: in TensorFlow I just need to set...
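The pieces this node has to carry, per the discussion (loss function, optimizer, learning rate, trained tensors), can be mocked up as a plain dict. The field names below are made up for illustration; the real schema is whatever the training-info proposal's PR defines, and the point is that the TF prototype described here only needs the optimizer's name, not its internals.

```python
# Illustrative stand-in for the proposal's training pseudo-node metadata.
# Field names are hypothetical, not the proposal's actual schema.

training_info = {
    "loss": "MSE",                   # which loss function to reconstruct
    "optimizer": "Adagrad",          # the TF prototype only needs this NAME
    "learning_rate": 0.1,            # read from the model, fed to the optimizer
    "trained_tensors": ["W", "B"],   # parameters the gradients flow into
}

def optimizer_name(info):
    """A converter like the TF prototype can start from just the name."""
    return info["optimizer"]

print(optimizer_name(training_info))  # Adagrad
```

A name-only contract keeps the backward pass inside the target framework, which matches the remark below that the prototype never controls the backward pass directly.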
Once you start talking about the backward pass, that's something I'm not sure I have control of in TensorFlow, right. Currently I'm just using the standard APIs for optimizers; for instance, right, I pass in the list of variables for training purposes, but I do not have any control over the backward pass.
I still don't know, actually, how to make use of this binding, but anyway, because of the new weight, the new, you know, state... anyway, I don't find APIs for that, so I have this marked "investigation needed". I already covered most of it. So that's the real work once we introduce this training info proposal: if we, you know, merge that PR, then from the converter side, right, we need to handle this pseudo-node, and then we determine whether to create constants in the inference graph versus variables in the training graph, right.
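The converter-side decision just described, where the same initializers become constants in an inference graph but trainable variables in a training graph, can be sketched as a single branch. The tiny "graph" of `(kind, name)` tuples is a stand-in for real TF nodes; the function name is hypothetical.

```python
# Sketch of the converter's constant-vs-variable decision when lowering a
# model that carries the training pseudo-node. (kind, name) tuples stand in
# for real framework nodes.

def lower_initializers(initializers, for_training):
    """Emit each initializer as a trainable variable (training graph)
    or a frozen constant (inference graph)."""
    graph = []
    for name in initializers:
        kind = "variable" if for_training else "constant"
        graph.append((kind, name))
    return graph

print(lower_initializers(["W", "B"], for_training=False))
print(lower_initializers(["W", "B"], for_training=True))
```

The same ONNX file thus yields two different lowerings, which is why this shows up as a converter work item rather than a spec change.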
This is additional handling for training purposes, right. Okay, that's fine. I'm just saying there are some work items. I will present this to our converters SIG as well, because those are all the work items introduced if we're going to support training. Okay, where to apply the weight update, by the way, that's the same question I had earlier. I'm not sure; maybe in certain frameworks you can apply it in certain ways, right. Okay, so that's sort of my current status.
This is a reasonable design. We have some work to be done in order to really support it, right, and hopefully the other direction, like I said earlier here, right, this can be evaluated, right. Then we'd feel much more comfortable going forward with this proposal. Okay, any comments on this, or additional work you want us to conduct?
Maybe we can come back to this: next month, at the workshop, what do we want to present? Maybe our goal is to present this proposal with some sort of evidence that it works, you know, with other components in the ecosystem, such as converters. Is that our goal? Would that be a reasonable one?