From YouTube: Apache TVM - µTVM Community Meeting, July 21 2021
A: All right, welcome everyone to this July 21st edition of the microTVM community meeting. Welcome, and thanks for being able to make it. We have three items on the agenda today, and thanks to all of you who took part on the Discuss forum to pull the agenda together. Roughly, the three things are: an Arm Ethos-U RFC discussion that Manupa would like to have, a microTVM CI discussion that Gustavo would like to have, and a unified static memory planning discussion that Michael would like to have. Michael had indicated that he has a conflict and is going to be a little late to the meeting, so hopefully he's able to join later; we'll do our best to hold off on that item until later in the meeting.
A: Going once, going twice. All right, I don't hear anything, so with that we'll open with the usual things that we do from week to week, the first of which is introductions. If there's anybody new to the meeting who would like to introduce themselves, now's the time to do so. Anybody who would like to introduce themselves, go ahead.
B: Okay, hi everyone. My name is Jose. I'm new to this meeting; I've posted some posts on the Discuss forums about an embedded device which also hosts an accelerator that people at my university are developing.
B: I will start a PhD there on October 1st. I'm working at KU Leuven in Belgium, in the MICAS department, which does everything with microchips: we make everything from analog very-high-frequency amplifiers to AI accelerators.
B: There is some very exciting research going on, but the problem is that we don't really have any compiler for these devices right now; there's only a kind of ad hoc Python script that gets rewritten every time a new chip is developed. So my job would be to find some more general approach, so that we can reuse more of the same thing across those chips and don't have to start from scratch every time we make a new chip.
B: So that's kind of what I'm doing here. I will be following along a little more passively, because I'm actually on vacation, so I don't have that much input, but I was a bit curious to see what this meeting is all about. So that's something about me.

A: Okay, thank you. Welcome.
A: Okay, well, let's keep moving. I don't think we've got any announcements or news, but I'll pause just in case anybody has anything they would like to point out. Going once.
C: Going twice... actually, I guess I'll quickly put in a final plug, if anyone...
E: Yeah, okay, I can see it now. So we have finally come around to upstreaming the Ethos-U support that we have been working on for several months. We discussed this at the last TVM Conference: enabling Ethos-U codegen support in TVM.
E: So I put up an RFC here. This has been a group effort, building on the work of getting microTVM going and all the other development going on around it, to get the Ethos-U compiling and tested through TVM.
E: Just to say briefly: the Ethos-U55 is an NPU that is designed to work with Cortex-M in an embedded environment, not necessarily bare metal, but it works in bare-metal environments as well. That's what the NPU is all about. We want to enable compilation through the TVM flow, in an ahead-of-time fashion, to use the NPU to run machine learning models.
E: That's basically the scope of the RFC, so I thought I'd just go through the guide-level explanation and maybe touch on an overview of the compilation flow. There's this video as well; the flow is a bit different from what we described in the video, and it's more organized now than what I presented at the last TVM Conference, but this is how it looks today. Just to mention the tvmc interface: we are hoping users will be able to use it to compile a machine learning model for the Ethos-U and Cortex-M together. That should be the interface, and the way we have integrated the codegen is via the BYOC route. Interestingly, though, this one has a full pipeline to get it all the way down to the C runtime modules we generate at the end, so it goes through different stages. First, we partition the operators that the compilation pipeline supports out of the Relay graph, so that the other operators can go through the default pipeline.
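As a rough illustration of the partitioning step described here, below is a minimal sketch using the standard Relay BYOC passes. The pattern table and the "ethos-u" target string are placeholders; the actual Ethos-U integration in the RFC may differ.

```python
import tvm
from tvm import relay

def partition_for_accelerator(mod, pattern_table):
    """Split supported operators into external functions for the codegen."""
    seq = tvm.transform.Sequential([
        # Fuse sequences of operators the hardware supports into composite functions.
        relay.transform.MergeComposite(pattern_table),
        # Mark the composites for the external codegen ("ethos-u" is illustrative).
        relay.transform.AnnotateTarget("ethos-u"),
        relay.transform.MergeCompilerRegions(),
        # Extract annotated regions into external functions; everything else
        # continues through the default TVM pipeline.
        relay.transform.PartitionGraph(),
    ])
    return seq(mod)
```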
E: TVM is told which operators we say we support, which matches the constraints of what we can offload to our compilation pipeline, and the result contains external functions holding subgraphs of the supported operators. The first step we do with them is to legalize them to a subset of hardware-primitive Relay operators.
E: This abstraction is important, because this set of operators describes exactly what the hardware can support. That doesn't mean it cannot support most of the other operators out there, because we can legalize them: for example, dense, or fully-connected, operators can be legalized to a convolution-2D operator. That's one simple example, but those sorts of legalizations happen here.
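To make the dense-to-conv2d example concrete, here is a hedged sketch of that kind of legalization written as a Relay rewrite. The class name is hypothetical and this is not the actual pass in the Ethos-U codegen; it only illustrates the idea.

```python
import tvm
from tvm import relay

class LegalizeDenseToConv2D(relay.ExprMutator):
    """Illustrative rewrite: nn.dense -> 1x1 nn.conv2d (NHWC/HWIO)."""
    def visit_call(self, call):
        call = super().visit_call(call)
        if isinstance(call.op, tvm.ir.Op) and call.op.name == "nn.dense":
            data, weight = call.args
            # Assumes InferType has run, so weight has a checked type.
            units, in_feats = [int(d) for d in weight.checked_type.shape]
            # Each input row becomes a 1x1 "image" with in_feats channels.
            data4d = relay.reshape(data, (-1, 1, 1, in_feats))
            # (units, in_feats) weight -> (1, 1, in_feats, units) kernel.
            kernel = relay.reshape(relay.transpose(weight), (1, 1, in_feats, units))
            conv = relay.nn.conv2d(data4d, kernel, kernel_size=(1, 1), channels=units,
                                   data_layout="NHWC", kernel_layout="HWIO")
            return relay.reshape(conv, (-1, units))
        return call
```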
E: We also get help from the Vela compiler, which is focused on TFLite but has APIs that we intend to use to encode the constant artifacts: the bias and scale need a special encoding for the hardware to interpret them, and the weights need a different encoding for the hardware to interpret them. After that we use those artifacts. The bias and scale can be converted in Vela directly, but each of these hardware-primitive operations will then have a TE associated with it.
E: That TE faithfully represents what the operator does, and we run a certain set of TE and TIR passes to optimize for memory and performance; it's more of a trade-off there. In that process some operators might get tiled, different scheduling decisions get taken here, and that tiling affects how the weights get encoded, so the weight encoding happens in this phase. It should end up producing a TIR PrimFunc corresponding to what...
E: ...could then be lowered to the command stream, the binary artifact that we use with the driver to invoke the inference; that is produced at the end of the compilation pipeline. So that's how the compilation works, in a very brief nutshell. This ties in with the other work we have been doing on the AoT front, the C APIs Chris has been working on, and the interface APIs...
E: ...we have been designing. This is delivered as a package, which Andrew and others helped create, called the Model Library Format; it's the package we want to distribute to work in any embedded environment. That's what's written here.
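For reference, this is roughly how a build would be packaged into the Model Library Format mentioned here: a minimal sketch assuming a generic host micro target rather than the Ethos-U-specific target options.

```python
import tvm
from tvm import relay
import tvm.micro

def build_mlf(mod, params, out_path="module.tar"):
    # "host" is a stand-in; a real deployment would use a Cortex-M target.
    target = tvm.target.target.micro("host")
    with tvm.transform.PassContext(opt_level=3, config={"tir.disable_vectorize": True}):
        factory = relay.build(mod, target=target, params=params)
    # Package generated code, graph, params, and metadata for use in any
    # embedded project.
    tvm.micro.export_model_library_format(factory, out_path)
    return out_path
```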
E: One of the questions, and I thought an interesting observation, is why, unlike the other BYOC flows, we lower down to TIR. One reason is what we call cascading-style performance optimizations: the optimizations we do to optimize memory and performance together in the TIR domain. The other is that we want the TIR expressed out for the unified static memory planner, so it can plan across the CPU and NPU tensors together.
E: So that is the reason we have this abstraction layer in the middle. That's basically, at an overview level, how the compilation looks, and I don't intend to go into too much detail; I've just put in a code snippet to show how the transformation looks. But I think I'll stop here for questions.
C: I have a question: can you explain a little bit more about the Ethos-U TE? Is that just standard TVM TE statements that you're using, or did you extend the TE statement vocabulary?
E: No, we didn't extend the IR; it's just our own operator implementations that we are defining with it.
C: So you are making new TE operators that are hardware-specific.
C: Yeah. When I was rereading this, I think we discussed this before, but I was curious whether you had a little bit more to share about what the motivation was for moving things out of the standard TE operators and creating basically a parallel set of operators for Ethos-U.
C: I don't know if you have anything more to share on what the main motivating reason was for deciding to create something separate.
E: TE and TIR go really low-level, like what you see in an IR, so we just needed an abstraction that could work with the passes to compile down to the command stream, which has a subset of operations that can run primitively. For example, convolution 2D and depthwise convolution 2D are primitively supported in the command stream that's generated. That's the basic reason: we want something to represent that, and the particular TE and TIR passes we run mostly work at that level.
C: The parameters that you're adding... I could be completely off base here, but I think you have added the bias and scale parameters to ordinary operators that don't necessarily have them, like a conv2d or other operators that aren't typically associated with that kind of input. But I could have misunderstood this.
C: I think you also added parameters to some of the operators, if I remember correctly. I could have misunderstood this, but for some reason that was my recollection, although it's kind of fuzzy right now; I'm not sure if that was true or if I'm just off.
E: Yes. For example, the convolution-2D primitive operator supports the biases as an input to it, so it's primitive in that way. That's an example I can give right now. The primitiveness of the operators is basically defined closely by what the hardware supports.
E: That can involve groupings of what TVM supports, and probably partial breakdowns of what it supports; it's not a simple one-to-one relationship. That is handled in the graph partitioning: that's where we do some slightly complicated pattern matching to identify what can be lowered to such Relay operators.
C: Yeah, and I know from the OctoML side, having chatted with Jared a little bit, one thing that's happened as these efforts have marched along in parallel is that there's been a significant effort spun up here to take a good look at the compilation pipeline, particularly from the scheduling part onward. So one thing would be: it would be great to make sure we can get some input from those guys doing...
C: ...basically the TE compiler work; that's what we're calling it. I suspect this will all slot in together quite nicely, but I think one thing they're focusing on is how to come to a unified lowering pass from TE into TIR, and I think unified mostly means across the graph executor, the AoT executor, and the VM executor.
C: So I think there's room for BYOC flows to remain and do some lowering, but I think it would be good to make sure all those plans align. That's one thing I've been keeping track of over the last week or two.
C: Yeah, I've directed them to comment a bit on your work, but I think everyone's a little bit busy, so I'll try to keep at them a bit.
G: The operators which will be emitted, and which will be in the generated libN.c, for instance: will they execute the commands for the Ethos-U55, or will a driver be necessary on the Zephyr side to dispatch the commands to the Ethos-U55 unit? How does it work?
E: We are currently working on defining an abstraction for it, a launch call that could be specialized to different targets. But I'm not sure; Chris might want to add to that.
G: So it will execute calls to functions which will use the drivers to dispatch the commands to the NPU?
F: Just to say, yeah, we're currently thinking about how exactly to wire that up. I've been talking a little bit with Andrew about how to pass devices down from the interface level into the operator flow, so it ends up in the backend. So if you have any thoughts, it would be worth sharing them.
C: Yeah. In particular, the question is that when you invoke the Ethos-U driver, there may be some accelerator-specific context that you want to pass along.
C: That's typically thought of as a void*, but since we want to allow the application to handle device initialization, and at least delegate to the application the responsibility of calling the library code to populate that driver context (that's what I would think of as initialization), there's some question as to how we define an interface that allows applications to pass that context in to the top-level...
C: ...AoT executor function. Basically, if you call the runtime's run function, that should take a context for each different accelerator used in the computation, and that should somehow wind its way all the way down to here. So we're still thinking about how to do that correctly. But yes, we should probably post something up at some point... or, I'm a little bit behind on reading things, so I'm not sure if you've posted anything more on this, Chris.
F: Yeah, nothing new posted as yet.
E: Yeah, just have a read through it and feel free to comment on it, and we can discuss it in the pull request.
E: If you don't have any other questions, we can move on. Tom, maybe?
A: Okay, well, thanks Manupa for leading us through this, and we appreciate you taking the time to present to us today. All right, next on the agenda is, I believe, Gustavo.
G: Right. So this is more of a sync-up on what the OctoML folks are doing; Andrew may have some news about the microTVM CI. The intent is to try to avoid duplicated work: we've been chasing a microTVM CI here focused on pull-request tests, and Theodore is also working on that.
G: I understand that the OctoML folks are doing kind of the same thing, but for a nightly build, and using a VM, but I've kind of lost track of it.
C: Yeah, I can definitely give some more background on this. Just to add some context, I don't want to make it sound like we're putting a ton of work into this; we're doing what you could view as the minimum possible thing...
C: ...we could do to get automated tests running nightly right now. It's just me and Mehrdad working on it right now, so there's not really a large effort or anything spun up. What we're doing right now is: we have a server with some attached hardware, and we're basically building a small...
C: ...I think it's a Python script, to track hardware reservations, and then we attach that to a Jenkins instance that basically reserves some hardware, launches the microTVM reference VM, and then uses the VM to drive some performance and functional regressions on a set of models. Nothing super complicated, but hopefully just complicated enough to be useful.
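A minimal sketch of what such a hardware-reservation script could look like, assuming simple per-board lock files; the paths and board name here are illustrative, not the actual OctoML tooling.

```python
import contextlib
import fcntl
import os

LOCK_DIR = "/var/lock/microtvm-boards"  # illustrative path

@contextlib.contextmanager
def reserve_board(board_id: str):
    """Block until the named board is free, then hold it for this job."""
    os.makedirs(LOCK_DIR, exist_ok=True)
    path = os.path.join(LOCK_DIR, f"{board_id}.lock")
    with open(path, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # serialize access across Jenkins jobs
        try:
            yield board_id
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

# Example: a nightly job reserving a board before launching the reference VM.
# with reserve_board("nucleo-f746zg-0"):
#     run_nightly_regression()
```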
C: This is not a fast process, though, since we have to build, or rather instantiate, the reference VM every single time, and we still have yet to pipe-clean the whole thing.
C: One thing that we sometimes see is that when you launch a VM and attach USB devices, for some reason it takes a few rounds of enumerating a device before it really becomes stable across the VM. That's something that's new to me in the last month or two, so we're not really sure what we have to add to...
C: ...ensure, in an automated way, that the USB traffic is flowing properly. What you'll see is...
C: ...failures programming a part that's attached to the VM, and if you retry two or three times, then suddenly it works fine more or less indefinitely. So there are just a bunch of small things like that, where we're hoping to do the minimum amount of automation, just so we don't have too many of those things to work through.
C: I think one thing that would be interesting to discuss here is longer-term plans; I don't know if that's something you wanted to cover, Gustavo. One question is with the open-source TVM CI.
C: One thing we didn't want to do is make it impossible for people who don't have this hardware to iterate on a PR, especially one that's completely unrelated to microTVM. If you're working on a GPU-centric PR and a microTVM test fails, it's like: well, to reproduce this test you need to buy some, hopefully cheap, but still, you have to buy and set up all this equipment and these development boards.
C: Basically, we don't want to do that; at least that's been my take on it so far. The limitation, though, is that if we take that approach, then we aren't really testing anything on real hardware; we're emulating everything. So people have been talking a little about some middle ground, and one thing we could do is run nightly tests on real hardware.
C: That's one thing we're thinking about doing right now with the infrastructure we're building here, but I think there are more interesting things to be done beyond that: for example, tracking performance on maybe a...
C: ...website that the community could inspect. Another suggestion I've heard so far is basically allowing people to supply additional Jenkins jobs that run on their own infrastructure, and that they can launch; maybe a committer could trigger one of these jobs on a pull request. The idea is that if you open a PR, some hardware-in-the-loop regression...
C: ...could contribute an advisory vote and say: hey, we tested this on microTVM hardware and it did fail. That wouldn't necessarily block submission; the committers would have to use discretion to know whether this was a badly breaking change that would really slow down the community, or a change where maybe the regression itself ran into a hiccup. Anyway, those are some thoughts there.
G: Oh yeah, so I just wanted to get a sense of what you and Mehrdad are doing at OctoML; I'm doing the same here with Theodore. So we are partially working on the microTVM CI.
G: We're not fully allocated to it, but we've been trying to get Jenkins running and to experiment with a full architecture, like having a worker attached to a single board and that kind of stuff, to see how it dispatches the pull requests and how fast it goes in comparison to a Docker container, for instance, and also to have the chance to work on those serial issues...
G: ...we've discussed in Discord, like how to identify the boards and how to attach new hardware to the system. That's what we are trying to work on at our side, and in that sense I'd just like to ask whether you think it's still important to do that work on the community side, to see how it goes, or whether it will just overlap with what you're doing at OctoML.
C: I think one thing is that it's probably helpful for both of us to have some form of automated job launching, and the more infrastructure people can share, the better. So I certainly don't think it's wasted or anything like that.
C: I think one limitation of our approach is that if we take this next step of trying to put advisory votes on PRs, our approach is basically to build this VM, and that takes quite a long time.
C: I know that the CI for TVM takes quite a while, but it shouldn't necessarily be our approach to always create a two-hour CI or whatever, so I think there's significant opportunity for speed-ups there, and I don't think we've planned to do any of that work. From the last time you guys were chatting, you were hoping to use the Docker image, which...
C: ...seems to me to have a better chance of working faster, and also potentially in a way that uses the CPUs on the executor nodes much more effectively. So it might be good for us to chat; I'm not sure if it would make sense to have a detailed chat at this meeting...
C: ...although we totally could. Now that we've gone down this path a little bit, maybe we can share what we've learned so far and see if there is overlap, or if there's stuff we can share. For instance, I still think that developing a status page, a page that gives nightly performance stats or nightly functional statuses, might be useful to the community as we're pushing more PRs and all that. So I think there's quite a bit to do above and beyond just having a simple automated runner.
G: Right, yeah, I agree, Andrew; that would be interesting, and we are indeed looking at that as well. Voting on a pull request would be, I think, a next step, but initially getting some performance statistics out would be really awesome.
I: Yeah, Tom, you might know: are we limited in our ability to show performance numbers on different boards for tests like this, generally?
A: If it's a publicly available board, it shouldn't be an issue, but what we like to do is still talk to the manufacturers of the board and just make sure they're generally okay with it.

I: Okay, sure, yeah.
C: It's amazing how easy it is to misconfigure a board slightly and then have the performance suffer. We're relying on Zephyr to do most of the SoC-specific configuration, so we kind of assume that it will put the board in a reasonably good state. As we start moving towards building tuning logs on these devices, which will help us really optimize our performance, one easy way to shoot ourselves in the foot is to misconfigure the devices, so it definitely would be helpful to get some feedback signals about that.
J: I think my only comment is that, as much as we obviously want CI running on boards, and that's valuable, we are also interested in maintaining and expanding the CI we run on models, because that's cheaper and requires less configuration.
J: There are still some questions on what is required from a board so that we can have it consistently in the CI, but for benchmarks and things it would be awesome if we could get some numbers routinely.
G: Cool, yeah, I see. One thing we are really trying to stick to, Andrew, working with Leandro, is using Terraform and Ansible to configure everything. We don't want anything ad hoc; we want something you can easily reproduce anywhere, on any host. That's what we're looking for.
C: Yeah, I think that would be a good way to share things, and to make sure everyone's on the same page as to how things are configured.
C: One of the things I did when I took the VM approach was to try to lock down as much of the software stack as possible between different runtime environments; having that reference VM was my early stab at standardizing the software stack. I think it's really important, whatever we do, if we are going to share infrastructure or have a "here's how to reproduce the TVM performance stats" story...
C: ...so maybe, for next steps, it would make sense for us to chat a little bit more. We can post up and share what we have discussed so far on the Discuss forum, if people are interested in following along; it makes sense for us to form a bit better of a plan for medium-term work on the CI and all that, and perhaps we could share that at a following community meeting.
G: So yeah, regarding the microTVM CI, Tom, that's all I've got to discuss here.
A: Okay, thanks Gustavo, and thanks everyone for the discussion. All right, Michael is here, so I think we're ready for the last item on the agenda, which is the unified static memory planning discussion. Michael, I'll turn things over to you.
K: Thanks. So we had a couple of questions on the RFC that was proposed, I think, by Manupa; we posted some of them online.
K: Okay, let me... yep, there it is, perfect, cool. I'm talking about the one in the new tvm-rfcs repository. So, Manupa, the question we had is this: what is proposed here is an interface where, somehow, the inputs are the buffers plus pool sizes and so on. So the first question we would have is: what is the assumption...
E: Let me... I have a figure for this. Let me see.
E: Yeah, this is actually one of the questions Matthew Bentham, also from Arm, raised in one of the meetings. Just to give context to all of this: this is a snippet from Inception, and these numbers correspond to the execution order which would result in the lowest memory pressure.
E: So, is this what you are after, just to clarify? Currently, the control flow, either the TIR AoT mod main module or the JSON generator, will just use a visitor to create the sequence of calls to the operators. But I think we need to be a bit more careful and generate a different sequence that might result in different memory pressure.
E: In this example, we found that this particular sequence seems better than another approach, simply because we can get rid of intermediate feature maps being live for so long. Is that the feature you are after?
E: Yeah. So what we are saying is that we can use Relay's let bindings. There is already a pass to convert to A-normal form, which creates a sequencing of the Relay operators. Relay, being a fully functional language, doesn't have a concept of sequences, unlike TIR: TIR is both imperative and functional, with both statements and expressions, while Relay has only expressions. The let bindings, though, allow you to create a sequence.
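For readers following along, this is the existing Relay pass referred to here; a minimal sketch showing how A-normal form makes the execution order explicit through let bindings.

```python
import tvm
from tvm import relay

x = relay.var("x", shape=(1, 8))
y = relay.nn.relu(x)
z = relay.add(y, y)
mod = tvm.IRModule.from_expr(relay.Function([x], z))

# After this pass, the body is a chain of let bindings (one per op), so the
# traversal order is explicit rather than implied by the visitor; a
# memory-aware scheduler could then rebind them in a different order.
mod = relay.transform.ToANormalForm()(mod)
print(mod)
```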
E: Right now the sequence is created by the way the visitor traverses, but it doesn't necessarily need to be so. What we are saying is that, if you're just interested in creating the sequence, we can add a scheduler, call it a memory-aware scheduler, which creates the let bindings in a different way that results in different memory pressure. I think in this particular case the memory used within an operator doesn't matter, because it isn't...
E: ...live when you finish the operator; what matters is the boundary tensors of the operators. Therefore we feel this is a pass that could happen in Relay on its own, and at that point we could commit to that schedule. Then we would end up with the TIR PrimFuncs in this particular order.
E: The reason is that we don't want to do scheduling and allocation together, because of the time complexity. That is not to say schedulers cannot tap into the allocator's algorithm if they really want to run the allocator in a loop to get more realistic memory numbers, not just the memory pressure.
E: However, the idea is that when you come to this memory planner, the schedule is committed, so the order is committed. That is the design we are going with, but I'm happy to hear any concerns around it.
E: No, okay. So there can be a case where you need a very comprehensive scheduler which can afford to look into all the buffers and perform the allocation before committing to a schedule. But we think that is part of the schedule-planning activity, which has to take care of it, and it can call into the allocator's API.
E: The ordering is just one problem; there are many. If you go into advanced scheduling, we might want to break these operators up and do a lot of things that fall to the scheduler; it's always going to be a trade-off between performance and memory. So we feel all of that should be taken care of by the scheduler, but that doesn't restrict the scheduler from using any of the APIs of the memory planner if it needs that information to commit to a schedule.
C: We've been thinking about that a bit here as well, and I think the big insight I've had so far is that it's not necessary to couple these two components together, but it's certainly possible for the scheduler to do something like ask the memory planner to compute the total live memory at a particular point, or even use the linearization pass from storage_rewrite, for example, to produce a sequence of memory alloc and free operations, which...
C: ...is kind of how that pass specifies things. You can use that to learn possible orderings at the Relay level, and then propose an ordering and run memory planning based on it, to actually lock things down at the TIR level. That's kind of where I am on my side.
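As a toy illustration of planning within a committed order: given liveness intervals for the boundary tensors under a fixed schedule, a planner can assign non-conflicting offsets in one pool. This greedy sketch only illustrates the idea; it is not the algorithm in the USMP RFC.

```python
def plan_pool(tensors):
    """tensors: list of (name, size, first_use, last_use) under a fixed
    schedule. Returns per-tensor offsets and the resulting pool size."""
    placed = []   # (offset, size, first_use, last_use)
    offsets = {}
    # Place larger tensors first, a common greedy heuristic.
    for name, size, start, end in sorted(tensors, key=lambda t: -t[1]):
        offset = 0
        for o, s, ps, pe in sorted(placed):
            conflict = not (end < ps or pe < start)  # live at the same time?
            if conflict and offset + size > o:
                offset = max(offset, o + s)  # skip past the conflicting buffer
        placed.append((offset, size, start, end))
        offsets[name] = offset
    pool_size = max(o + s for o, s, _, _ in placed)
    return offsets, pool_size

# Example: "a" and "b" never overlap in time, so they share offset 0, while
# "c" overlaps both and gets placed above them.
offs, total = plan_pool([("a", 1024, 0, 2), ("b", 1024, 3, 5), ("c", 512, 1, 4)])
```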
C: Now, the one thing that's maybe a little bit different is, I think, this question of how we handle operator workspace buffers. One thing we've talked about a bit before was lifting, or hoisting, the workspace allocations out to the top-level function, so that even the workspace allocations are considered parameters to each TIR PrimFunc. I'm not sure if you were still planning on doing that, Manupa, or if that was...
E: No, no, I think... I mean, for better readability of the IR and the compilation down the line, if you feel it should... Having the allocates associated with the PrimFunc kind of helps us when you're targeting multiple backends, as the PrimFuncs are getting lowered. But still, this kind of overlaying of buffers for liveness conflicts could be done without really needing to mutate the IR.
E: The scheduler is a bit tricky. What we are saying is that one could have a scheduler, a set of passes that works on the TIR, which may or may not need to access the allocator to determine that. We are keeping the window open so that one could design schedulers for particular hardware, or ones that incorporate whatever needs that allocation information.
E: A universal scheduler is not on our roadmap, but we are just keeping the possibility open for that.
K: Right, absolutely. So I'm trying to approach this from a user's perspective. What I want to do is find the best possible deployment for my new fancy neural network that our AI guys came up with: where would I plug in? Currently, from my point of view, that was the universal memory planner. So now you're saying... I mean, from the outside, I would actually see a need for an interface around it.
E: I would still keep that decoupled. But yeah, to correct something: this is the unified memory planner; it's not meant to be a universal memory planner. By unified, I mean that all the memory, the workspaces and the buffers that are globally scoped, can be planned together. That's the goal of it. So you're kind of thinking of it the other way around.
E: As for what to do about scheduling: if the scheduling needs to be memory-aware, which is the case with us, we are doing the scheduling memory-aware, but that also comes with the fact that we need to take performance into account as well. Once you go after performance, the scheduling and memory allocation become intertwined, and so the design we are proposing is to let the scheduler handle that complexity...
E: ...if it is affordable, and to decide there. So I'm more or less saying there needs to be a component, a scheduler, that does the ordering in the way the backend wants it, but it's free to access the memory planner's interfaces to do that. I think both of them are representable in Relay, and from there the outcome goes down to TIR.
C: Yeah, and I think there's additionally some overlap between this and the auto-TIR, or meta-scheduler, plans, which would allow auto-tuning to basically tune at the TIR level, if that makes sense; there's the question of where that folds in and how it relates to any such scheduler. So I totally agree with you, Michael.
C: I think that is kind of a gap in the generic or general story right now. One of the reasons we haven't released a lot on that, at least one of the reasons I haven't commented as much on this lately, is that there are still some open questions in my mind about how all the pieces fit together, and so I think there's some opportunity for us to release, or propose, some more designs here. So I'm definitely very interested if you guys have thoughts.
C: I agree with what you're saying: there isn't a scheduler piece right now, and the schedulers have to be kind of hardware-aware, so there's probably some room for some sort of interface as well. But I don't know that anything in the current RFCs explicitly addresses this at a general level right now.
E: Yeah, I mean, that kind of sounds to me like a scheduler. It depends on the degree of freedom that scheduler can take, but I would still view it as a scheduler, one which can run before allocation of the memories. I think it could be an extension of what we are doing right now, after we have something working.
E: The current work, our current focus, assumes the schedule is committed; if someone needs to change the committed schedule, I would leave that as incremental work that can happen on top of this.
C: Yeah. So, for instance, with the auto-TIR approach you kind of have to imagine there'll be some sort of iterative process here, right? Perhaps we'd run scheduling, and scheduling may come to one arrangement, and then that would be passed along to memory planning, and memory planning would then plan within those bounds. And then, at the end of the day, you look and see the peak memory, or whatever statistic you want to use, whatever cost model...
C: ...comes with that, and then you may come back and iterate again and optimize further. That's not to say you couldn't do optimization within each of those iterations, but I guess the main question is whether or not the memory planner should also be responsible for reordering things, or whether it should just work within the given bounds.
C: Currently, I think the prevailing thought is to keep things operating within the bounds of what's already been scheduled, but I would also say we're very early on this, and so I think everything's very open to proposals. So if you have thoughts, or you want to propose something more concretely, I think that would be...
C: ...welcomed on the Discuss forum, for sure. I think it'd be great to see a more comprehensive layout of the scheduling problem in general, and then to think a bit more about how to put the pieces together, if that makes sense. I agree with what you're saying, Sebastian.
C: Yeah, if you had a sketch of the buffer sizes in each function, in principle that is enough information for some component to do both memory planning and scheduling, if we wanted to combine those two.
A: Well, I think this is probably a good time to call it a meeting, unless there's something else we urgently want to discuss. If not, as always, the Discuss forum is the definitive place for all discussions anyway, so that's not a bad place to continue.
C: Yeah, thanks for the discussions, everyone; that was really great. Please keep following along, and we'll make sure everyone keeps posting to the Discuss forums as things develop. Also, don't hesitate to use the Discord as well for faster chats if people need to.
L: Yeah, and before we go, I also wanted to remind everyone that we have a general TVM community meeting tomorrow, where Leandro is going to be presenting on some of the work they've been doing with the Docker containers and CI.
A: Thanks for the reminder. Okay, everyone, let's go ahead and call it a meeting. I'll get the meeting recording to you, Chris, in just a moment. Thanks for attending, thanks for participating, and we'll see you in about two weeks.