From YouTube: Apache TVM Community Meeting, April 6, 2022
A
Okay, hi everyone. So this is the TVM community meeting. We are now doing this each week: a gathering of people who participate in the TVM community. The meeting relies mostly on having an agenda proposed by the community.
A
So a couple of reminders from our side. One is that we are always looking for volunteers to propose new topics, like today's topic, Collage, which will be presented and the discussion organized by Mark. The second thing is that we are also trying to have more people hosting these meetings, like I am today and like Andrew was doing last week, to guarantee the continuity of the sessions.
A
You can just post there, or you can contact us directly on the Discord, for example, if you want to either present or host one of these sessions. Now, to start on the community announcements for today: I'm very happy to welcome Gustavo Romero and also Mehrdad Hessar as new committers for the project. This is basically a recognition of your great effort in pushing PRs, reviewing code, and being active in the discussions on the forum and in all the other channels that we have for this community.
A
So yeah, congratulations to both, and I guess this is mostly what I have as an introduction. Again, we are recording this session, so it will be available after the meeting in case you want to refer someone to it and share the video as well, so that people can see the great discussions and things we host here every week. And with that I can hand over to Mark Shultz, who's going to present the main agenda topic for today, which is Collage, a new contribution that was recently merged into the codebase.
B
We're a pretty small crowd, so why don't folks just interrupt? Yeah, okay. And if you just want to chime in on audio: posting questions in chat is fine, but it tends to encourage large questions that aren't necessarily in sync with the presentation, and it makes it a little harder. So since we're a small crowd, why don't we just take... yep, okay. Okay, I'll do the screen-share thing; bear with me.
B
There you go, let's see if it works. This is a new machine, so forgive me if there are shenanigans I have to go through. How's that? Yeah, present slideshow, all right. So, as Leandro mentioned, this went in, when was it, last week. With this RFC I really want to try to be more... you know, this isn't written in stone tablets; it's not that the RFC is fixed and everything from here on is fully determined.
B
I do want to be able to go back around this RFC and revise it as we learn more. Parts of this project are quite speculative, as you'll see, and I think as we get deeper into actually checking stuff into main, I'm pretty sure we'll want to go back and revise. Ultimately I would like to make this an as-built, as-implemented record of what we did. It's already out of date, just based on my own prototyping, anyway.
B
So I've been mostly chipping away at this. Michalis Papadimitriou, who's actually in this session (thanks!), has been working on this; it's actually a good time for him, he's in Greece. Matthew Barrett is a new OctoML employee; he's just started with us. And Sung Park, who we're very lucky to have here at OctoML, is one of the authors on the preprint, which I'll mention later. But most of the problems you can blame on me.
B
So what's the basic idea here? Well, let's start with MNIST. I took out one layer, because once you've seen one layer you've seen them all. The idea is that we want to find an optimal partitioning for your overall model graph. We have all of these available backends, and one of the very unique features of TVM is its...
B
...ability to support bring-your-own-compiler plugins, and those plugins can work at very different levels of abstraction. Some of them, like TensorRT, almost want to be whole-model...
B
...compilers. Others, like the TVM backend itself, are all about good scheduling for kernels. CUTLASS is more of a library with its own tuning infrastructure, and then we have more library-based things. So we really want to make sure that we bring a sense of tuning, of optimality, to this mixing and matching of all of these BYOC plugins and TVM's own lowering machinery, in the same way that TVM itself does tuning for schedules. And so basically we want to go from that to something like this.
B
I'm just making this up, so don't take it as representative, but this particular partitioning has decided that TensorRT does a particularly good job with the fused conv2d, our old friend; the dense (the matmul-transpose) down at the bottom turns out to be most efficient on cuBLAS; and TVM is left behind to fill in the kernels that remain, in this case a pad, a max pool and a reshape, but also (I won't really get into this)...
B
...it's also responsible for all the additional glue. Remember, at the end of the day the VM or the graph executor is responsible for all the plumbing, and some of that can include, you know, actually pushing constants in and so on. So part of the partitioning decision is also to decide that yeah, these particular expressions, constants, whatever, don't need to be fused into any particular kernel; they are just executed by the VM itself. So in effect there's a kind of residual host partition in this world as well.
B
Okay. And obviously the whole point of this is that overall end-to-end model latency is reduced compared to following an eager strategy, which would be just using TVM, or just using partition-for-TensorRT, or even carefully constructing some chain of partition-for-TensorRT, then partition-for-cuBLAS, then partition-for-something-else, to try and find that optimality. Instead, we simply fall back on using measurement to guide this optimal selection strategy.
B
And that's honestly about it in terms of the setup; everything that follows from this is all engineering, so let me get into that. Obviously this is based on a preprint. Just to declare: I'm not trying to do research here. I really want this to be an engineering project, so I'm taking my job as basically picking up the research that's already been done and getting it into main in a sustainable and reusable way.
B
So I'm sure there's going to be lots of "oh, we could do this, and we could do that, and have you seen this paper?" and so on. I'm deliberately putting a little bit of a wall around us so that we don't get drawn into that, because I do want this to be very much an engineering project. Anyway, so that's the preprint.
B
I'm not going to present any graphs and so on; this isn't a conference talk or anything, and I defer all of the performance questions to the paper. Now, obviously, here at OctoML we have extensive infrastructure for doing performance comparisons and sweeps across all sorts of models.
B
A
lot
of
those
models
aren't
even
public
they're
actual
models
that
our
customers
have
given
to
us.
So
obviously,
internally,
we
are
paying
very
close
attention
to
that,
but
we're
still
in
the
process
of
building
that
infrastructure,
we're
trying
to
connect
what
we're
doing
here
into
that
existing
infrastructure,
and
so
that's
why
I
don't
want
to
start
to
throw
bar
charts
at
you
or
anything
like
that,
because
I'm
still
not
confident
myself.
B
Nevertheless,
the
paper
shows
that
indeed,
you
can
do
better
if
you
kind
of
use
actual
empirical
measurements-
and
you
are
prepared
to
be
very
flexible
in
mixing
and
matching
between
the
different
backhands.
The
paper
does
demonstrate
that
you
can
do
better
beyond
simply.
You
know
partition
for
tens,
rt
and
letting
tv
and
do
the
rest
things
like
that.
B
Just
for
those
who
happen
to
be
familiar
with
the
paper,
if
you're
not
just
ignore
what
I'm
about
to
say.
So
this
the
rfc
is
basically
the
pre-print
we've
taken
away
the
evolution,
research
aspect
from
the
paper.
I
know
ml
cis.
Folks,
love
evolution,
research,
it's
a
it's
a
great
way
to
do
all
sorts
of
fun
things,
we're
we're
just
sticking
to
basic
dynamic
programming
approach.
For
now
that
we
also
have.
B
We
were
quite
worried
in
looking
at
how
to
bring
the
paper
into
maine
in
a
sustainable
way.
We
were
quite
worried
that
the
the
papers
prototype
implementation,
relied
on
a
whole
new
library
of
if
you
like,
fusion
patterns
or
byo
c
patterns
with
its
own
fusion
engine
and
so
on,
and
we
were
quite
concerned
that
we'd
kind
of
end
up
with
you
know
a
non-scalable
process,
namely
that
every
time
one
of
our
mls
customers
came
to
us
with
a
new
model
and
a
new
target
we'd
have
to
go.
Oh
wait:
okay!
B
Well,
this
byoc
has
these
patterns,
but
now
we
have
to
replicate.
You
know
all
of
that.
All
of
that
logic
up
in
the
collage
level.
We
just
felt
that
that
wasn't
sustainable.
So
one
of
the
big
things
we've
tried
to
do
in
this
this
kind
of
version.
The
rfc
version
is
to
make
sure
that
we
directly
piggyback
on
byoc.
B
We
also
the
paper,
also
kind
of
introduced,
a
new
notion,
that's
orthogonal
to
targets
and
devices
called
back
ends.
We've
taken
that
notion
and
folded.
It
back
into
targets
which
you'll
see
later,
and
the
thing
that
we
have
added
is
we've
kind
of
lent
very
heavily
into
this
new
notion
of
a
partition
specification
which
is
kind
of
has
its
own
little.
It's
like
a
it's
a
little
bit
like
df
patterns.
B
In
fact
it's
built
on
top
of
df
patterns,
but
it's
its
own
library
of
base
and
combinator
rules
that
when
you
combine
them
in
different
ways,
you
can
express
different
partitioning
strategies
and
that
flexibility
means
that
we
can
make
the
overall
search
much
more
efficient
and
we
can
also
without
getting
drawn
into
endless
patterns
and
so
on.
We
can
tune
the
search
for
the
different
byoc
targets,
I'm
not
really
going
to
get
into
that,
except
for
one
slide.
We
can
talk
about
that.
B
If
you
want
to,
everyone
is,
of
course,
welcome
to
take
a
look
at
the
tree.
We
are
just
starting
to
peel
things
off
the
tree
and
start
to
push
through
to
maine.
Now
that
the
rc
is
kind
of
in
place,
I
expect
it'll
take
us
a
month
or
two
to
kind
of
chip
away.
All
comments
are
welcome.
B
If,
during
the
pr
review
we
realize
yeah,
we
didn't
quite
get
this
right
very
happy
to
backtrack
to
the
rfc
and
we
you
know
we
we
follow
through
and
just
if
you
did
want
to
actually
try
out
the
the
tree.
I
should
warn
you
I'm
you
know
not
keeping
it
in
it.
It
goes
into
experimental
dead
ends.
So
you
know
look
at
the
code,
but
please
don't
assume
it's
actually
going
to
work
for
you.
B
However,
I
don't
think
I
cover
this
elsewhere,
but
I
am
you
know.
Obviously
we
started
out
with
good
old
mnist
just
to
get
off
the
ground
and
get
the
get
the
basics
working.
We've
moved
on
to
gpt2,
just
because
it's
that's
about
1300
nodes
and
it's
a
good
way
to
kind
of
tease
out
where
you've
accidentally
got
it.
B
You
know
a
super
linear
dependency
on
n
and
you're
you're,
fooling
yourself,
so
it
is
chugging
away
on
gpt2
and
but
again
I'm
not
going
to
talk
about
actual
performance
improvements
for
that.
B
Okay. So from the outside we've tried to make it as innocuous as possible. The first thing is that it's opt-in: nothing will change, other than a few passes here and there that we've had to robustify, and nothing is going to change on the mainline path if you don't opt in. All the existing BYOC calls will still work, and you'll be running the same passes; it'll still be running TVM's built-in FuseOps, all of that stuff.
B
If
you
do
opt-in
and
let's
see
so-
here's
the
oh
oops,
sorry,
I
can't
highlight
text
so
the
relay.collage.enableclash.
True,
that's
what
kind
of
you
know
switches
you
into
the
light,
fantastic
in
your
in
your
past
context,
so
that's
kind
of
first
point
of
customization.
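For concreteness, here is a minimal sketch of what that opt-in looks like. The config key follows the flag named in the talk (its exact spelling comes from the Collage RFC and may differ in main), and mod, params and targets are assumed to be defined elsewhere:

```python
import tvm
from tvm import relay

# Hedged sketch: nothing changes unless you opt in via the pass context.
# The config key is taken from the talk/RFC and may differ in main;
# mod, params and targets are assumed to be defined elsewhere.
with tvm.transform.PassContext(
    opt_level=3,
    config={"relay.collage.enable_collage": True},
):
    lib = relay.build(mod, target=targets, params=params)
```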
B
The second point is that, in order to express to the Collage machinery what it should be considering and exploring, I've piggybacked on targets and introduced a notion of target refinement. So here I'm building up a bunch of targets, and some of these targets are just plain old CUDA targets; however, they've been given an additional attribute, "compiler", which corresponds to the BYOC compiler name. Then I just throw all those targets together and pass them in, and so this is piggybacking on the existing heterogeneous target support machinery.
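A hypothetical sketch of the kind of target list being described. The "compiler" attribute follows the RFC's prototype; treat both the attribute spelling and the dict form as assumptions, since stock targets only validate registered attributes:

```python
import tvm

# A plain CUDA target: partitions assigned here use TVM's own lowering.
cuda = tvm.target.Target("cuda")

# "Refined" CUDA targets: same kind, plus a hypothetical "compiler"
# attribute naming the BYOC toolchain (spelling per the RFC prototype;
# this validates only once that attribute is registered for the kind).
trt = tvm.target.Target({"kind": "cuda", "compiler": "tensorrt"})
cublas = tvm.target.Target({"kind": "cuda", "compiler": "cublas"})

# Everything is thrown together and passed to the build as a list rather
# than a device dictionary, reusing heterogeneous target support.
targets = [cuda, trt, cublas]
```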
B
There are a few changes I had to make, because now this is a list, not a dictionary, but they're all actually pretty minor. And thanks to Matthew, you can now also have compiler=cudnn and compiler=cublas. Okay, and with that, it will trigger a new CollagePartitioner pass which (and this is another difference from the paper) is run very early.
B
It's
in
fact,
it's
run
as
soon
as
we
can
get
away
with,
and
the
reason
for
that
is
this
practitioner
pass
is
going
to
be
kind
of
it's
you
know
reaching
in
and
looking
at
all
the
byoc
patterns
and
so
on
in
order
to
decide
what
all
the
valid
partitionings
are,
and
we
want
to
make
sure
that
those
byoc
patterns
see
the
same
graphs
that
they
currently
do
with
the
existing
partition
for
tool
chain
convention,
and
that
convention
is
obviously
you
do
your
partitioning
before
you
enter
the
main
tvm
build,
and
so
that's
forced
us
to
put
the
collage
practitioner
way
up
front
in
the
rrc.
B
I think we've made the right decision, and we've just moved it to be as early as possible. Note, however, that the paper does its work as a hook inside the existing FuseOps, so that's another difference from how the paper approaches it.
B
Okay. And then the partitioner does its thing: it goes off, does search, does tuning, does all that stuff. Its output is actually no different from the output you currently get using all of the existing machinery: it encodes all of its decisions in exactly the same way they're currently encoded, using primitive functions with "Compiler" attributes, where the bodies of those functions may have composite functions.
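As a reminder, here is a small hand-built sketch of that encoding. The attribute names (Primitive, Compiler, Composite) are the standard BYOC convention; the shapes and the pattern name are purely illustrative:

```python
from tvm import relay

# Inner "composite" function: records which BYOC pattern matched
# (the pattern name and shapes are illustrative).
x = relay.var("x", shape=(1, 32, 28, 28))
w = relay.var("w", shape=(32, 32, 3, 3))
inner = relay.Function([x, w], relay.nn.conv2d(x, w, padding=(1, 1)))
inner = inner.with_attr("Composite", "tensorrt.conv2d")

# Outer "primitive" function: records which BYOC compiler lowers it.
a = relay.var("a", shape=(1, 32, 28, 28))
b = relay.var("b", shape=(32, 32, 3, 3))
outer = relay.Function([a, b], relay.Call(inner, [a, b]))
outer = outer.with_attr("Primitive", 1)
outer = outer.with_attr("Compiler", "tensorrt")
```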
B
Basically,
we've
had
to
make
sure
that
the
compiler,
if
you
simply
take
a
module,
that's
already
been
kind
of
rewritten
to
use
these
primitives
and
so
on.
In
any
form
you
want,
and
you
just
pass,
that
through
tvm,
you
get
exactly
what
you
expressed.
That's
not
quite
the
case
at
the
moment.
There's
a
few
little
glitches,
but
we've
fixed
those,
okay
and
and
so
yeah
after
collage
has
done
its
thing.
We
just
let
compilation
proceed,
there's
no
downstream
changes
and
all
of
the
existing
lowering
dispatch
and
unification.
B
Okay, so that's kind of the outside. On the inside, I didn't want to go too deep, so I thought I'd stay more on the light and fluffy side, and we can just chat about anything that takes people's interest. I should mention I'm working on some additional things which are not in the RFC, and so I need to go back and put some more explanatory text in there, but...
B
...let's just skirt along the surface and see how we do. So, on the inside: what does this mysterious CollagePartitioner pass do? Well, naively you could just try all the partitionings, and given the no-expense-spared approach we have to tuning, maybe it's not completely outrageous that we would almost brute-force our way through. But thankfully we don't need to do that, and so we can bring this back into the realms of practicality.
B
Certainly
as
soon
as
you
start
dealing
with
n
factorial,
you
know
because
a
complexity
class
you
have
to
trend
very
very
carefully,
so
there's
basically
two
main
assumptions.
The
the
first
is
that
we
can
kind
of
rely
on
the
existing
boac
patterns,
along
with
some
very
simple
kind
of
fusion
styles,
in
order
to
kind
of
get
what
I've
been
calling
ideal
partitions
and
we
can
kind
of
recover
those
ideal
partitions
kind
of
independently.
B
So
each
each
potential
back
end
can
recover
its
notion
of
ideal
partitions,
independently
of
all
the
others,
and
we
can
proceed
from
there.
And
so
what
is?
What's
an
ideal
partition,
maybe
I'm
getting
a
little
too.
You
know
hung
up
on
notation
here,
but
so
the
idea
is
that
an
ideal
partition
is
kind
of
like
a
goldilocks
partition.
It's
not
too
large
and
it's
not
too
small
right.
B
So
we
want
it
to
be
as
large
as
possible
because,
let's
say
you
know
the
part.
Let's
say
that
we
on
the
one
hand
we
have
a
partition,
conf
2d
ad
and
another.
We
have
the
partitions
confidenti
and
ad
separately.
B
Well,
obviously,
if
we
don't
explore
the
the
partition
that
has
both
of
those
operations,
we're
missing
out
on,
you
know
the
fusion
opportunities
and
other
optimizations,
which
is
the
whole
point
of
of
this
work,
and
so
we
want
to
make
sure
that,
when
we're
dealing
with
ideal
partitions,
they're
kind
of
large
so
that
we
get
lots
of
opportunities
for
the
various
byoc
backends
to
kind
of
you
know
flex
their
muscles
and
and
trigger
all
the
fusion
and
optimization
that
they
want.
B
But
on
the
other
hand,
we
don't
want
them
to
too
big,
I
mean
because
then
we're
kind
of
you
know
stuck
having
to
explore
this
huge
space.
So
we
so
we
we
do,
we
want
them.
We
don't
want
them
so
large
that
if
we
split
them
you
would
get
much
the
same
execution
time
as,
if
you
you
know,
measured
them
together.
So
in
other
words,
let's
say,
we've
got
two
confides
for
argument's
sake
in
succession.
B
We
could
say:
oh
well,
obviously
the
ideal
partition
is
confidentially
confidentially,
but
the
execution
of
time
of
that
for
a
particular
boioc
target
is
probably
the
same
as
just
having
two
partitions
confidentially
conflicted
and
adding
their
execution
times,
and
that's
because
by
unioning
those
things
we're
not
kind
of
opening
up
any
more
optimization
possibilities.
B
So,
basically,
with
a
by
being
careful
with
these
rules,
we
can
make
sure
that
the
starting
point
for
the
search
is
kind
of
primed
from
partitions
that
are
kind
of
sensible,
that's
kind
of
the
hand
wavy
way
of
saying
it,
and
then
the
second
simplifying
assumption-
and
this
one
is-
is,
you
know
more
suspect
and
is
why
the
paper
explored
using
evolution.
Research
so
we're
assuming
that
when
we
have
two
partitions
that
we're
exploring
that
their
costs
are
additive
and
so
basically
given
to
partitions.
B
...the search is assuming that the cost of running A and B as a single run is the same as the cost of running A plus the cost of running B in isolation, plus a small penalty to account for the fact that, yeah, you had to launch a kernel or make some other call; there's some overhead to doing that call. And that assumption is patently false.
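Spelled out, the additive-cost assumption is roughly the following (the overhead term delta is my notation, not the RFC's):

```latex
\mathrm{cost}(A \cup B) \approx \mathrm{cost}(A) + \mathrm{cost}(B) + \delta_{\mathrm{overhead}}
```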
B
I
mean
we
have
cache
effects
and
all
sorts
of
other
things
that
mean
costs
aren't
additive,
but
nevertheless,
it's
a
simplifying
assumption,
which
means
that
we
can
now
just
use
a
classical
dynamic
programming
approach
to
doing
this
search
and
in
fact
the
rfc
uses
dijkstra
just
because
I'm
trying
to
make
I'm
tr,
I'm
hoping
that
we
don't
have
to
explore
the
whole
space
that
we
can
kind
of.
You
know
once
once
you
get
to
a
particular
point
in
the
search
space
and
you
realize
there's
a
very
expensive
option.
B
Well,
you
don't
need
to
waste
time.
You
know
branching
out
from
there,
in
other
words,
to
use
the
the
classic
shortest
path
terminology,
I'm
hoping
that
the
graph
has
a
low
bloom
factor
and
that
you
can
kind
of
fairly
quickly
kind
of
just
find
your
your
shortest
path,
and
I
think,
from
here
on.
I
have
some
pictures
because
I
spent
all
this
time
drawing
them
for
the
rfc
and
figured
well.
B
...if I spent all that time, we might as well look at them. So, on the inside: obviously we're going to be doing lots and lots of work with subgraphs; this is our core data type. The paper did it this way, and I thought it was a very nice idea. Basically, you assign a post-DFS index to every node, which is already done as part of the index-graph machinery...
B
...that's already inside TVM, as part of the DFPattern machinery. So we assign a unique id to every node, and now we can build a very efficient representation for subgraphs. We're going to have many of them, you know, thousands; I think with GPT-2 we end up with about 4,000 kicking around. So we can represent them very efficiently as bit vectors. And then there's also this whole machinery for partition rules, which I mentioned early on.
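A toy sketch (my own code, not the RFC's) of that bit-vector subgraph representation, using a Python int as the bitset keyed by post-DFS index:

```python
# A subgraph is the set of post-DFS node indices it covers, packed into a
# single Python int used as a bitset (illustrative code, not the RFC's).
class SubGraph:
    def __init__(self, indices=()):
        self.bits = 0
        for i in indices:                  # post-DFS index of each node
            self.bits |= 1 << i

    def disjoint(self, other):
        # Candidates can only be combined if they cover no node in common.
        return self.bits & other.bits == 0

    def union(self, other):
        result = SubGraph()
        result.bits = self.bits | other.bits
        return result

# Example: a conv2d at index 3 and an add at index 4 form candidate {3, 4}.
cand = SubGraph([3, 4])
assert cand.disjoint(SubGraph([5]))
assert not cand.disjoint(SubGraph([4]))
```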
B
So when Collage begins, it looks at the targets, looks for the "compiler" attributes in those targets, and from there it goes off and looks at the BYOC plugins; from that information it imports and builds its own representation of what's partitionable. Then those partition rules can be, if you like, executed on a dataflow graph in order to yield a set of candidate partitions, and a candidate partition is the subgraph plus the target that you're wanting that subgraph to be compiled for. And this slide is just showing you that we actually compose those patterns in order to effect the kind of rules that we're looking for.
B
In
this
case,
it's
looking
like
it's
a
it's
kind
of
more
like
a
cutlass
style
integration,
where
it's
now
building
up
a
whole
set
of
possible
partitions
based
on
df
patterns
that
are
kind
of
pulling
out
the
primitives
and
additional
fusion
combining
rules.
That
kind
of
combine
those
patterns
to
yield
the
ideal
partitions.
B
And
whoops,
and
so
once
we've
done
that
now
we
move
into
actually
doing
the
search
and
the
search
is
done
on
an
implicit
search
graph.
We
don't
actually
materialize
the
whole
graph.
B
That's
that's
not
necessary,
so
that
the
a
node
in
this
search
graph
is
actually
the
bit
vector
representing
all
of
the
nodes
in
the
model
that
you've
already
accounted
for,
and
so
basically,
what
we're
saying
is
every
path
into
a
particular
node
in
the
search
graph
has
by
some
combination
of
candid
of
partitions
has,
if
you
like,
covered
all
of
these
nodes.
So
we've
we've
already
decided
that
somehow
we
know
what
to
do
with
this
subset
of
the
model.
And
now
the
question
is:
what
do
we
do
with
the
rest?
B
And
so
the
the
edges
out
of
these
search
nodes
are
all
the
possible
candidate
partitions
which
can
slot
in
at
that
point
without
intersecting
anything
that
we've
already
accounted
for,
and
obviously
you
don't
want
to
waste
time
kind
of
you
know
if
you,
if
you
apply
one
partition,
rule
and
then
another
partition
rule
well,
that's
the
same
as
applying
them.
The
other
way
around.
So
there's
there's
tricks
in
there
to
make
sure
that
you
don't
waste
time
searching.
You
know
possible
rewrites
that
are
obviously
commutative.
B
Everything
I've
written
here.
Yes,
these
are
just
regular
relay
operators.
The
relay
operator
is
okay,.
C
D
B
And
where
are
foot
star,
I
just
mean
it's
just
my
notation
for
saying
I'm
rewriting
just
this
sub
graph
and
the
star
whatever
whatever
is
inside
the
star
is
not
part
of
the
subgraph
internally.
It's
not
represented
as
these
expressions
it's
represented
with
these
bit
vectors.
B
Right. And so, just using classic Dijkstra, you basically lazily explore this search graph. You start with the starting state, which has no covered nodes; the ending state has every node accounted for. At every node you simply enumerate all of the candidate partitions that can slot in there without violating any of the rules, and you keep track of the cumulative cost of the best path. And with any luck, if there's a low branching factor in your search, the search will narrow down and find a path to the finished state without having to explore the whole space.
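Putting those pieces together, here is a compact sketch of the search loop just described. Every name is illustrative, and cost_of stands in for the measurement-backed cost estimator discussed next:

```python
import heapq

def collage_search(num_nodes, candidates, cost_of):
    """Dijkstra over the implicit search graph (illustrative, not RFC code).

    num_nodes  -- number of dataflow nodes in the model
    candidates -- list of (bits, target): each candidate partition's
                  covered-node bitset plus the backend it would run on
    cost_of    -- callable (bits, target) -> measured cost in seconds
    """
    done = (1 << num_nodes) - 1            # final state: every node covered
    best = {0: 0.0}                        # cheapest known cost per state
    frontier = [(0.0, 0)]                  # (cumulative cost, covered bitset)
    while frontier:
        cost, covered = heapq.heappop(frontier)
        if covered == done:
            return cost                    # first pop of the goal is optimal
        if cost > best.get(covered, float("inf")):
            continue                       # stale queue entry
        for bits, target in candidates:
            if bits & covered:
                continue                   # overlaps nodes already decided
            nxt = covered | bits
            c = cost + cost_of(bits, target)   # the additive-cost assumption
            if c < best.get(nxt, float("inf")):
                best[nxt] = c
                heapq.heappush(frontier, (c, nxt))
    return float("inf")                    # no complete partitioning found
```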
B
We
are
actually
doing
that's
an
excellent
question,
we're
actually
doing
auto
tuning
or
currently
just
auto
tvm
on
the
fly,
and
so
and
that's
you
know
this.
This
is
where
I'm
a
little
worried
because
for
auto
tvm,
it's
not
too
bad,
but
for
the
newer
meta
schedule,
machinery
every
candidate,
tvm
kernel
will
be
treated
as
its
own.
B
You
know
tuning
task,
and
so
we
are
going
to
be
exploring
a
lot
more
of
those
than
tbm
would
just
left
to
its
own
devices,
because
tvm's
fuseops
is
always
eager,
whereas
in
this
world
we
have
many
more
possible,
you
know
candidate,
kernels
to
to
try
and
tune
for,
but
yes
currently
currently
I
I
haven't
made
any
face
distinction
here.
When
you
as
collage
is
searching,
it
may
find
a
particular
candidate.
E
Mark, on that point: will that still be covered by the caching mechanism in the cost estimator? Assuming that, you know, in a subsequent search Collage decides to tune the same operator.
B
Yes
right,
absolutely,
yes,
so
all
of
the
estimate,
the
cost
estimator,
is
obviously
backed
by
a
cache
here
at
octoml
we
have
a
cache
that
has
visibility
across
all
of
the
models
and
all
of
the
targets,
and
so
one
would
hope
that
we
get
a
good
hit
rate
on
that,
but
even
taking
that
aside,
certainly
when
you're
tuning,
you
know
a
lot
of
these
deep
models.
B
Right. So there's an abstract cost-estimator interface which, given an IRModule (well, an IRModule and a target), gives you a double; that's pretty much it. In the prototype there's only one instantiation of that interface: it just runs using the public TVM local runners, and it actually bottoms out into the standard benchmarking machinery that's in Python, so that folks can adjust it as they need to.
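A sketch of that interface; the class and method names are mine, and the local instantiation leans only on standard public APIs (relay.build, GraphModule.benchmark):

```python
from abc import ABC, abstractmethod

import tvm
from tvm import relay
from tvm.contrib import graph_executor

class CostEstimator(ABC):
    """IRModule plus target in, one latency number out (names are mine)."""

    @abstractmethod
    def estimate(self, mod: tvm.IRModule, target: tvm.target.Target) -> float:
        """Return the estimated mean latency in seconds."""

class LocalEstimator(CostEstimator):
    """Build and benchmark locally, like the prototype's one instantiation."""

    def estimate(self, mod, target):
        lib = relay.build(mod, target=target)
        dev = tvm.device(target.kind.name, 0)
        runtime = graph_executor.GraphModule(lib["default"](dev))
        return runtime.benchmark(dev, number=10, repeat=3).mean
```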
B
The caching in the prototype, which I will probably make the default (I'll have to clean it up a bit, because currently it's a little too hard-coded), works as a naive in-memory cache, coupled with a little bit of hackery that I've done to use the standard AutoTVM tuning records as a cache as well.
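Continuing the previous sketch, the naive in-memory cache can be modelled as a wrapper over that interface (the AutoTVM tuning-record reuse he mentions is not shown):

```python
# Assumes CostEstimator from the sketch above.
class CachingEstimator(CostEstimator):
    """Memoize estimates keyed by structural hash of the module plus target."""

    def __init__(self, inner):
        self.inner = inner
        self.memo = {}

    def estimate(self, mod, target):
        key = (tvm.ir.structural_hash(mod), str(target))
        if key not in self.memo:
            self.memo[key] = self.inner.estimate(mod, target)
        return self.memo[key]
```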
B
The net result is: I could check in a kind of cached representation of the AutoTVM tuning for a bunch of examples, but I'm thinking not to try to check in a cache for any of the Collage candidate partitions, because honestly it's pretty cheap to measure those. The most expensive thing in the Collage search is the TVM tuning, not the "let me compile and evaluate how quick this is" for CUTLASS and things like that; that's all pretty quick.
E
I have another question. So after the candidate partitioning is searched, let's say we end up with a better partitioning: will Collage, will this work, consider merging any compiled regions, if that was an original desire of that target?
E
So when you find a better partitioning in the graph, and it's covered by a BYOC target that originally had the desire to merge the compiler regions: after the search is concluded, is this then merged?
B
Yes, there is actually a cleanup pass that merges adjacent partitions. So even though during search I only looked at the little, what I started calling ideal, partitions, if you end up finding, oh yeah, Collage, Collage, TVM... long stretches of little TVM candidate kernels, yeah, they all get joined together. So it's a little bit like what MergeCompilerRegions already does.
B
Yeah. And based on the simplifying assumptions, doing so should be neither here nor there. It might save a little bit of kernel transition time; those are supposed to be pretty small. But if by doing that suddenly, whoa, wait a minute, everything's dramatically faster, well, that means Collage probably should have been searching on larger candidates to begin with.
B
Okay, and I think... I'm not sure I have any more slides... oh yeah, the most important slide. So let me put up some disclaimers, just so that there's no disappointment. As I said, we have to be careful to only look at fairly smallish subgraphs. I think currently I'm at like n equals four; maybe we can push it to n equals six or something.
B
Well, we might never explore that, which means the user may have been better off just running partition-for that toolchain in the first place. So, yeah. Well, I'm still hopeful that won't be a problem, but the proof is in the measurements.
B
Lots
of
people
bring
up
very
legitimately.
What
a
minute
that
you
know
my
particular
byoc
needs
this
particular
layout,
or
it's
only
running
on
this
particular
device.
So
can
you
since
you're
already
searching
over
partitions?
Can
you
extend
that
search
to
also
be
in
terms
of
like
device,
placement
or
layout
or
memory
scope
and
all
the
other
kind
of
choices?
B
And yes, we'd love to do that. Not in this version.
B
There
is
an
approach
to
doing
this,
which
we
could
try
in
a
v
next,
but
for
the
moment,
I'm
just
simply
declaring
sorry,
that's
a
scope,
and
so
that's
this
means
that
for
some
targets
like
nvidia,
you
probably
want
to
first
apply.
You
know
a
global
layout
and
then
enter
the
main
collage
partitioning
and
then
proceed,
and
I
should
mention
that
just
because
collage
is
doing
search
doesn't
mean
all
search
has
to
be
done
by
collage.
B
There's
many
layers
of
choices
to
be
made
and
third
limit
is
yeah,
we're
very
much
in
the
the
auto
tuning
world
here
I
know
there
are
lots
of
folks
who
need
kind
of
a
more
classical
compiler
tool
chain
that
doesn't
involve
waiting
a
few
hours.
Absolutely,
I
think,
at
this
point,
collage
is
not
going
to
be
for
you.
B
Theoretically,
the
cost
estimator
could
be
replaced
with
an
analytical
model,
but
I
think
it's
firmly
in
research
territory
as
to
what
that
analytical
model
could
be,
and
then
final
limitation
is,
we've
tried
as
much
as
possible
to
just
piggyback
directly
on
byoc
they're.
You
know,
because
the
interface
isn't
particularly
firm,
there's
a
lot
of
variation
into
in
how
folks
have
done
it,
and
we've
had
to
make
a
few
adjustments.
B
I'm
taking
it
upon
ourselves
to
make
those
adjustments,
as
as
we
go
without
breaking
backwards
compatibility.
That's
about
all
I
had,
I
feel
like.
I
probably
should
have
paused
more
for
questions.
C
I have one question, actually. Yeah: supposing that we identify a model that has a set of somewhat identical parallel graphs that you could offload, is there a mechanism by which we estimate the latency of running those sections in parallel, if they're...
B
Yeah, yeah. That's a whole other sub-area here of, you know, how flexible... when I say partitioning, what do I mean? Do I mean that, yes, you can explore serial versus parallel? There are also (I don't even talk about this in the RFC) notions of inlining. You know, like: hey, I see a reshape which is then shared 400 times, or in GPT-2 like 38 times or something ridiculous.
B
Should
I
be
exploring
well
do
the
reshape
and
then
share
the
result,
or
should
I
inline
the
reshape
into
all
of
the
consumers?
So
basically
none
of
that
we're
doing
so.
At
this
point
all
we're
searching
over
is
like:
where
does
the
cookie
cutter
go,
and
you
know
what
color
are
the
cookies?
You
know
what
target
does
that
thing
go
to
and
that
that's
it
so
yeah
more
more.
B
...so you may want to say: okay, well, here's my partition, but also do some copying in, and maybe put a, you know, a boundary here to separate. As long as you can do that, and you can estimate the latency by measuring that thing in itself (you don't need to go back and measure the whole thing), as long as you stay within those lines, you're still in this nice, friendly, dynamic programming world, and I think, in due course, we could explore that.
B
I want to have a very firm foundation on the empirical stuff, because I think it's too easy to fool ourselves. But when you start getting into worlds where the choice I make here has a dramatic global effect, and I can only measure the effect of that change by measuring the overall model latency, you're well, well outside of the dynamic programming world. And, yeah.
A
Any other questions, or anything? I guess you can just unmute yourself and ask, or raise your hand, or type your question. There are many options.
B
It was... yeah, no worries.
C
Yeah, should we allow anyone to introduce themselves, if they're new here and want to say hi? That's the only other thing I could think of that we missed from our standard run-through.
F
Hello, yes, I'm new; I'm George. I recently started working with the compilers team within Arm. So, happy to be here, and to put some faces to the community as well.
A
No, probably not. So I guess we can call it for today.
B
Yeah, thanks everyone, and feel free to just send any questions or comments or whatever. I think the RFC is closed, so that's not very convenient, but on the original Discuss post.
C
If you can find it. I took some notes on the conversation today, so not on the presentation part, but I'll post up some brief notes on any of the conversation bullet points, I guess on the RFC thread, or...
C
...yeah, it's hard to take notes and present at the same time. And actually it's worth mentioning here, along the lines of what Leandro was asking about at the beginning of the meeting: one thing that we should get a little bit better at, and I'll try to improve this in the next couple of weeks, is that we'd really like to have someone be a host and someone be a note-taker, and to identify those people early on. And so, you know...
C
...but especially as we're starting to spread out the load of hosting meetings, we'll start asking folks to do that too. So that's another role that, if anyone's interested in, please do sign up.
C
Ping us. So, great.
A
Yeah, wow. And I guess for everyone else, a reminder that we are looking for topics for the agenda to be composed for the next weeks. So if you want to volunteer, you can reach out on Discord, or use the document that you see on the forum as well. I guess that's it for today. Thank you, Mark, for the presentation; thank you, everyone, for attending, and we meet again next week.