From YouTube: Parallel Rustc Planning Meeting 2019-10-21
Description
Discussing the jobserver integration, how LLVM parallelism works today, and what better Rayon-jobserver integration might look like.
B
So I read up a lot, and I wrote up a document that's just kind of an overview of what the parallelism is and where the job server fits in specifically. I don't mind going over any of the various pieces, and I don't mind diving into one particular one, but we can also just start with the job server and see where it goes from there.
B
It's kind of been a while since I've looked at what exactly we have in tree right now, but I do think a lot of the performance issues are related to the job server; I just don't think they're related to the LLVM integration at all. I believe this is where, a little while ago, we were talking about how maybe rustc doesn't want to use Rayon.
B
Maybe it just wants to use parallel for loops in a few places. That's where, right now, I think the Rayon integration is very, very fine-grained about where it acquires and releases jobserver tokens, and that causes a huge amount of traffic to the kernel, and that's one of the main reasons I think we're seeing things slow down. So it's not actually explicitly related, as far as I know, to the current jobserver integration of the LLVM backend.
A
Okay, well, why don't we start in whatever order you think? I still feel like it'd be good to cover both those things if we can, so whatever order you think is best. But that makes sense, what you're saying, and it's actually encouraging, I guess, because it suggests we might be able to get better performance without as much restructuring. Yeah.
B
And so I do think it's worthwhile to talk a little bit about what we want to do with the codegen backend going forward, because I do think there are going to be some big simplifications we can make once we have a fully parallel compiler. So yeah, let's just go through this. Basically I'm fine doing whatever's easiest: I can just go down this doc and give the high-level points, and then we can drill into questions as they come up.
B
All right, so the number one thing to do here is actually, if you want to scroll down and click on the graph, that little picture-looking thing, that's going to be the easiest thing to take a look at. You can't really see my cursor anyway, so: the parallelism on the backend all has to do with this.
B
Right, so I'm going to bring up a blown-up portion of the picture. All right, this is perfect. Okay, so all of the parallelism that we get is not in the compiler itself, but rather through LLVM. So all codegen has to do is take the rustc data structures, like MIR, translate them to LLVM IR, and then, once we're at LLVM IR, we have to optimize it and then codegen it.
B
The translation phase has to be serial because it uses the type context, and that just inherently isn't Sync right now. So this big bar we see at the top right here, this is translation: this is the main thread performing translation. And then, as soon as we have an LLVM module ready to go... each LLVM module is completely independent from all the others in terms of LLVM data structures, and so it's safe to send across threads.
B
So you see threads picking up work from other places, except for kind of this one, thread 3, right here. But in any case, that's the main phase of parallelism: the main thread will translate stuff and send it over to other threads, and they will optimize it. The optimization even happens in debug mode; it handles things like inline(always), very, very fast things like that. And then what we're looking at here is actually an optimized build, and so there's this kind of wall at this portion, and this represents the ThinLTO passes.
B
So as soon as every single codegen unit from the left side has been optimized initially — this one from thread two is the very last one, so it finishes right about here — that translates to this: we do a tiny amount of work to do the ThinLTO analysis, and then, as you can see, we fork off a bunch of ThinLTO work to actually happen later on. And so that's where all the parallelism comes from. So this is sort of the high-level view.
B
It was a really crappy format for viewing these — crappy in the sense that it's not nearly as fancy as the current one. So this is very, very recent, from the profiling work, and to me it's so nice to view these kinds of graphs; I've barely been looking at these before. So as you look at them in real life, we can do a lot better. And that's the only section I wrote here about codegen units, which is that for parallelism this isn't too interesting.
B
So this bar at the top is the main thread actually translating to LLVM; that's mostly in rustc, because we're processing MIR. But yes, every single other one is predominantly us just calling into LLVM and saying: run your optimization passes — and that's basically the entire thing. There are little tiny pieces where it isn't, but it's almost all LLVM, and this is where we are actually executing LLVM on multiple threads.
B
As far as I know, yes. I think there were some efforts a long, long time ago to parallelize within a module, but I'm not really sure they ever got off the ground or landed in tree. Now, that could be wrong — it's been years since I took a look at this — but I'm relatively certain, because this would have been one of the number one things that Clang would want to enable.
B
All right, so here's kind of an overview of what can and cannot be parallelized. We have these five steps. The first one is that we have to actually split codegen units in rustc, just to figure out what goes where. That's kind of hard to parallelize: you can make the query parallel, but you can't parallelize it too much. Then there's the actual translation, which...
B
That's true, and I haven't done a huge amount of profiling, but this is a non-trivial amount of time in the compiler. It's not massive, just the splitting things up, but it's not tiny either; it's not inconsequential. But the next part, which is the most interesting one, is where we actually translate to LLVM: we're translating MIR to LLVM IR. Today that is not parallel, but it actually can become parallel once we get the whole parallel compiler.
B
That's the number one major simplification the codegen backend gets: we could just translate everything in parallel all at once, so we don't have to have this coordination between the main thread and the other threads and all that. That's a really big thing I wanted to call out about codegen: this is sequential today, but it can be parallelized. The ThinLTO analysis is by definition serial, so it's okay that we can't do anything about that, and then the actual work after ThinLTO is already completely parallelized.
B
At the beginning there's this weird detail we have: everything in the backend is controlled via a coordinator thread that is not the main thread, and the coordinator thread is the one that decides everything. It'll tell the main thread to codegen a module, it'll actually spin off a thread to go work on an LLVM module, it'll be the one doing the ThinLTO passes or sending those to a thread to do the passes. And so there is a very long comment...
B
You know, I'll click on it — a very, very, very long comment that goes on and on. I actually read that recently, and it's still pretty accurate and up to date in terms of what the coordinator thread does. But this is largely an implementation detail, and I expect it to almost entirely go away once we get a parallel compiler, because we just won't need this craziness.
A
Whoever did this work did it to, sort of, not do more than N at a time and reduce the peak memory usage, basically, as well as starting as soon as you can, which is good. And I'm a little bit nervous that if we just made it so that each query takes a codegen unit and runs it... I guess we kind of get some of that for free, now that I think about it.
B
Actually, there's a section I wrote specifically on this, and you're definitely right. It used to be like that: that stair-step didn't used to happen. When we first had parallel codegen, we translated everything serially and then did everything in parallel. That meant peak memory usage was literally every single LLVM module in memory at the biggest it's ever going to be, because it's unoptimized IR, plus the type context.
B
Everything stays live, so that was what killed us, and we needed to fix it. I actually tracked it down: it was some innocuous PR which doesn't sound like it's tackling exactly this, but it was when we introduced that stair-stepping effect, and the explicit purpose was to make sure we don't hold literally everything in memory all at once. So we still hold on to things, and I think one of the many things is that we want to try to...
B
We want to drop the type context as soon as possible after we translate everything. But on the other hand, we do actually have every single LLVM module in memory at once when we do ThinLTO, and so, if I remember right, we'd have to talk to mw because he might know for sure — he was the one investigating this. I'm pretty sure it came from Firefox, and I think it was incremental: with incremental you could have hundreds or thousands of objects, and having all of those...
B
The main thing is that we need to drop the type context as soon as possible. And this is an example specific to ThinLTO: if you didn't do ThinLTO, we would just cut out this step in general, and so the entire LLVM module would be dropped as soon as it could be, when we actually create the object file. So this is definitely something we need to keep in mind; we have to...
B
We must avoid that. It's a problem we must continue to deal with, but I'm pretty sure that we can largely let the job server limits and the inherent parallelism guide us, as opposed to having these weird heuristics for who's doing what.
B
Because that's all based on the fact that the main thread is the only thread that can translate things, whereas if we get to where every thread just picks up a module, takes it all the way to completion, and then starts the next one, that model is just inherently more naturally self-regulating, because you're working each module from front to back, so no one's producing everything all at once.
B
It's a little more difficult, because if you're not optimizing — so if you're not doing ThinLTO — then everything can be one parallel for loop, because you just want to translate and codegen. If you are doing ThinLTO, you have a synchronization point: we have to optimize first, we don't do codegen yet, then you do some stuff, and then you actually get into codegen after that. And so whether you have ThinLTO or just regular LTO enabled might add...
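For reference, a minimal sketch of the two pipeline shapes described above, using Rayon parallel iterators and made-up stand-in types and helpers (an illustration of the structure only, not rustc's actual backend code):

```rust
use rayon::prelude::*;

// Hypothetical stand-ins for illustration only.
struct Mir;
struct LlvmIr;
struct Object;

fn translate(_m: Mir) -> LlvmIr { LlvmIr }
fn optimize(ir: LlvmIr) -> LlvmIr { ir }
fn thin_lto_analysis(modules: Vec<LlvmIr>) -> Vec<LlvmIr> { modules }
fn codegen(_ir: LlvmIr) -> Object { Object }

// No ThinLTO: every module can go start to finish independently,
// so the whole backend is one coarse parallel for loop.
fn build_without_thinlto(modules: Vec<Mir>) -> Vec<Object> {
    modules
        .into_par_iter()
        .map(|m| codegen(optimize(translate(m))))
        .collect()
}

// ThinLTO: optimize all modules, hit a synchronization point for the
// (inherently serial) ThinLTO analysis, then fan back out for codegen.
fn build_with_thinlto(modules: Vec<Mir>) -> Vec<Object> {
    let optimized: Vec<LlvmIr> = modules
        .into_par_iter()
        .map(|m| optimize(translate(m)))
        .collect(); // synchronization point
    let work = thin_lto_analysis(optimized); // serial by definition
    work.into_par_iter().map(codegen).collect()
}
```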
B
Parallelism is actually really difficult to do within build tools, specifically because everything wants to use parallelism. One of the best examples of this is that Cargo is going to spawn num-CPUs processes: it's going to spawn as many rustc instances as it can, up to the number of CPUs. But if there were no rate limiting, no sort of limit on how much can be done, then each of those rustcs would spawn another num-CPUs threads, and if this is recursive, everyone can keep doing that.
B
And so you can very quickly get an exponential blow-up in the number of threads and processes on your system. So the general idea is that in the build process lots of things want to do various amounts of parallelism, but there's still one build process as a whole, and we want to limit parallelism across it in general. We want to make sure that Cargo treats everything nicely and all the rustcs coordinate, and especially in Firefox's use case.
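(Concretely: on a 32-core machine, 32 rustc processes each spawning 32 threads is already 32 × 32 = 1,024 threads, before any further recursion.)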
B
Firefox has not just a bunch of C++ files getting built but also Rust code getting built, and even across that they want a limit on parallelism to make sure you're not spawning thousands of processes. And so the solution for this is this thing called the jobserver. I believe it was pioneered by GNU make, and it's been ported to many platforms since. "Jobserver" is a name for a glorified IPC semaphore, which is a synchronization primitive where you put in N tokens and you can take out N tokens.
B
But if there are none remaining, you just block, and you can add a token back in at any time. And so the idea here is that Cargo typically creates the jobserver and the child processes inherit it — it's passed down via file descriptors, or literally IPC semaphores on Windows. So Cargo will create a token pipe with 32 tokens...
B
...if you have a 32-core machine, and then it will remove tokens as it spawns processes, and then each process internally will attempt to acquire a token if it wants to run parallel work. There's some weird inheritance stuff about the protocol that we can largely not care about.
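For reference, a minimal sketch of how the jobserver crate (the one Cargo and rustc actually use) exposes this; the command and token count here are illustrative, not Cargo's actual code:

```rust
use jobserver::Client;
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Parent (Cargo) side: create a jobserver with one token per CPU.
    let client = Client::new(32)?;

    // Pass the pipe (or, on Windows, the semaphore) down to a child process
    // by configuring its environment and file descriptors.
    let mut cmd = Command::new("rustc");
    client.configure(&mut cmd);

    // Child (rustc) side: rediscover the inherited jobserver from the
    // environment. `unsafe` because the inherited fds must actually be valid.
    // let client = unsafe { Client::from_env() }.expect("no jobserver");

    // Acquiring blocks until a token byte can be read; dropping the returned
    // `Acquired` writes the byte back into the pipe.
    let _token = client.acquire()?;

    Ok(())
}
```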
B
It makes it sort of difficult for us in a few places, but every running process always has at least one implicit token, because it's the token for that process, and then optionally the process can acquire more after that. And so the way it works in the backend right now is that our units of parallelism are extremely coarse: it's just an...
B
...LLVM module, which is massive. And so whenever we get around to it — we have a module we would like to optimize — we call this request_token method. I'm not going to talk much about where, but that basically tells the jobserver crate: please acquire a token and then call some callback that you previously registered, and then it...
B
...says: I've got four tokens, I've got three modules, so let me just go and start executing those. So this will manage it dynamically: it will acquire tokens and it'll spawn work as long as it has tokens. And then — I linked it in here somewhere, I forget where, but — tokens are immediately released as soon as we realize we have too many tokens. So it's not a perfect system; it's not nearly as clean as GNU make, but it's good enough.
B
It's kind of fine. So there's some thrashing of the jobserver tokens here, but it ends up coming out in the wash and not mattering too much, because the parallelism is so coarse and so large. This also means that once we acquire a token, as long as there's work to be done, we will continually hold onto that token. We're not acquiring a token for a module and then releasing it once that module is done; it's more...
A
Let me make sure I have this right: we've got an internal queue of modules to translate somewhere, I guess — sorry, modules to process with LLVM, to optimize — and we put them in there as they get done. As we create the LLVM IR, we stick it in this queue. Meanwhile, this other thing — someone else, I don't know who — is saying: a worker thread would like to start.
B
For tokens, the jobserver crate has this into_helper_thread, which literally spawns a helper thread that does the blocking reads and writes. That helper thread is, like you said, sent a message to do the blocking read, it does the blocking read, and then it calls this callback, and this callback is just sitting on a channel and saying: here, we have a token. So that's where we're sourcing tokens from: you request a token, and then eventually they come back on the channel every time.
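A minimal sketch of the helper-thread arrangement being described, using the jobserver crate's into_helper_thread and request_token APIs; the channel plumbing here is simplified relative to rustc's real coordinator:

```rust
use jobserver::{Acquired, Client};
use std::sync::mpsc;

fn main() -> std::io::Result<()> {
    let client = Client::new(4)?;
    let (tx, rx) = mpsc::channel::<Acquired>();

    // Spawns the dedicated helper thread. It sits in a blocking read on the
    // jobserver pipe and invokes this callback whenever a token is acquired;
    // the callback just forwards the token over a channel.
    let helper = client.into_helper_thread(move |token| {
        let _ = tx.send(token.expect("failed to acquire jobserver token"));
    })?;

    // The coordinator asks for tokens as work shows up...
    helper.request_token();

    // ...and eventually they come back on the channel.
    let token = rx.recv().expect("helper thread hung up");
    drop(token); // releasing = writing the byte back into the pipe

    Ok(())
}
```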
B
The tokens are literally stored in just a Vec. As soon as we get one, as you can see, we just push the token onto the local vector and save it, and that's it, and then we'll pump another iteration of this loop. The beginning part of this loop does a whole bunch of stuff — errors and so on, don't worry about that — and this is every single block, and it's really complicated.
B
We say: we have work, some amount of work to do, and the amount of work we have running is less than the number of tokens that we have, so therefore we can spawn more work. Say we have ten tokens and five units — so five things in the queue, two units running, and ten tokens — so we're going to spawn all of that, and this will just pop it off and kick off the work.
B
We mark them as running and actually go spawn the work, and then once we've spawned as much work as we possibly can is where we actually truncate. A lot of times, if you give us two modules we'll request two tokens, but if we only get one token, we might finish both modules with that one token.
B
Then when we get the second token, we still have to relinquish it. That's part of the clunky interface, where it's not as nice or clean as it might be, but it means sometimes we will acquire a token and then immediately release it back to the system. In practice I don't think this actually matters too much. Otherwise, once again, once we get a token we will never truncate it unless we have fewer things running than tokens we currently hold.
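The scheduling rule being described amounts to something like this simplified sketch (made-up Work type and spawn function, not the actual coordinator loop in rustc):

```rust
use jobserver::Acquired;

// Hypothetical stand-ins for illustration.
struct Work;
fn spawn_llvm_worker(_w: Work) {}

// One round of the coordinator's bookkeeping: spawn as long as we have both
// queued work and capacity (held tokens plus the process's implicit token),
// then give back any tokens we clearly can't use.
fn schedule(queue: &mut Vec<Work>, running: &mut usize, tokens: &mut Vec<Acquired>) {
    while !queue.is_empty() && *running < tokens.len() + 1 {
        let work = queue.pop().unwrap();
        spawn_llvm_worker(work);
        *running += 1;
    }
    // If fewer things are running than we could fund, drop the extra tokens;
    // dropping an `Acquired` writes its byte back to the jobserver pipe.
    if *running < tokens.len() + 1 {
        tokens.truncate((*running).saturating_sub(1));
    }
}
```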
B
The main thread will sometimes help out with LLVM work too, even though it's mostly the one doing codegen, and that dance of managing it will get a lot easier once we have one thread that, from start to finish, takes a module from the beginning of translation all the way through to the end — codegen, or just optimization if it's ThinLTO.
B
Correct. So technically, if you start rustc, it actually immediately spawns a thread with a bigger stack, and then, when you get to codegen, we spawn two threads: the coordinator thread and the jobserver helper thread. So those are three threads that we're starting unconditionally, and yes, there's only one helper thread for the entire process. And I think the current parallel integration — I don't actually know exactly how it works, but if the current parallel integration is using a helper thread, it's a different helper thread than this one.
B
That's sort of inherent, unfortunately. Ideally this would be some non-blocking I/O, where we could wait for a token for five milliseconds or whatever, but the file descriptors on Unix cannot be made non-blocking — make is not ready for its file descriptors to be non-blocking, so we can't set them that way. So we have to do blocking I/O, which kind of forces us to make a separate thread.
B
That was actually the specific feature that I requested, because this is morally what your CPU is doing, but actually this is a special flag being passed: please collapse thread IDs if you can. If you don't pass that, this actually shows you a giant waterfall of just one unit per thread — that's what's literally happening, so the threads are not literally being reused; it's just collapsed in the visualization.
B
Someone was asking about the cost of acquiring and releasing a jobserver token. On Unix, at least, it's a read or write on a pipe: it's a read of one byte to get it, and it's a write of one byte to put it back in there. So this is a very, very expensive operation if you're doing it on every iteration of a very tight loop, but if you're doing it once per LLVM module it doesn't matter at all.
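Concretely, on Unix the acquire and release described here are just one-byte reads and writes on the inherited pipe; a sketch assuming you already hold the two pipe ends (the real implementation lives in the jobserver crate):

```rust
use std::fs::File;
use std::io::{self, Read, Write};

// Acquire: block until one token byte can be read off the pipe.
fn acquire(read_end: &mut File) -> io::Result<u8> {
    let mut buf = [0u8; 1];
    read_end.read_exact(&mut buf)?;
    Ok(buf[0])
}

// Release: write the byte back so some other process or thread can run.
fn release(write_end: &mut File, token: u8) -> io::Result<()> {
    write_end.write_all(&[token])
}
```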
A
And I'd have to look at what I would expect — this is what Zoxc did, as I sort of remember it — in terms of how Rayon works. It has a notion of threads going to sleep when there's no work and then waking back up, right, and they end up blocked on a big lock, and it's something we tried to avoid doing too lightly. It seems like that would be the time to relinquish a jobserver token or get one back.
B
I think that rustc should not hold on to threads or hold on to tokens opportunistically unless there's actually work to be done. Although that could be fungible: I could imagine we hold on to a token for, like, two milliseconds, and if no work comes in within those two milliseconds we release it. But that kind of threshold is going to be really hard to tune, so ideally, I do think...
A
One thing I will say is that the version of Rayon — well, all versions of Rayon — have a pretty not-very-good algorithm for putting threads to sleep and waking them. It's kind of a binary algorithm, actually, where it sort of says either they're all awake or they can start to go to sleep, the point being that as long as there's work happening it tries to keep all the threads awake. In the last month or two I was hacking on an alternative one.
A
That would not be that bad: we'd sort of scale up the number of threads more gracefully based on how much work there actually is. There are a few benchmarks that regress — I haven't landed it yet — basically benchmarks where it turns out to be useful to have all the threads around grabbing work. But that might be why you're seeing so much acquisition there.
B
Yeah. The profiling I did was a long time ago — I don't think it's substantially changed since then — but thread creation was a big one, where I think just creating a bunch of threads was high on the profile. So it sounds like this would definitely fix that. The other one I'm now thinking of is that if every thread is waiting for a jobserver token, then that's definitely an issue, where as soon as a token re-enters the system...
B
Like, if you put one unit of work into the Rayon thread pool, then 32 threads will try to acquire a jobserver token to run that work, but there's only one jobserver token available, so one thread gets it and does it. But then, as soon as it releases it, every other thread sequentially gets it — that might be the issue.
A
I could certainly see that happening. I'm skimming over it — I mean, it's not too hard to look in the Rayon source to see where we release it right now — but it looks like we do indeed release the token when we go to sleep, as you might expect, and then we acquire the token again when we wake back up.
A
So probably, if you did do something like one unit of work being dropped in and that's it — though I'm not sure when that would happen; it would sort of correspond to a slow trickle of jobs, which we probably don't do. We probably have some master queue. I don't know where to look. Maybe not.
B
At least, I definitely saw at the very beginning of a profile, when we have tons of tiny crates that take just a few milliseconds to compile, spawning — I have a twenty-core machine, so spawning 20 threads per rustc every single time, merely for them to go back to sleep and die — that actually was causing a lot of CPU contention, a lot of time spent in the kernel as opposed to in the compiler.
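(Rough arithmetic: at 20 threads per rustc, say a hundred tiny crates at the front of the build means on the order of 2,000 threads created and torn down in those first few seconds, which is where the kernel time goes.)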
B
I do think we need to look a little bit more into what exactly is going on with the jobserver stuff and Rayon right now. A lot of this is kind of guesswork on my end, and I'm trying to see how to improve it, but I don't know for certain what's going on.
A
This comes back to the question of whether we want rustc to have a dependency on Rayon in the first place. I have mixed feelings about that, though. It's not like this is obviously bad — right now, with Rayon's current scheduler, it is, but if you make Rayon have a smarter scheduler, it's not so bad.
B
I was actually curious there. In terms of getting there, there was this idea: what if we only parallelized, like, the top-level for loops? Given rustc's workload, that might actually get almost all of it — not all of it, but it might be a huge win just doing that, and it would be relatively easy to manage the jobserver for, because we have one function that just does the thing. And so I would be curious if, like...
B
...we ripped out all the jobserver stuff and ran a perf run per crate — so no jobserver overhead, just what we currently have — but then also ripped out what we currently have and made it parallel only at the top level. I would be curious to see the timing comparisons between those two. I don't actually know; we might be losing a lot of opportunities for parallelism in the current compiler if we only parallelized at the top level.
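A sketch of what "only parallelize the top-level for loops" might look like, with hypothetical stand-in types (rustc's real loops sit behind its own parallel-iteration helpers):

```rust
use rayon::prelude::*;

// Hypothetical stand-ins for illustration.
struct Item;
fn check_item(_item: &Item) {}

// Instead of fine-grained Rayon work-splitting everywhere, run the big
// per-item passes as one coarse parallel loop each.
fn check_crate(items: &[Item]) {
    items.par_iter().for_each(check_item);
}
```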
A
Looking at the current source just with a quick grep — of course it's too bad Zoxc isn't here; I'm sure he has the most up-to-date information on this — I see that we are doing the collector in parallel. So that was the thing that you said can't be parallelized, and we are indeed parallelizing it. That is to say, when we basically enumerate...
B
Is there anything along those lines? Like, I think at one point, for the type-checking passes, like query typeck, you had one query, but the loop is all happening internally — or the parallelism comes from the tree-like structure — but that doesn't really show up anywhere. It's mostly parallelism through for loops, yeah.
B
Like, I know when I tested it — I tested with Cargo — during the first five seconds everything was grinding to a halt versus flying through crates. So profiling that with and without the jobserver would be a way just to see how bad the jobserver integration is, how much it costs right now, because the other thing that could be costly is spawning threads, where every single tiny rustc instance spawns 64 threads, or 28 threads locally for me.
A
Okay, so we're trying to figure out how much we're costing. That's actually pretty interesting, because when we were talking about the -j1 overhead, we kind of said something about percentages, or we looked for absolute noticeable values. But what you're saying is: if you have some tiny compilation, relatively small, hello-world-style small crates, and there's a large number of them, then even if they're individually not very expensive, it adds up over the course of the cargo run.
B
It's inherently a very difficult thing to measure an entire cargo build, because you don't care about instruction counts at that point, you care about time, and there are so many variables that getting any precise timing, or getting steady measurements that come out clean every time, is really difficult. So it's not the five or ten percent small wins here and there; it's more like: you should be able to download a parallel compiler, turn on parallelism, type cargo build in a big project, and it should feel faster.
C
One thing I kind of want to ask: do we think it's viable that, if we have sufficient parallelism in the compiler, the long-term future is maybe we say Cargo is no longer sort of -j16 and only spawns five rustcs at a time, with the idea being that it's more advantageous for us to parallelize internally? Or is that not really something we would expect to be interesting?
B
I hadn't thought about that. I would maybe say that Cargo will always be better at parallelism, because it's just so simple: it's just processes, and those are guaranteed to be parallel and guaranteed to saturate as much as they can. So it's a question of: if Cargo can spawn a process, should it not, because rustc might do a better job of keeping those cores busy? And I would say, I feel like, actually probably not, because...
A
Really? I mean, I think what you're saying, Mark, if I understand, is that in the beginning there's a lot of parallelism available in the form of crates, so those crates effectively get only one jobserver token each, you know, and therefore they will never get any benefit from their locks — that seems correct, I believe. Yeah, I mean, one question is how low we can get the overhead, and the other question is, I guess — dear God — in the most extreme version we might have two versions of rustc.
A
If I were to take that Rayon branch that is less eager — it'll still start the threads, but it's less eager about waking them from sleep — and produce a rustc branch from it, how hard would it be to get some measurements out of it? What kind of measurements would we want to get?
B
An easy one is perf: single-crate performance, making sure it doesn't regress and seeing if it actually improves. Another one is: given that branch, we would produce full compilers and just test the built compiler against the previous commit, but single-threaded, and take that to build a project and see what happens. I guess I want to see the load average for a compilation be almost at the number of cores for almost the entire time.
B
That should be what we're seeing — and also, it's not just the numbers, it's what users feel, whatever that is. It's more difficult to measure, it's more subjective; I mean, you can put numbers to it, but that's what I would expect: just take it to a couple of projects, build from scratch, and see how it fares.
A
Okay, I'm thinking about how I have to go revisit this. I mean, one of the challenges here is we don't really know how much help we'll get. Rayon is kind of what we want in the sense that work stealing is reasonably well designed for these cases where you don't really know how much help the other threads are going to be, which is exactly what we have going on here.
A
So I guess just having a central queue and pulling jobs from it would work too, as long as it's flat. But we do things like — you know, Rayon will do things like divide the work into chunks that keep getting smaller, such that if it looks like you're on your own for this loop — because when you've finished a chunk and nobody has picked up any of that work from you — then you go on and don't bother to do further subdivisions, and stuff like that.
B
These are the cases rustc is going to hit, so it might just require a lot of in-depth investigation. I think we're still very vague about what the cost of the jobserver is and where exactly the problem is — we're kind of hoping that this fancy new scheduler will fix everything — but once we have something concrete to work with, we can just go and investigate and do a really in-depth analysis to figure out what's going on there.
C
One sort of note I can make is that I've been thinking about and looking at adding the Cargo timing graphs to perf, because we always build the whole crate graph from scratch, and those graphs seem useful. They might not be entirely precise and they probably have high variance, but that could be helpful here as well, because it'll give us some insight into overall CPU usage with varying compilers.
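(The timing graphs referred to here are presumably the ones produced by Cargo's then-unstable -Z timings flag, which charts when each crate compiles and how many compile concurrently over the course of a build.)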