From YouTube: Faster Rust Builds
Description
1:30- Stuart Pernsteiner, "Faster Rust Builds" MV
Hi, I'm Stuart, and I have been working this summer with the Rust team on getting faster build times for Rust code. So today I'm going to talk a little bit about some of the reasons why Rust has had trouble getting fast build times, and some of the approaches that we're using to speed things up.

So one of the big reasons why Rust has trouble with slow build times is that the Rust compilation model doesn't exactly make things easy.
Here is what the compilation model looks like for C and C++. In this compilation model you have many independent source files, and on each source file you make an independent invocation of the compiler to produce an object file. Then, once all of your compiler invocations are done, you take all of your object files and link them together to produce the final executable. I'm just going to ignore headers and libraries, to keep things simple.
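As a concrete sketch of that model (the file names here are made up for illustration):

```sh
# Each source file gets its own independent compiler invocation...
gcc -c main.c -o main.o
gcc -c parser.c -o parser.o
gcc -c eval.c -o eval.o
# ...and then the object files are linked into the final executable.
gcc main.o parser.o eval.o -o interp
```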
So with a compilation model like this, there are a few pretty straightforward things that you can do to speed up the build times. One option is to use parallel builds: all of these GCC invocations are independent, and you can run them all in parallel. So if you have a quad-core machine, you run make -j4 or something like that and get a roughly four times faster build. Another option is to do an incremental build.

If you've already built the project once and you've only changed one of the files, then you can skip running GCC on the unchanged files and just reuse the old object code when doing the linking. This can also give you a pretty significant speedup on your build.
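A minimal Makefile sketch of both tricks, assuming the hypothetical files from above: make's timestamp checks give the incremental behavior, and make -j4 runs the independent compiler invocations in parallel.

```make
# Hypothetical project: `make -j4` compiles these in parallel, and a
# rebuild after touching only parser.c reruns just that one gcc call
# before relinking.
OBJS = main.o parser.o eval.o

interp: $(OBJS)
	gcc $(OBJS) -o interp

%.o: %.c
	gcc -c $< -o $@
```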
Now, the Rust compilation model looks like this, and the main difference here is that there's only a single rustc invocation for this entire project, which means we can't run anything in parallel; there aren't any parallel rustc invocations that we could run at the same time. And on top of that, we can't really do incremental builds, because, basically, if one of these files changes, the single .o file for the entire project will also change, and we have to rerun rustc to rebuild that .o file. There are no redundant rustc invocations that we could skip.
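In build-command terms, the whole crate is one unit (again with a hypothetical layout):

```sh
# One rustc invocation covers every source file in the crate; there is
# nothing for the build system to parallelize or skip.
rustc src/main.rs -o interp
```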
And another problem here is that those GCC invocations were each running on a fairly small input, a single source file, which means maybe a few thousand lines of code plus a few thousand lines of headers, which are mostly declarations, not actual code. Whereas rustc here is running on every source file in the entire project, and that can be hundreds of thousands of lines of code.
So, since the standard approaches to speeding up build times aren't going to work with this compilation model, if we want to make things faster we have to make changes to rustc itself, and get that single rustc invocation to not take as long.
So let's take a look at what is actually happening inside rustc. When rustc runs, it takes all of the source files for the project, parses them all, runs them through type checking, and then translates all of the Rust code into LLVM intermediate representation, which it passes off to the LLVM compiler back end to run optimization and code generation. At the end of the LLVM passes, you get out the object file.

Now, it turns out that if you time how long these different steps take, the type checking and translation phases only account for about a quarter of the build time.
The time spent in LLVM accounts for three quarters of the build time, which means the LLVM invocations are a good target for us to try to optimize. And so my work this summer has been mostly focused on being able to parallelize our calls to LLVM.
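(As an aside, you can get this kind of per-phase breakdown yourself with rustc's unstable -Z time-passes flag; a sketch, and the exact output format varies by compiler version:)

```sh
# Print how long each compiler phase takes; the LLVM passes dominate
# at higher optimization levels.
rustc -O -Z time-passes src/main.rs
```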
So, basically, instead of having a single LLVM IR module and a single invocation of the LLVM code generator, we'd actually generate multiple LLVM modules and run multiple optimization and code generation steps in parallel.
The interesting aspects of this approach wound up being the top and bottom parts of this diagram: the part where you take the output of translation and try to split it into multiple LLVM modules, and then, at the end, when you have the output of multiple codegen passes and are trying to get it into a single object file. For the splitting into multiple LLVM modules, I described it sort of like this: you take the output of translation and split it. But it's actually not quite that simple.
These LLVM modules, if you try to duplicate or split them, will actually end up sharing some internal data structures between the two halves, and the result is that if you try to run these on separate threads, they will make unsynchronized accesses to those data structures; you will get race conditions and, usually, segfaults.
And then the final step of this process comes after code generation has run. Each code generation pass, so each thread, will produce a separate object file, and we would like to combine those together into a single object file, since that's what the rest of the build process expects as output. It turns out that to do this we can actually just use a linker feature called incremental linking, which combines multiple object files into another object file that you can feed into later linking steps. So that's actually pretty straightforward.
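(On most Unix toolchains this is the linker's relocatable-output mode; a sketch with made-up file names:)

```sh
# ld -r performs a partial ("incremental") link, merging several object
# files into one object file that later link steps consume as usual.
ld -r unit0.o unit1.o unit2.o -o combined.o
```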
So the parallel codegen stuff actually provides a pretty significant speedup to Rust build times. The blue line is total build time, and you can see that it decreases as we add more threads, up to four threads, which is the number of cores that I had on the machine that I ran these tests on.
You can also see the red line, which is the time that the Rust compiler spent not in LLVM. That time increases slightly, because there is some overhead from splitting up the LLVM module into multiple pieces, but it only increases a little bit, so it's totally outweighed by the benefits of running multiple LLVM steps in parallel.
And so, one downside to this parallel-build approach is that, where previously we had a single compilation unit and LLVM could see basically all of the code for the entire project at once, and therefore could do inlining on basically any call, now that things are separated into multiple pieces, there are boundaries between them where LLVM cannot see the target of a call, and so it might be prevented from doing inlining in some cases; there are also some other optimizations that this hinders. But it turns out, if you look at the runtime of the generated code as you increase the number of threads, it goes up somewhat, to about fifteen percent overhead, which is, I guess, fairly significant, but definitely tolerable considering the benefits that you get in build time.
So if you're doing a development build, and what you mainly care about is being able to edit the code, recompile, and test it very quickly, then parallel codegen is a good choice to let you do that. For release builds, obviously, if you want the maximum performance out of the compiled code, then you should stick with one thread, and hopefully you can deal with the longer compilation times. This parallel codegen feature is already available now in Rust master.
You set this flag to however many threads you want to use, and it will separate out the code generation into that many pieces.
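(For reference, in rustc this is the -C codegen-units flag; a sketch, assuming a default crate layout:)

```sh
# Ask rustc to split code generation into four units and run their
# LLVM optimization and codegen passes on four threads.
rustc -C codegen-units=4 src/lib.rs
```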
Another optimization approach, which I don't think I have time to talk about today, is incremental codegen, in which we only run translation on parts of the project, or rather, we only run translation on functions that have actually changed.
There is also another effect, which is that there is some code that has to be duplicated into every compilation unit, into every LLVM module. So even if you had, you know, a thousand-core machine, if you tried to divide your project into a thousand LLVM modules, you would not get a thousand-times speedup, because there would be fixed overhead in every module.
How big is the fixed overhead? It depends on the project. I think currently anything that the programmer has marked as inline will get duplicated; so code in the current project that is marked inline will be duplicated into every LLVM module. Also, if there are calls from two different modules to external code that is marked inline, the external code will be copied into both of those modules. So it doesn't necessarily get copied everywhere, but it can be copied multiple times.
Hello, can you hear me? Yes? Yes, you can hear me, okay. So first of all, this is awesome work, thanks for doing this. I just had a question about the performance numbers that you showed. You showed a chart with decreasing build times as you added additional threads, but I was wondering...
So I don't think there's any significant overhead introduced by the parallel codegen infrastructure. If you have parallel codegen turned off, you should get the same performance as before, because it's essentially generating exactly the same IR as it would have before, and doing exactly the same optimizations as it would have before.
So the question is, I guess: how much time is spent in optimization versus how much on code generation? That depends on how high you set the optimization level. At -O0 it spends basically no time in optimization; at optimization level 2 it spends about two-thirds of the time on optimization and the remaining one-third on code generation. That's for librustc, which is, you know, the main component of the Rust compiler and our biggest library, on the machine I was testing on.
The question is: is this on by default? It is currently not turned on by default anywhere. The plan currently is to have it turned on by default for low optimization levels. So if the user does not request -O2 or higher, then we would turn this on by default; otherwise, if they do request -O2 or higher, we sort of expect that they really care about performance and don't want to take the ten-to-fifteen-percent performance hit, and so we will leave it as a single-threaded compilation.
The question is, for turning this on by default, do we have a way to find out how many cores the machine actually has? I'm not sure whether rustc has such a mechanism. Okay, so we could use that, although the current plan was to only enable basically two-threaded optimization by default, because enabling higher numbers of threads can actually cause worse build performance on very small projects.
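(If you do want to match the unit count to the machine yourself, here is a sketch using GNU coreutils' nproc; the default discussed above would only use two threads:)

```sh
# Use one codegen unit per available core (nproc reports the count).
rustc -C codegen-units=$(nproc) src/lib.rs
```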