From YouTube: Rust Linz, February 2021 - Will Hawkins - Comparing performance of range- and counter-based loops
Description
In this short session, I will share the story of investigating the difference in performance between a `for v in expression` loop and a `while` loop. The talk will cover topics in the Rust language itself, compiler optimization levels and how to benchmark performance.
Code: https://github.com/hawkinsw/speed_limits_of_loops_in_rust
Will on Twitter: https://twitter.com/hawkinsw
Rust Linz on Twitter: https://twitter.com/rustlinz
Submit your talk: https://sessionize.com/rust-linz
On the web: https://rust-linz.at
My name is Will Hawkins, and at any point feel free to reach out to me over email (hawkinsw at gmail.com) if you have any questions, or you can find me on Twitter (@hawkinsw).
They look very different from the ones in the United States, so I hope I got it right (oh good, good, I'm glad it is). So we'll talk a little bit about speed limits and how fast we can go with looping.
Then we'll talk a little bit about writing canonical loops in Rust, or what the Rust book would recommend that you write. Then we'll talk about writing fast loops in Rust, which suggests that there might be a problem with canonical loops. We'll draw some tentative conclusions; we'll talk about release mode, which is something you always have to be aware of; we'll address any persistent speed differences; and we'll draw some final conclusions. I hope we'll be able to address any questions as we go.

So here we go: the motivation for this talk.
I started by trying to benchmark a simple function call. I was trying to answer the question of whether it is faster or slower to call a function with move semantics than with sharing semantics: passing a reference, versus passing something movable and just giving up control of it to the function.
I don't want to presume that everyone understands the difference, and I'm happy to explain it, but it's incidental to what we're talking about here, so you don't have to understand it. The point is that my motivation was to time a certain sequence of operations.
So we're going to do the repetition here, x times, in order to get an average, because if you do it only once or twice, you never know what system-level artifacts might appear and skew the measurement. I hope that makes sense to everybody.
So, just in pseudocode, what that might look like is: we have a function main, which is where we're going to start our program; we're going to name a variable `before`; and we're going to get the system time using `now`.
All right, great. So, just for those of you following along at home: oh, I like Not-Matthias's answer, that's really, really good, I like that, but let's go with Rainer's and Abik's answers (I hope I'm pronouncing that correctly), and we'll talk about iterators. With an iterator, what I would do is write a canonical loop with a `for`: `for` something `in` some iterable range.
So let's look at our first code example and turn our pseudocode into some real code. I hope that's big enough for everyone to see; if it's not, prior to the presentation I sent out a link to the GitHub repository so that you can follow along, and this is the benchmark-iter.rs file.
What you can see on line seven is that we've got this function main, just like in our pseudocode. We're going to define x here to be some relatively large number of iterations, and we're going to take the system time before the operation. Lines 12 through 14 are where we get down to business, and we can see the recommendation that Rainer and Abik made: this is the kind of loop, the canonical kind of loop, that Rust recommends.
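The transcript doesn't reproduce benchmark-iter.rs itself, so here is a sketch of the shape being described (the function name, the iteration count, and the empty body are illustrative, not the repository's exact code):

```rust
use std::time::{Duration, SystemTime};

// Sketch of the iterator-based benchmark: take the time, run the
// canonical `for _ in 0..x` loop, and measure how long it took.
fn time_iter_loop(x: u64) -> Duration {
    let before = SystemTime::now();
    for _ in 0..x {
        // the operation being benchmarked would go here
    }
    before.elapsed().expect("system clock went backwards")
}

fn main() {
    // "some relatively large number of iterations"
    let x: u64 = 10_000_000;
    println!("iterator loop took {:?}", time_iter_loop(x));
}
```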
We'll get into that in a little bit, but the question again (Rainer repeated it, but I'll repeat it once more): does blanking the iterator variable, writing `_` instead of a name, help speed up the loop, or does it only help with linting? That's not a question I addressed directly, but we'll actually see what's going on as we go through, and my guess is that it really only helps with linting.
So, a very, very good question. Oh, I see a comment here from Not-Matthias that talks about how LLVM can optimize the loop, and we've also got some good comments coming in from StableMinor. I don't want them to ruin the surprise, so if you're interested in knowing the answer before we get there, you can read the comments in the Discord channel; otherwise, follow along with me.
So that seems like an awfully long time to take for just doing nothing over and over again. That was slower than it needed to be, right? Probably way slower.
So let's take a look at what's actually going on under the hood. One of the ways you can do that is with objdump. objdump takes the binary that the Rust compiler, rustc, generates and lets you look at the machine code that actually gets executed on your system. I find this very, very interesting, and I love looking at object code. It's not something that everyone likes to do, and it's also relatively intimidating for people, but it doesn't need to be; it's usually very straightforward. So what I'm going to do is use the object code files, generated by the compiler for these examples, that I've annotated, just to give us a sense of what's going on. So let's take a look at those.
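If you want to reproduce that inspection yourself, the disassembly can be produced like this (the binary name is assumed here; substitute whatever cargo actually built for you):

```shell
# Disassemble the debug build: -d dumps the machine code of each
# function, and -S interleaves source lines when debug info is present.
objdump -d -S target/debug/benchmark-iter | less
```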
So what we're looking at here (let me know if this is big enough) is the machine code that the compiler generated for this loop, for this main function.
I've highlighted some really interesting parts of the code. The first thing we're going to do is create some space for the local variables; we're going to initialize that x value; we're going to call SystemTime::now and store the result in `before`; and then we're going to make the `0..x` iterator, the big iterator.
Now, what's very interesting about this: you can all go back and look at it on your own at any time, with the comments, and feel free to do so.
So what I'm going to do is rewrite this benchmarking function with a simple while loop, which is not the recommended way to write loops in Rust, but it's definitely a way to do it. I've already got that file written here; it's called benchmark-while.rs. What you'll see is that the setup and the operations before the loop are just the same as they were before: we're doing the same number of iterations, and we're taking the before time. The only difference is that we have to take care of making sure that we execute exactly the right number of times, so we create our own counter variable. It's mutable, so we can change it, and it starts at zero.

We execute the body of this loop while that value is less than our upper limit, we add one every time, and we call the time function again after the loop. Everything after the loop is the same as it was before, no difference at all. Interesting.
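Again as a sketch, not the repository's exact benchmark-while.rs, the while version being described looks roughly like this (names and iteration count are illustrative):

```rust
use std::time::{Duration, SystemTime};

// Sketch of the while-based benchmark: the setup is identical to the
// iterator version; only the loop construct changes, and we now manage
// the counter ourselves.
fn time_while_loop(x: u64) -> Duration {
    let before = SystemTime::now();
    let mut i: u64 = 0; // our own counter: mutable, starts at zero
    while i < x {
        // the operation being benchmarked would go here
        i += 1; // add one every time
    }
    before.elapsed().expect("system clock went backwards")
}

fn main() {
    let x: u64 = 10_000_000;
    println!("while loop took {:?}", time_while_loop(x));
}
```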
Okay. So now that we understand that, let's go back and see how long this actually takes to run. Again, I'm going to make sure that cargo knows I'm executing the right file here.
Wow, that is significantly faster than the one before. This time it only took two seconds to do all those iterations, instead of the 22 seconds it took before. That is a significant difference. So what comes to mind is that there's something not right here, something not right. So again, let's look at the code that the compiler is generating and see what's going on. Again, I've annotated that code and placed it in this repository that you can all see.
What you'll notice here is that at the top of the loop there are no more function calls; there is no function call here to get the next value of the loop. That's because everything is taken care of without calling a function: the incremented iterator, the next value, is calculated with straight machine operations.
All right, good, so this is not a fact. Tell me... there you go. All right, now we're getting some good answers. I like this: Peter and StableMinor and Felix, I believe, have already sort of thought this through. What am I missing here? What am I missing about what I've just shown you that might explain this behavior?
Exactly. All right, so: a simple, simple optimization here. With the cargo build operations we were doing before, what you'll see, basically, is the compiler telling us that this is an unoptimized build. It's unoptimized, and it also includes debugging information, which is helpful for finding problems in Rust code, but it's not useful for trying to get the most speed out of the compiler.
Now, if I execute cargo build again, I'm going to get the debug version one more time. What I need to do to get cargo to build an optimized version is to build in release mode. In order to do that, I just pass the --release flag and away I go. What you'll see is that the "Finished" line here indicates that we built a release version, and it is indeed optimized. What's really cool is that it didn't take any longer to build than the unoptimized version.
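The two builds being compared look like this on the command line (the exact "Finished" wording varies a little between cargo versions):

```shell
cargo build            # debug: "Finished dev [unoptimized + debuginfo]"
cargo build --release  # optimized: "Finished release [optimized]"
```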
However, if we try to debug an optimized version of the code, it's going to be a little bit more difficult. Now, we can't just do cargo run to execute the optimized version; we have to actually call the binary, the program that was generated, directly. So we'll go into target/release and run it.
So this lets us execute the optimized version of the code that we compiled. Oh boy, zero time between; that's really fast. So that was our big problem: we weren't optimizing the loop before, and once the compiler had its way with our loop, the time between went to zero. So now let's also compare that to how the optimized version of the while loop runs. We'll do cargo build... cargo clean, I'm sorry, to get rid of the old version.
I can't read quickly enough in the chat, but someone gave me a really cool answer to a question I've wanted answered for quite some time, and I'm going to tell you in just a second what it is. We'll do cargo clean again just to make sure, then cargo build --release. And now, what I never knew was this: you can do cargo run --release, and you get the exact same thing as running the binary, running the program, directly. Very cool. So there you go.
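In other words (binary name assumed for illustration), these two invocations are equivalent:

```shell
cargo run --release            # builds if needed, then runs the optimized binary
./target/release/benchmark-while   # same thing, invoked directly
```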
The answer is that the while loop now takes just as long as the iterator loop, which is great. So let's go back and draw some more conclusions. That was a close call, people; that was a really close call. We almost discovered a situation where the Rust book recommended doing something in a canonical, very readable fashion that was also slower.
What we notice here, upon closer investigation, is that the optimizer has done far, far, far too good a job. What happened is that the loop, in its entirety, was taken away by the compiler. The compiler realized that that loop meant absolutely nothing to the outcome of our program, and therefore it was entirely optimized away.
So we have to go back here and label this a tentative fact. We're not really sure that for-in loops and while loops take the same amount of time, but we're going to investigate whether this is actually the case.
I like a couple of the answers I see in the Discord chat. "Benchmarking is hard": yes. Oh, it's almost like Abik has read my mind about what's going on. I like the comment "no code is actually the safest code"; that's exactly right. I work in safety and dependability, and my boss used to say that the safest airplane is a rock, because it never goes off the ground. So: no code is the safest code.
Yes, Rum (I'm sorry, I can't pronounce your screen name), I like that comment a lot. Abik, I like your comment too, and that's exactly what we're going to do. So the takeaway from release mode, and from what we just saw, is that the optimizer does too good of a job optimizing our code and doesn't give us an accurate way to benchmark.
So what we need to do instead is use something called volatile variables. When we label a variable volatile, it tells the compiler: don't optimize away the operations that are done on this variable, because they matter. The name volatile comes from the C language, where you can just write `volatile` and then the name of a variable, and what programmers use volatile for is to make sure that the operations actually go to memory rather than being optimized away.
Thanks, Rainer, I appreciate that. I won't take it personally, but I don't have the qualifications to pronounce that user's screen name correctly; I would have to be Austrian, which I'm definitely not, so I apologize for that. But back to the task at hand: Abik was exactly right that what we can do is use the volatile crate in Rust in order to accomplish the same thing that we could have done in C.
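The talk uses the volatile crate; as a sketch of the same idea using only the standard library's volatile primitives (`std::ptr::write_volatile` and `read_volatile`, which the crate wraps), the benchmark body can be kept alive like this. Names and counts are illustrative:

```rust
use std::ptr;
use std::time::{Duration, SystemTime};

// Volatile accesses may not be elided by the optimizer, so the loop
// body can no longer be deleted wholesale in a release build.
fn time_volatile_loop(x: u64) -> Duration {
    let mut sink: u64 = 0;
    let before = SystemTime::now();
    let mut i: u64 = 0;
    while i < x {
        // A volatile store: the compiler must assume this write matters.
        unsafe { ptr::write_volatile(&mut sink, i) };
        i += 1;
    }
    let elapsed = before.elapsed().expect("system clock went backwards");
    // A volatile read, so the stores are observably used.
    let _last = unsafe { ptr::read_volatile(&sink) };
    elapsed
}

fn main() {
    println!("volatile loop took {:?}", time_volatile_loop(10_000_000));
}
```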
So let's see what that looks like. I've got this version here that I've rewritten, and I'll let you browse it on your own later, but what you'll see is that nothing major has changed. We're doing the two loops back to back instead of having them in two separate files, but we're going to do the operation the same number of times.
All right, so now, in microseconds, it looks like they both take roughly the same amount of time. Perfect. Now let's push on this a little bit more; we're not really satisfied that one answer is enough to prove that we are correct. So let's do another couple of iterations, and in order to do that, I've set up a little script that will run this a few times and make a CSV file that we can use to actually plot the results.
So let's use LibreOffice here to open this file and see what we get. I'm going to quickly grab this, make a little graph here, make it a line chart, put some lines on it, and we'll see what we get. All right, pretty impressive stuff there. What we see is that most of the time spent in the two loops is roughly equivalent. We've done 15 iterations, and we see that they track each other pretty well.
One is not necessarily always faster than the other. Interesting. So I hope that is enough evidence for you to believe that these two types of loops take roughly the same amount of time.
This might be something you're interested in exploring more if you've never heard of loop unrolling before. I'm not going to go into it right now, because I don't want to bore anyone, and in the questions afterwards I'm happy to explain it. But here's the big surprise when I go look at the loop body of the second loop.
I just wanted to say thank you to the Creative Commons providers for the "fact" logo that I got off of the Noun Project; I want to give them credit, as appropriate, under the Creative Commons Attribution license. And now I think Rainer is going to pop in and direct some questions to me. I hope that you all enjoyed the presentation, and I hope that you will give feedback if you didn't enjoy it; if you think I can improve, I would always love to do better.