Rust Programming Language RustConf 2018, 6 Sep 2018

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: RustConf 2018 - Benchmarking and Optimization of Rust Libraries by Paul Mason

Description

RustConf 2018 - Benchmarking and Optimization of Rust Libraries by Paul Mason

As we develop the Rust eco-system we have a goal that "Rust should provide easy access to high quality crates". This means libraries must be both ergonomic and perform well. Rust provides various frameworks to help benchmark libraries however achieving performance past a certain point requires knowledge of some deep language constructs.

This talk explores a journey towards benchmarking various library functions consistently and fairly and consequently exploring strategies for optimizing performance.

A

Hello, can you run here me: okay, yep, Co and so yeah, as I mentioned I'm Paul, talking about benchmarking and optimization of rust, libraries and just as a little bit of intro about Who I am so I'm, currently principal engineer at human interest. Human interest is a 401k processing company and in San Francisco you might be able to tell I'm not from there.

A

So I've got an accent and I be working on rest for a number of years and, as a result, home had to create a number of libraries and one such library that I'll be talking about in a lot of these examples as a rusty, small library, which is a fixed precision, decimal number library written in rust so being told, like my accents, had to understand so I thought I would start off by giving a little bit of a 101 on the New Zealand accent.

A

So and essentially, if you take the valves, you take up the oh and then you move the sounds of the vowels over one. That's the New Zealand accent. So if I say the word pan, you might hear this and I say: pin you may be hearing that and likewise, if I say pen, sometimes you might be hearing there. So if you're hearing the word and yellow then you're going to be good.

A

But if you're hearing the word on the right, you may need to reverse engineer what I'm saying some other times and so on to the talk and so the rest ecosystem is growing and when I first put together these slides, it was just under 17,000. This morning it was around seventeen thousand seven hundred it's very soon.

A

It's going to be eighteen thousand eighteen thousand crates, and so, with all those crates we we're starting to use them in our in our projects and we're implicitly relying on them and live downstream dependencies to really do the job correctly and securely as well as quickly and so as a as a case point really is.

A

If you're using a decimal number library, you don't want that to be the the performance bottleneck of your system, and you also don't want it to be the source of bugs as well and so to really be able to understand the performance of a system. The first thing we really need to do is be able to measure it and really that the idea of this is so that you can understand whether you're, making positive or negative impacts to your to your library and so just to really intro into this.

A

Is that there's a couple of terminologies that you've probably heard around there micro/macro benchmarking, so micro benchmarking is where you're measuring a very small unit of performance and macro benchmarking is where you're trying to simulate customer application workloads, and so, when we're looking at some of these libraries, we may need to be considering both and we'll go into some of those.

A

Some of that detail throughout this talk, and so so just to begin with in order to benchmark, we need to know some of the tools that are out there so and cargo bench is the most obvious one to start up start off with and the reason. Why is because it's included with cargo. It leverages the test, crate test, crates, internals and stable at the moment, which means that it can only be run on lightly. Only but getting it up and running is relatively straightforward.

A

So, essentially, we need to include the the test feature and from there it exposes the bench attribute the bench attribute in the case what function should be called under and when the bench mark runs and within each of those functions, you'll see that this and no function, so everything inside there is what's being benchmarked and in this first example, you'll see that the the sub functions being at impact in this particular example here and the results being passed through to the black box and function just immediately afterwards, and the reason that's happening is because we need to make sure that the compilers not doing any sort of optimizations.

A

That means that we're not actually testing what we think we're testing within that edit function so running these results is pretty straightforward. It's a tabular format.

A

There we've got the names of the tests on the left and we got the number of nanoseconds per iteration on the right and on the far right, there you'll also see the variance between the top and the bottom tests, and so over time, you'll probably want to be able to compare these two make sure you are either doing things if things are improving or regressing, and so this various ways of doing this and you could be using spreadsheets, you could be using all sorts of archiving strategies, but one tool that makes things a little bit easier as cargo bench comp and that's a cargo plugin which, essentially, if you pipe your beach results out to file, you can pass those files through the bench comp and it will spit out a tabular form like so so.

A

The first thing you probably notice in this in this particular table is the red and the green, the red. In the case, it's something bad happened. A regression and the green indicates that something good happened. So either this the same performance or an improvement in performance, and so cargo is pretty its sorry carrier bench is included with cargo. So it's pretty low effort to get set up, it's very fast to compile and execute, and you don't need any external crates within your cargo tamil file, and you know with a good archiving strategy.

A

You could also compare results over from version 3 division 1, if you wish to, and that the major downside, though, is that it's nightly only and performed. The reason for that is that performance may be different and stable. You may be testing against experimental compiler features or whatever else, which means your results may be a little bit different cargo bench comp can be a little bit sensitive to thresholds. So in that particular. In the previous example, there was green and red.

A

I was actually running that against the same code without making any changes, and so essentially you can control the threshold, but you've just got to be making sure you're comparing like for like tests. So some of those tests were taking 2 nanoseconds. If it goes to the 3 nanoseconds, that's a 50% increase and the other thing of of cargo bench is looping through sets of values can be a little bit tedious and so with a decimal number library. You want to be able to test a range of values.

A

7 divided by 3 is not the same as 6 divided by 3, a very different calculation that actually happens under the scenes, and you can do the sort of cargo bench. It's just a matter of leveraging macros to be able to do all the the boilerplate code.

A

For you, a second library which is out there, which is also a viable to use as criterion arias and that's inspired by Haskell's criterion library and it's been written so that it runs on stable by default, which means it also runs on beta and nightly, and getting that up and running, as is pretty straightforward as well.

A

That's a matter of including the criterion crate and then overriding the bench commands so that, instead of invoking the default bench and harness and invokes the criterion bench Alice, and so within the code relatively straightforward, we've got the criterion main function, which essentially is what is getting invoked by the harness as the criterion harness and instead of having the bench attribute. We had criterion group, so we had the bench attribute on each of the functions and still you list them in their criterion group macro and the actual met.

A

The actual function that gets called is relatively similar looks similar in the fact it's got. An editor function as well as I'm passing the result to a black box and, however, the big difference. There is it's wrapped around with a beach function and function, and so what that is, therefore, is to name all the beaches. So because you not able to leverage the attribute the beach attribute. You've got to be able to name them via a string for it to work unstable.

A

One of the actual side-effects of this, as well as they do have Beach function over inputs.

A

So you can actually pass multiple inputs to a benchmarking function, also that the results serve verbose, and so, as you can see, it's not a tabular format anymore, but within each of those benchmarks it does give you a lot more information, it's giving you their and the time the percentage change, outliers, statistical signal, whether it regressed or not, etc, etc, and so there's a lot more information there and in addition, it spits out these graphs and I actually haven't used these graphs.

A

Much and practice, however, I can imagine this would be a great thing to show your boss show that you're actually making an improvement there.

A

So criterion is relatively easy to set up and what I mean by that it's an easy migration from cargo bench. The code was fairly similar and you know it has some way offs and which turn out to be kind of nice in some ways, so bench function over inputs means you can pass multiple inputs to a a single benchmarking function and it does provide statistics out of the box, which means you don't need a second plug in or whatever to manage there.

A

The major downside is it's that it's quite a bit slower than bench and it takes quite a while to I, guess warm up the tests and then run them and I guess it really depends on what you use cases.

A

But if you're wanting to run things quickly or do minor small changes, then this can be a a little bit of a bottleneck for you and it does suffer some of the threshold issues between iterations and there is a check going on about that and one of the the forum's about introducing the ability to control threshold and but at the the previous example again showed regressions and nothing had changed in the code and the last thing which is more of just like it got sure, is because you're naming your own benchmarking functions.

A

You've just got to be careful to not name them the same thing. If you do, then you get these weird subtle comparison errors, and that can just really throw you off if you haven't realized. What you've done so so before jumping into some of the practical things of performance- and the other thing I just want to mention- is just really understanding your application before running it under bench.

A

So you can use intrument ation to help understand this, and the reason you might want to do this is to really understand what the internals of the application are doing before you I understand what the internals are doing, so that you can really make sure that what you're testing against as what's expected from a normal application, Ram and thus various tools to be able to do this. But the whole idea here is that you understand what a both expensive calls, but also high-end vacation calls.

A

So, in an example for a decimal library for division for bigger numbers, you might be calling a subtraction function, maybe a thousand times internally, which isn't exposed by the public API. And so it's important that that particular function is tested under those sorts of loads. So we'll go into some of the practical things and this one's going to see them a little bit of a cop-out, because it's going back to the fundamentals, but essentially some of the biggest ones you can get is really from re.

A

Looking at your approach and these things, such as early exit conditions, these operational efficiencies using bitwise shift and C of dividing by a power of two, these parallel operations, you've got dynamic. Programming use of efficient types will jump into use of efficient types a little bit more in this talk and just go into that a little bit more. But just there's a bit of an example we'll just talk about an Postgres write, performance and so Postgres has a numeric data, type and you're, probably all aware of, and to write out to the Postgres protocol.

A

You need to essentially break a number into groups of 10,000 and then write out the number of groups that you've broken into followed by the weight of the integer portion science scale and then followed by the groups of 10,000.

A

So if we were to write out the number 3.14, then essentially we need to break it into groups, and so we group back him to the integer portion and the decimal portion and the integer portion we actually pad to the left with zeros, to make it into a 10,000 group and on the decimal portion we pad to the right to make it into a 10000 group, and so when we're writing it out.

A

We write out two groups in this case and the first ones integer portions, of the weight 0 and the scale in this case is two because there's only two decimal points that are relevant and the the 1400 group, and so so that was, you know relatively easy, easy to explain from a base-10 perspective.

A

You know if we had it in a string, for example, it might be quite easy to reason about how we might be able to implement this. We essentially chunk it into groups of four and paired the zeroes and output the results. So if we were to do that and rust, we might do something like so where we convert it to a string.

A

We find out where to split the number, and so that we've got an integer portion and fractional portion, and we can then go through and chunk it by fours heading to lift for the integer portion and padding to the right for the decimal portion and then writing it out to the protocol with the the big endian format down the bottom. And so, if we were to do it this way then say we had 15 samples that we're testing with it roughly takes around 15,000 nanoseconds per iteration, so around about thousand seconds thousand nanoseconds per iteration.

A

For this particular example- and of course I wouldn't be talking about this, if you couldn't do it better, so what have you didn't have to convert it to string? Well, you don't, and if you take a step back and think about the pure math approach about how to do this, then you may think about how to scale up the number to the closest team thousand group keep dividing it by 10,000, while storing the remainder and keep going until you reach.

A

0 and you've got your groups, so we took the the 3.1, for example again, and we scale it up to 10,000 boundary, which is 31,400. We divide it by 10,000, we get 3 with a remainder of 1400 and we divide 3 by 10,000, with a remainder of 3.

A

We've got our groups on the right there and that's without having to convert it to string first so implementing this and rust, and we could do the following where we essentially figure out what we should be scaling and up to so the closest team thousand boundary I'm, also using a couple of operational efficiencies here, which is instead of using the modulus and divider functions or the rain. Durham divider functions Steve just taking the last two bits of the number and likewise also bitwise, shifting over to Steve, dividing it by 4.

A

So we've scaled up the number. We then divide the number by team, thousands during the group into array and finally, we write it out to two Postgres, so the performance of that is quite a bit faster. It's 5,000 NS seconds versus 15,000, nanoseconds and so we're getting a you know. Three times increased just from really looking at our approach, the first one was easy to reason about, but the second one of you take a math approach to it.

A

Then you can get some clear, clear winds just from doing that, so the next one I want to talk about is the exercise slices performing better. So if we have two identical functions like so where one of them takes exercise slice of three and the second one doesn't, then the hypothesis is the first one runs faster and you're, probably looking at this and thinking. Well, obviously, it runs faster pool there. The second one does all these bounds, cheeks, which the first one doesn't do and you'd be right in this particular case.

A

So the the first one well that one of them takes two nanoseconds pin iteration, where the the one where bounce cheeks takes three nanoseconds and within a loop we get 20 100 nanoseconds in 3700 nanoseconds. So it starts to broaden a little bit there. But what if we didn't, have those bounce cheeks? What have we had this particular code here, which is, of course a little bit dangerous to have in your production and base, but nonetheless, how does it perform compared to a fixed sized slice?

A

Well in a single iteration sense, it's closest to nanoseconds per iteration, but as we get into a loop, we start to see a little bit of a little bit of difference there. So we have 20 100 nanoseconds again versus 27 hundred nanoseconds.

A

So what this is saying to us is that, by giving the fixed size slice, the compiler is able to make various assumptions or various I guess heuristics, to be able to understand that they can. It can make further optimizations that it's unable to do without having that fixed sized slice, so fixed sized slices are indeed faster.

A

um So, following on from their inlining, can both improve and hinder, and so with that previous example, there say we just throw inline always on to the top. What's going to happen well, you said that things are going to improve substantially then you'll be right, but interesting to note is that both of them behave exactly the same way now. So within lighting both of them take 1,300 nanoseconds per iteration for the multiple iterations and one nanosecond for this, the singular version.

A

So what that's really telling us is that the compiler is able to make various I guess conclusions by having the the code in line, so it's able to optimize the the compiled code so that it even you know everything runs faster and inlining for the second example of the bounce cheeks.

A

Well, it knows that the size of the array that I'm passing through is 3, so it can basically remove all those bounce checks and make things faster as well, and so so within lining that the common thing is is well that I often hear is I use matter than the compiler and to know when to in line and essentially I. Think if you're measuring things- and you understand the impact of what you're doing then sometimes I say.

A

Yes, you can be smarter than the compiler and so with a library, though I think the main thing to call out there is that just to be worried about the impact you're going to be having on your downstream components. So, although consuming applications, so it will increase the binary size, and sometimes you won't get the benefits you want now with with the increased binary size. So my general rule was to be very wary of inlining public API functions going nuts with your internal stuff.

A

If you, if you know what what the impact is so the fourth one is primitive types, almost always better, and so the first ones a little bit of a contrived example, but bear with me, and so every was representing a decimal number I might represent it as teamed about ten divided by ten to the power of E, and so, if I wanted to represent pi and two more precision, then essentially I have a bigger integer portion and a bigger scale to help represent that and an obvious way of representing- and this particular numbers with a big ant, I mean that's.

A

Just a single number could store the emotion and a scale we could store it as a an array of for you, thirty twos and so have a ninety-six bit integer. So we have a cap there and a fourth one to represent a scale and a sign, or you know, since rust 126 we could also use a 128-bit integer.

A

So that's pretty cool, so we'll take a look at big emphases: u32 the for the array of for MU, thirty twos and as an operational example we'll try to add together, two decimal numbers so adding to adding together. Two decimal numbers is relatively straightforward.

A

Essentially you need to have the numbers at the same scale to be able to add them together. So if I have two point, five, which is scale one plus three scale, zero, then I just need to scale up the the three to scale, one to be able to add them together, so 25 plus thirty, it's easy. You have fifty five scale, one so five point five and so big. It's a lot easier to reason about this.

A

It's one single number you need to scale up, and if you have an array of you know three, you thirty twos, then you need to start thinking about words and how the overflow happens between binaries. It's not just a simple addition. Anymore, you need to essentially add the bits together one by one and so with a naive implementation of both these. You can see that the the array version does actually behave a lot faster than the beginner version.

A

However, that's not really what I want to go into in this particular part. I really want to talk about exploiting fixed size primitives. So by knowing that that array is 96 bits, we can make a lot of assumptions about how we add together or do certain things with their particular number and so shifting on from addition going to division, a naive implementation of division might look like so and I'm using 32-bit numbers here, but essentially the concepts the same.

A

So this is a the two's complement version of division of you're, not aware of it as as, essentially you take the dividend and you keep inger divisor every time. You're successful, you increase the quotient, so, as you can imagine that doesn't perform very well. So for a smaller numbers, you can get really good results. 18 nanoseconds, but as you get bigger than 90 million MSI considerations, it's not acceptable in most cases. Probably so we know that the the size of the type is 32 bits.

A

So with that, we can also make various assumptions about how that two's complement and division works. So, instead of looping through in number of times, depending on the size of the number, we can limit our loop through only 32 times and a syn and basically exploit the fact that it's a binary number.

A

So we can exploit the fact that when things carry over, we can they disappear from the number when we should bit shift left and we can carry them over with various tests that we need to I won't go into this myth here, but essentially the idea of it here is that we are now only looping 32 times, so we get around about 45 or 42 to 46 nanoseconds per iteration, for this division and for all all of those particular numbers and I do want to call out here is that the hundred is actually slower.

A

In this case, we do have 46 nanoseconds per iteration over the 18 we had before, but it's a reasonable trade-off, considering that we now have a very consistent performance across the board.

A

So the other thing I just want to pull out here is. This is another example why you should be testing across a range of inputs if I just have one test around for a hundred or even 200 or what a thousand, then it's not going to be incredibly bad performance, but as I start to get to the bigger numbers that starts to get horrible.

A

So the fifth thing I wanted to talk about is copy, borrow semantics in this ones and essentially, when we first start getting into rust, well, I'm sure a lot of us are battled with the borrowed chicken before and we've done things like this to get around it and obviously that's a warning sign because cloning to get around the borrowed checker isn't the way forward. It's just the way forward, temporarily or whatever else, but essentially, if you're, avoiding the borrowed checker, it's probably a bad sign and the reason being is copying or cloning a large struct.

A

We all know that that can be expensive, but essentially it can also be expensive if it doesn't fit on the stack. So we need to be considering the size of the data that we're really copying in these particular cases.

A

So as an example we'll take a look at the when we had before, which is the fixed size, slice, and so we've got the fix a slice of three there and then on the bottom, one there the copy, we don't have, the the borrow we copy the array instead of borrowing it.

A

And so, if we were to look at the performance of this, then we can see that for a single iteration, it looks fine to nanoseconds each, but once we get into a bigger loops, then we start to see a little bit of a variance and, to be honest, this this particular benchmark is an interesting one because it varies substantially between runs. It really depends on how the process is working at the time, but essentially this one has a roughly around 2,400 seconds versus twenty nine hundred nanoseconds.

A

So copying in this size, just 96 bits as absolutely slower than just borrowing those 96 bits in this particular case again. If it can fit on the stack it's fast, but if it doesn't, then it potentially could be a bit slower. um So what about unsafe I mean unsafe is a great way to improve performance right.

A

So, as library authors I think that well, the definition of unsafe really for wooden rust from my understanding is you're basically telling the compiler hey, I, know better than you I know what I'm doing trust me and well that's fine when I'm writing the code. You know I trust myself, but you know, of course, I cause a lot of bugs as well, but essentially the the community doesn't necessarily trust me, like I trust myself.

A

So having unsafe within your code is something that, if you can avoid it as possible, then then you know you should be avoiding it um and I have done a little bit of experimentation with in the library and I have found that and the cases that I've been looking at where I've thought hey I just had me moved. That would be a lot faster. I've actually found that there hasn't been any noticeable difference and I.

A

Think that's kind of interesting I think that's something that, as as you know within everyone here, if you were measuring your performance and you see that there's a huge and performance increase, then you can make that trade-off only if it's absolutely necessary and so I think there originally I was jumping to the conclusion that unsafe was always going to be better and in terms of performance. It wasn't necessarily the case, and so I'd probably be interesting to challenge challenge yourself by just measuring your results.

A

So, just as some closing thoughts, as that um you know, as library authors, we have an implicit responsibility, consider and performance as small details do matter, and it's really important that you understand how your applications working under load or under normal circumstances, in order to really benchmark it effectively.

A

It's important that you're testing across a range of inputs would be a shame of you. Just were testing against dividing by 100 before and the billion was turning into 19 million nanoseconds. So it's important that you're really considering a whole variety of of benchmark tests and the main purpose of metering is really so that you have insights. So you can make effective decisions going forward so I, you know.

A

All of this has really been a result of experimentation, so really understanding what's there, knowing that you want to make the improvement and then absolutely testing something and I guess, testing a hypothesis and seeing that actually worked, and so all in all this community is awesome and you guys are great for coming out to the russ conf and we can keep making it awesome by really working together.

A

And so thank you for listening and if you do have any questions and you can come and see me in the break, but I did also want to just say a special thanks to all those people that helped me build these libraries. You guys are awesome. So thank you.

A