From YouTube: RustConf 2017 - Improving Rust Performance Through Profiling and Benchmarking by Steve Jenson
Description
Improving Rust Performance Through Profiling and Benchmarking by Steve Jenson
This talk will compare and contrast common industry tool support for profiling and debugging Rust applications. We'll discuss our experiences finding and fixing performance problems in a production Rust application.
So today we're going to talk about understanding the performance of Rust programs and libraries, we're going to talk about some Rust-specific pitfalls, and I'm going to talk about some tools you can use to see the performance of your Rust programs. Some of them you can even use today and install on your laptop.

I am an engineer; I work at Buoyant. We make — thanks to our design team for some great transitions — we make Linkerd. Linkerd is a cloud native service proxy.
We build what's called the service mesh, and here at Buoyant I work on some Rust and Scala. One of the things that we've built is called linkerd-tcp, which is a TCP proxy. It's written in Rust, and it's designed to work in cloud native environments like Kubernetes or DC/OS. We have more protocols coming: right now it's TCP, but we're adding HTTP. Like, just today we announced that Carl open sourced the h2 repo on GitHub — you should take a look at that. linkerd-tcp, just like Linkerd, is Apache licensed and free for everyone.
One of the things that we've built to enable us to build linkerd-tcp is a thing called tacho, which is a stats library: it allows you to instrument your applications. As a stats library it has some basic features. It has counters, so how many times something happens; it has timings, so how long something takes to occur — you know, for some bit of code to run; and then gauges, which are values at a point in time. This is also Apache licensed.
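To make that concrete, here's a minimal sketch of that style of instrumentation — the names are illustrative, not tacho's actual API:

    use std::time::Instant;

    struct Counter(u64);
    struct Gauge(u64);

    impl Counter {
        // A counter: how many times something happens.
        fn incr(&mut self) { self.0 += 1; }
    }

    impl Gauge {
        // A gauge: a value at a point in time.
        fn set(&mut self, v: u64) { self.0 = v; }
    }

    fn main() {
        let mut requests = Counter(0);
        let mut queue_depth = Gauge(0);

        // A timing: how long some bit of code takes to run.
        let start = Instant::now();
        requests.incr();
        queue_depth.set(42);
        println!("requests={} depth={} in {:?}",
                 requests.0, queue_depth.0, start.elapsed());
    }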
I'm going to use tacho — one of our macro benchmarks — as sort of a driver for the talk and the topics that I'm here to discuss today. Here is a macro benchmark that I wrote, and I'm going to talk about what macro benchmarks are here in a second, but I wanted to give you a sense of what tacho is before we start, you know, digging into it and looking at graphs of performance and charts and that sort of thing.
So here is a multi-threaded macro benchmark. Here you can see we have a timing: we measure how long it takes to do an action, and we have a loop where we do this about 10 million times inside of a single program — this macro benchmark driver we call multithread. And what do we do here? We set the current iteration, we increment a counter; this is the basic work we do, and we do it ten million times.
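Roughly, the shape of that driver's work looks like this — an illustrative sketch of the loop, not the real benchmark code:

    use std::time::Instant;

    const ITERATIONS: u64 = 10_000_000;

    fn main() {
        let mut counter = 0u64; // stand-in for a stats counter
        let mut gauge = 0u64;   // stand-in for a stats gauge

        let start = Instant::now();
        for i in 0..ITERATIONS {
            gauge = i;    // set the current iteration
            counter += 1; // increment the counter
        }
        println!("{} iterations (counter={}, gauge={}) in {:?}",
                 ITERATIONS, counter, gauge, start.elapsed());
    }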
In native programs there are generally three causes of performance problems: memory stalls, which is, you know, talking to DRAM, essentially; lock contention; and CPU utilization. When we talk about memory stalls, really what we're talking about is the memory hierarchy. These aren't exact numbers — these numbers are here to give you a sense of how the order of magnitude changes when you move up levels when you're talking to memory. So a register takes like half a nanosecond to talk to; your last-level
cache is going to be like 10 nanoseconds. If you remember your CS: you've got an L1 cache, you've got an L2 cache, and then sometimes you have an L3 cache, and each layer of the cache is larger and slower and cheaper — well, they're the same cost, but anyway. So again: a register in half a nanosecond, the last-level cache in ten, and then talking to DRAM is going to be about a hundred nanoseconds. So that's just a quick overview of the memory hierarchy to remind yourselves. Then we have lock contention — things like spin loops and blocking waits.
These are things that can cause performance problems in your program, and we'll see one of these here in a moment. And then CPU utilization. The reason I want to talk about CPU utilization last is because it generally hides a lot of the performance problems. CPU utilization can hide the other two things that I talked about: it can hide memory latency behind a slow instruction — it looks the same. What am I trying to say? For a given instruction,
you can't tell if the instruction is slow or if you're talking to memory when you're looking at it, is what I'm trying to say here. And an instruction — when I say instruction, I really should say a function here — can hide lock contention, like if you have a spin loop. Idleness is often counted as useful work: you look at top or htop and you'll see,
oh nice, I'm using 90% of my CPU — but it can also mean you're spending 80 percent of your time waiting for RAM or disk. And now I want to talk a little bit about some Rust-specific pitfalls.
Yeah, here's a big one: deriving Copy on large structs. Copy is great — Copy makes your programs much more ergonomic; you know, it can be a real lifesaver. But if you find yourself overusing it, you can copy accidentally, without meaning to. The great thing about Clone is it's explicit, but Copy is implicit.
So if you find yourself, you know, implementing Copy, you can find yourself killing your DRAM bandwidth, and the most common reason is: it was small when I started. Here's an example, where you can see someone might make a Person struct where they just have an id — an int — and a reference to a string, and then somebody decided, well, let's put the whole user's DNA in there, and that's 800 megabytes.
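A rough reconstruction of that example (the field names are illustrative):

    // Fine when Person was two small fields; once `dna` showed up,
    // every duplication drags hundreds of megabytes through DRAM.
    // (Vec itself can't be Copy, but the same trap applies to any
    // large by-value data on a struct that used to be cheap to copy.)
    #[derive(Clone)]
    struct Person<'a> {
        id: u64,
        name: &'a str,
        dna: Vec<u8>, // ~800 MB per user
    }

    fn main() {
        let p = Person { id: 1, name: "ada", dna: vec![0u8; 8] };
        let q = p.clone(); // explicit, so it shows up in review and profiles
        assert_eq!(p.id, q.id);
    }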
A
It
should
be
a
reference
or
maybe
not
there
at
all,
and
then
again
you
know
we're
talking
about
where
time
a
copy.
You
know
we
can
talk
about
clone,
doing
clone
a
little
bit,
something
that
can
kill
performance
through
through
saturating
your
DRAM
bandwidth.
The
nice
thing
about
clone
is,
is
explicit,
you'll
see
it
in
a
profile,
you'll
see
clone
being
being
there
at
the
very
top.
Here's just a sort of contrived example of going through a people vector and pushing clones into another friends vector. And then, something that pretty much everyone in this room is familiar with: the standard library's default hasher is cryptographically strong, and so it can be a little slow. It's a well-known trade-off to, you know, experienced Rust programmers; it's really surprising to brand-new Rust programmers.
They maybe aren't used to this — they haven't seen this before. The nice thing is, there are lots of great alternatives for using different hashers, and there's a good page with benchmarks — Rust-specific benchmarks — for these different hashers. And here's an example of using a different hasher, one that's a little bit faster: here we use an FNV hash. There's no reason to show the slow version — everyone knows what it's like to make a HashMap — so I decided to show what it's like to make a slightly faster one.
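For example, with the fnv crate (assuming `fnv = "1.0"` in Cargo.toml), swapping in the faster hasher looks something like this:

    use fnv::FnvHashMap;

    fn main() {
        // Same HashMap API, but with the FNV hash function instead of
        // the default (slower, DoS-resistant) SipHash.
        let mut counts: FnvHashMap<&str, u64> = FnvHashMap::default();
        *counts.entry("requests").or_insert(0) += 1;
        println!("{:?}", counts);
    }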
Again, it depends entirely on your use case; you can pick from one of the great alternatives. So something else that can bite us is using expensive arguments in places where maybe we don't expect the code to be evaluated. expect is a good one: like, you often expect your code — the reason it's called expect — to work, but you can put something slow in expect. For instance, this was pulled out of one of Eliza's
libraries. It ended up being in the hot path, and so this format! was called, you know, ten thousand — a bazillion — times every second. We pulled it out, did something a little cheaper, and it sped up the program dramatically. Again, this isn't specific to expect; it has to do with the fact that Rust is eager, and you just have to remember the evaluation order of your program.
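A small sketch of the pitfall and a cheaper alternative (the message and values are made up; the methods are std):

    fn hot_path(val: Option<u32>, id: u64) -> u32 {
        // Eager: the format! runs on every call, even when val is Some.
        val.expect(&format!("no value for request {}", id))
    }

    fn cheaper(val: Option<u32>, id: u64) -> u32 {
        // Lazy: the closure only runs in the failure case.
        val.unwrap_or_else(|| panic!("no value for request {}", id))
    }

    fn main() {
        assert_eq!(hot_path(Some(7), 1), 7);
        assert_eq!(cheaper(Some(7), 2), 7);
    }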
Last, I want to talk about preallocating Vecs.
You know, when you look at experienced Rust programmers' programs, you see that they preallocate — you know, Vecs, everywhere; like, they hardly ever allocate. And here's an example of where we do this in linkerd-tcp: we have a structure with a buffer size and a default size, and the highlighted line is us making the buffer that we use to shuttle data around from, you know, inbound to outbound.
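The pattern, in a minimal sketch (the size constant is illustrative, not linkerd-tcp's real default):

    const DEFAULT_BUFFER_SIZE: usize = 64 * 1024;

    fn main() {
        // One allocation up front; filling the buffer won't trigger
        // repeated grow-and-copy reallocations.
        let mut buf: Vec<u8> = Vec::with_capacity(DEFAULT_BUFFER_SIZE);
        buf.resize(DEFAULT_BUFFER_SIZE, 0);
        assert!(buf.capacity() >= DEFAULT_BUFFER_SIZE);
    }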
And now I want to talk about some of the tools that you can use. I'm going to talk about some tools on the Mac, like Instruments, cargo bench, and cargo-benchcmp, and also some Linux tools that we use — because our primary deployment target is Linux — where we use perf and flame graphs, VTune, and again also cargo bench and cargo-benchcmp. So cargo bench is a micro benchmarking tool; I'm sure many of you use it. It's part of the standard toolset, and it's really great for what you consider, like, the highly used parts of your API.
You should consider writing micro benchmarks for those. Here's an example of a micro benchmark using cargo bench that we have in tacho, where we benchmark how long it takes to make a new counter.
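For reference, a cargo bench micro benchmark looks something like this (nightly-only at the time of this talk; the benchmarked body is a stand-in, not tacho's real constructor):

    #![feature(test)]
    extern crate test;

    use test::Bencher;

    #[bench]
    fn new_counter(b: &mut Bencher) {
        // cargo bench runs the closure many times and reports ns/iter;
        // black_box keeps the work from being optimized away.
        b.iter(|| test::black_box(0u64) + 1);
    }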
And the output will look like this — here it's mixed in with a bunch of other benchmarks we've written. Here you can see I've highlighted 23 nanoseconds and 47 nanoseconds, and those numbers look great.
Everyone would probably be happy with those numbers — unless you're really paying attention. cargo-benchcmp is something that's relatively new; it allows you to compare two cargo bench runs. This is really great for avoiding performance regressions: you know, someone ships you a PR, you look at the PR on GitHub, you try it yourself, and you can do a textual comparison of what cargo bench outputs and see whether this does or doesn't help performance in this particular example.
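The workflow is roughly this (cargo-benchcmp installs as a cargo subcommand via `cargo install cargo-benchcmp`):

    cargo bench > before.txt
    # ...apply the PR...
    cargo bench > after.txt
    cargo benchcmp before.txt after.txt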
I borrowed this graphic directly from the cargo-benchcmp GitHub page. You can see it doesn't really make much of a difference: even though there's lots of green and red, the numbers themselves are not that different. While micro benchmarks are really useful and you should write them, what we have found is that writing macro benchmarks is another really good use of time.
A macro benchmark is taking your API and using it sort of in the same context that your users would use it. The reason that micro benchmarks are limited in utility is that you run something a handful of times and you measure it, so you don't really know how well it's running in the context of your CPU. So in the macro benchmark that I showed you before, we exercise the API in a loop of 10 million times, and we'll talk about why we do so many iterations.
Our tacho, as I mentioned earlier, has two macro benchmarks: we have a single-threaded benchmark and we have a multi-threaded benchmark, because these are the two different ways that people tend to use tacho. And this again is the same slide I showed before of the multi-threaded macro benchmark. So what's going on is: there's a couple of threads being spawned, they're calculating timings and adding them, and then we have another thread
that's doing the reporting work. And then here, just a really quick segue: using Instruments to look at the performance of this multi-threaded benchmark. Here I use a measure called IPC, which is instructions per cycle, and I'll talk about what that means here in a second. This is really just to give you a sense of what we're going to talk about more here in a moment.
We have stumbled upon using IPC — and I shouldn't say stumbled upon; it's a fairly commonly known metric. We use IPC as a useful empirical metric: it tells you how many instructions are completed every clock cycle. When we learn CPU architecture in school, we tend to learn it as: you have a program that runs one instruction, and then runs the next instruction, and the third, and so on — and that's just not how it works on modern Intel CPUs. I think this is sort of the model that we learn.
As you can see here, this again is the serial model: you run the first instruction, then the next instruction, then the third instruction, and here in this diagram you see I've broken out the pipeline stages. I got this graphic, and the next one, from a great page called "Modern Microprocessors: A 90-Minute Guide!", which you might enjoy. But this really is not how programs run anymore. They run more like this: as soon as one fetch stage has run,
the next fetch stage can start; as soon as one decode stage is finished, the next one can start. And you don't just have one execution unit anymore inside of a single core — now you might have three or four or five — and with even more complicated instructions, each one of these little boxes can depend on another box.
So given how complicated things are, and since instructions can depend on each other and we can have huge bubbles in the pipeline, how do we know we're doing well? Well, here we enter the realm of the performance counter. So Intel engineers had the same question — how do we know our program is running well, given how complicated a modern CPU is? — and they added something called performance monitoring counters. They count, just like the counters in the benchmark, whether
useful work is being done. And what we found is a good rule of thumb: if you have an IPC score less than 1, it means you're memory stalled; if you have a high IPC score, greater than 1, it means you're instruction bound — you maybe need to run fewer instructions. And you can verify this on your machine; you can learn this empirically. Like, if you have a deployment target — a modern CPU might be, like, 5-wide, or one might be 3-wide — you can write a program
that does nothing but, you know, work entirely out of registers, and you can determine whether this rule of thumb is appropriate or not — again reminding you just how complicated a CPU is. A three-wide CPU, which is what you saw at the very beginning, could have an IPC of 3; as I mentioned on that slide, that gives a max IPC of three. Instruments on the Mac has IPC built in. Again, this is going back to the earlier slide.
What I've chosen to do here is look at the IPC for the multi-threaded benchmark, and you can see we have an IPC score of less than 0.5. You might think, oh, it's 0.5, you know, that's great — but you know, it's a three- or four-wide CPU; that's 0.5 out of 4, and it's really poor — that's like maybe 10% utilized. And here I've done the same thing with the single-threaded macro benchmark: the IPC score is about 0.8 — a lot better, quite a bit better, yeah.
So all these counters are available directly in Instruments. What you do is you make a new counters instrument, and then you long-click on the recording button and pick an event, and then you'll have a really just daunting drop-down of performance counters. It sort of assumes that you've read these hundreds of pages of documentation, and really what you're going to do is find a handful of them that work well for you. But the nice thing is, you can actually create formulas from these performance
counters. IPC is the only one that ships by default; you can make your own. Here's an example using Instruments where — I can tell it's hard to read from there — I'm looking at L1 hits and L1 misses for the multi-threaded benchmark. So I specifically went through the drop-down and picked L1 hits and L1 misses, because I wanted to see what they look like for this multi-threaded macro benchmark.
Here you can see we do have about 20 million hits for the L1 cache and about half a million misses, which, you know, isn't great, in my opinion. And then another use of Instruments can be to sort of look at your typical waterfall of methods and see how long each individual function is taking, and you can see the heaviest stack trace over there.
So, you know, it's an easy-to-use performance tool. It's a little limited because it's Mac-specific, but since you have the performance counters available, it's actually much richer than a lot of people give it credit for, and I suggest you give it a try. On Linux — as I said, we deploy most of our code on Linux — we dig in with perf. perf has been part of Linux for a few years, and they've been constantly improving it; they've put a ton of work into perf.
It has both kernel and user space profiling. It's a sampling profiler with a configurable sampling rate, which you can play around with. You know, the Nyquist theorem tells you that if you want to see something that happens at rate X, you should sample it at 2X, and so the nice thing about it being configurable is that you can really turn it up. It's also pretty cheap; they designed it to run at low overhead.
So you can run it in production at a somewhat moderate sampling rate. And here's an example of looking at the IPC metric again, for our program, using Linux perf — I've highlighted it; it's roughly 0.5 instructions per cycle. This is a different CPU — this is a dedicated machine we use for performance work — which is why the instruction score is a little bit different. And here I use perf to look at the cache hits and misses.
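As a sketch, a perf invocation along these lines reports instructions, cycles (and therefore IPC), and L1 data cache hits and misses — the binary path is illustrative, and event names can vary by CPU:

    perf stat -e instructions,cycles,L1-dcache-loads,L1-dcache-load-misses \
        ./target/release/multithread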
I'm looking at the L1 misses and hits, and you can see that for the multi-threaded benchmark I have about a 5% miss rate, which isn't great; but for the single-threaded one I observed essentially a 100% hit rate — I mean, a roughly 0% miss rate there — which is great.
So I really encourage you to dig into this: if you ship software that's supposed to run on Linux, you should really dig into perf. It's much, much deeper than I had time to go into,
considering that I'm talking about other tools. It's very Linux-specific: you can dig into the Linux scheduler, you can dig into the I/O and network subsystems; there's an incredible amount of tooling being built around perf. One of the things that we use with perf is to build flame graphs.
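The usual recipe, sketched with Brendan Gregg's FlameGraph scripts (the sample frequency and binary path are illustrative):

    perf record -F 997 -g ./target/release/singlethread
    perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg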
A flame graph is built from samples: again, like I mentioned, perf is a sampling profiler, so you collect a bunch of samples and then you aggregate them, to get a sense of, like, what is the shape of your program. And here is what one looks like for our single-threaded program. So, like I said: you take — let's say this is a thousand samples — then it aggregates them. What's on the very top, you can mouse over and see what that is. I know it looks incredibly tiny.
You can actually drill in by double-clicking, and then you can mouse over to figure out exactly what function it is. What's on top is what was running on the CPU when the sample was taken, and then the width of what's below it gives you a sense of how often that was seen in a sample. So here we can see: we get a clock, we get the time, we make an Instant, we call map on a future — so it gives you sort of a sense of your program.
One thing that we do a lot is we'll just drill into some part of this, and I just give things a look from time to time to make sure things are looking the same. It's really useful for looking at long-running programs: since what we ship is a proxy that runs 24/7 in people's data centers, it's really nice to be able to get a sense of, like, how do things look at time T versus, like, 30 minutes from now? Are things changing as events happen inside of the system?
How does the shape of your running program change? Yeah, Netflix really pioneered this technique, and they use this to measure the health of all their online services, all the time — they're always building flame graphs of all their online services. The one thing that's trickier is you've got to remember to get symbols, just like I described earlier with profiling. You have to remember to add the symbols; otherwise, what you're going to get in the flame graph is a bunch of addresses — and good luck.
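For Rust specifically, one way to keep symbols around is to enable debug info for release builds in Cargo.toml, so perf and flame graphs show function names instead of raw addresses:

    [profile.release]
    debug = true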
So the last tool I want to talk about is probably the most complex: VTune. Like I said, you know, Intel engineers had the same question that you all have — how well is my program running? — so they added performance counters. Now they have another problem: what do all these performance counters do? So they made this commercial tool called VTune, which helps you sort of make sense of what all these performance counters do. It's an extraordinarily complex tool, but you can get your head around it
with some practice. It has tooltips and a GUI that works over X forwarding; it has a CLI; and if you're an open source developer, you can get a free license. Here I'm looking at, mm-hmm, the single-threaded benchmark — I really should have labeled these graphs; thankfully I've looked at them enough to know what they are. This is the single-threaded benchmark, and this is the general exploration view, and you can see there's a fair amount of detail here. I can tell this is memory bound.
This is backend bound, specifically on memory. And looking at the multi-threaded benchmark, that's also backend bound, but something really stood out to me from running this, which was that I was bound on a remote cache — and I'll come to this; this is a lesson I learned from preparing this talk. I didn't realize this about this particular benchmark — that's the multi-threaded benchmark. Here's a view of — I believe this is the memory explorer — here we're looking at the single-threaded benchmark.
I can tell from the miss count. And you can see stacked CPU time, but then, secondarily, what's memory bound. And what's really great about this tool is you can mouse over any cell and it'll give you a useful tooltip of what that means. That's really helpful when you're starting to use VTune; it's such a daunting tool.
It'll also give you the formula that it uses to derive many of these, which will be something like one counter divided by some other counter — you know, that's an example of a formula. You can actually take those formulas that you learn and plug them into Instruments for when you're doing development on your laptop. Here is the multi-threaded program.
The miss count is pretty high — note how I can tell this is the multi... oh yeah, it also has the name at the very end. Here's the multi-threaded benchmark, and here's the locks and waits view. So typically, when you use VTune, it has a lot of different analysis types that you can run, and you would start by looking at, like, the general exploration view, like I showed at the beginning, and then you would run it over and over again, using the different types of analysis. You would run your benchmark.
That's another great reason to write a macro benchmark: if you have an example application, you can hand it to perf to run, you can hand it to VTune to run, and then you have just an easy tool. You don't have to remember — how do I run this? what scripts do I need? You just run your macro benchmark in the tools.
This is another view — this is the event count view. You can see from the progress bar at the bottom that there's a tremendous amount of detail here. This is kind of like the day trader, or fantasy football, view of your performance. I wouldn't encourage you to spend a lot of time here, but — as I mentioned earlier, there's a GUI and there's also a CLI — this is something you can get CSV from, and you can do some post-processing on it to learn specific insights for yourself.
It can be really overwhelming, and the tooltips are helpful for all the different analysis modes. I found it really useful — I only dug into maybe a third of the analysis modes for my screenshots here. Oh, that's because — I learned a lesson while preparing this talk: VTune highlighted this remote cache issue, and what I learned is that I was remote-cache bound. What this means — as I was saying, this machine has two physical CPUs, and
for each stick of memory that you plug in, it shows you all the memory use for that, and also the memory traffic between the two CPUs — which was also, like, just a dead giveaway that there was a behavior I wasn't expecting. So going back to perf: you can see that I had that 5% miss rate that we talked about earlier. So what about the multi-threaded benchmark?
What I decided to do is: well, I'll just use taskset — I used taskset so the tasks run on one CPU — and we'll see how it improves.
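The pinning itself can be as simple as this (core 0 and the binary path are illustrative):

    taskset -c 0 ./target/release/multithread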
And then the miss rate dropped to 0.01 percent, which was pretty fantastic, and then you can see there: the total run time dropped from nine seconds to 3.8 seconds, which is pretty great — I'm really pleased with that. And we were talking about IPC being a useful empirical measure:
IPC increased about 10%, which tells me there's still a lot more work that we could do to improve that. So yes — performance is hard. It's hard to understand, you know, given how complex CPUs are, but there's a lot of tooling. You really need to measure empirically: you need to run your program and see how it does. There are a lot of great tools that you can use, and IPC is a really great measurement that you can use.
You know, I've been really pleased at how many tools there are to measure the performance of programs today. You can use Instruments on the Mac — it's much more powerful than people realize — you can use perf on Linux, you can use VTune; but ultimately the best tool is the one that you use on a regular basis. Now, I want to give a special thanks out to Eliza for walking me through Instruments, and thanks — thanks for listening.