From YouTube: "A Work-stealing Runtime for Rust" - Aaron Todd
A
Hello,
my
name
is
Aaron
Todd
and
I
was
a
research
engineering
intern
at
Mozilla
this
summer,
I'm
going
to
be
talking
about
my
work
on
the
rust
languages,
new
runtime
system
and
in
particular
the
schedule
scheduler
for
lightweight
tasks,
which
uses
a
work-stealing
strategy
now.
So
there
are
a
lot
of
questions
you
might
have
about
runtimes
or
rust
I'm
going
to
try
and
answer
some
of
them,
but
I
really
want
to
get
covered
in
this
10
or
15.
So first, a little bit about Rust. Many of you are already pretty familiar with Rust, having seen other talks on the subject, but the core concept we need is that it's just a good language for systems programming. It tries to be both type safe, through a pretty conventional static type system, and memory safe, through Rust's pretty novel region system, which gives you static guarantees on your memory usage. And while giving you all this safety, it also gives you some helpful abstractions, like traits, and the general good features of a modern language.
One example application is a browser engine like Servo, which Tim talked about recently. So, the runtime is just a bunch of stuff that goes on to help your program run; it's sort of nebulously defined, and what it actually does can vary depending on the language. The JVM is an example runtime. Pretty much every language but C or C++ has a runtime of some sort, and even C and C++ applications frequently have some sort of runtime written in C or C++.
Rust doesn't have a JIT yet; maybe it will someday, but that's a pretty common thing for runtimes like the JVM to do. Task scheduling is going to be the focus here. Now, the Rust runtime, in its current, or rather now old, state: it was written in C++, it had a very simple task scheduler, and it didn't really do much for I/O management. It was just not a great runtime. It was a great first attempt to get the language going, and it was great to build off of, but there's a lot we can do differently to make it better.
So the goal was to change almost everything about the runtime, and given that we were changing so much, it made sense to just rewrite it from scratch and get the benefits of clean code. Rewriting from scratch, we've got this language, Rust, that's good for low-level systems programming, like writing a runtime, so Rust is now one of the very few languages out there with a self-hosting runtime: the runtime is written almost entirely in Rust. Now, another important definition: what is a task in Rust?
It's one of the things Rust enforces through its memory safety: you communicate information between tasks by passing messages over channels. While this can sound kind of inefficient, Rust's region system allows you to just pass an owned pointer to a thing to another task. So if you have a giant data structure on the global heap, you can just send a pointer to it over a channel, and the region system will statically guarantee that the other task takes full ownership of it and that your own access to it is removed, so you get great, high-performance, memory-safe message sending. One drawback of tasks, though, is that they're cooperatively scheduled: since you don't have magical kernel powers, you can't just preempt a for loop while it's executing, so with these tasks you have to not screw up. In that sense: no infinite for loops.
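The task API from this era of Rust is long gone, but today's standard library expresses the same ownership-transfer idea with `std::thread` and `std::sync::mpsc`; a minimal sketch (the function name and buffer size are illustrative):

```rust
use std::sync::mpsc;
use std::thread;

/// Allocate a large buffer on the heap, then hand it to another thread
/// by sending the owning pointer over a channel. Only the pointer is
/// copied; ownership of the allocation moves with it.
fn send_owned(len: usize) -> usize {
    let big: Box<Vec<u64>> = Box::new(vec![0u64; len]);

    let (tx, rx) = mpsc::channel();
    let receiver = thread::spawn(move || {
        // This thread now has sole ownership of the allocation.
        let data: Box<Vec<u64>> = rx.recv().unwrap();
        data.len()
    });

    tx.send(big).unwrap();
    // `big` has been moved into the channel; touching it here would be
    // a compile-time error -- the static guarantee described above.

    receiver.join().unwrap()
}
```

Trying to read `big` after the `send` is rejected by the compiler, which is exactly the "your access is removed" guarantee the region system provides.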
We can't save you from that. Some runtimes do do some preemption, because any time you pass from a task into the runtime, like when you allocate, do garbage collection, or grow your stack, the runtime is free to preempt you at that point. But you can write a Rust program that does none of those things.
So the most basic type of scheduling you can do is this: you take a giant global queue of tasks to work on, you make a bunch of workers, probably one for each CPU thread you have, and then you just have those workers take tasks off the queue, run the tasks, and put new tasks back on the queue. Very straightforward. The problem is contention: this doesn't usually work out very well.
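A sketch of that naive design, assuming tasks are just boxed closures and the global queue is a mutex-protected `VecDeque` (every pop takes the one global lock, which is exactly where the contention comes from as worker counts grow):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// A "task" here is just a boxed closure to run once.
type Task = Box<dyn FnOnce() + Send>;

/// Run every task in `tasks` on `n_workers` threads that all share a
/// single global queue.
fn run_global_queue(tasks: Vec<Task>, n_workers: usize) {
    let queue = Arc::new(Mutex::new(VecDeque::from(tasks)));
    let handles: Vec<_> = (0..n_workers)
        .map(|_| {
            let queue = Arc::clone(&queue);
            thread::spawn(move || loop {
                // Lock the one shared queue just to pop a single task;
                // every worker serializes on this lock.
                let task = queue.lock().unwrap().pop_front();
                match task {
                    Some(t) => t(),
                    None => break, // queue drained: this worker is done
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```

Note the lock guard is dropped before the task actually runs, so workers only contend on the pop itself, not on each other's work.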
So one place to go from there is to have separate work queues for each worker thread, and this is what the old Rust runtime, in C++, was doing. The idea here is that when you spawn a task, you put it into the thread-local queue, and when you dequeue a task, you grab it out of your local queue. And this is great.
There's no contention, and as long as all your workers have stuff to do, everything is great. The problem is that you frequently end up in the case where your workers don't all have stuff to do: one task ends up doing a bunch of work, spawning a bunch of new tasks locally, while the other threads all finish and go to sleep. And this is kind of bad.
You don't want this, so what we need to do is move work between the worker threads, such that all the worker threads are doing stuff, and there are a couple of approaches to this. In the very broad sense, you can do something like static scheduling, where you know exactly what you're doing: you know details about the workload, you know details about your hardware, and you can think hard and just figure out ahead of time what the optimal schedule is for allocating the work. The problem with this is that you don't actually have all that detail, and you don't have time to think about it, in the general case, which is why we can't do static scheduling like this. So for the general case you need some sort of dynamic scheduling, where you just do something that's really cheap to do, and you hope it works well enough most of the time; it's going to be suboptimal compared to static scheduling, when you can pull that off. The particular type of dynamic scheduling we do in Rust is called work stealing.
This is a very standard approach, used all the time because it's so effective. The basic idea is that when your local work queue doesn't have any work in it and you want to do work, you just steal some work from another worker thread's queue. This gives you the benefit of using your local queue as much as you can, while you still have the option of getting work from someplace else, and the amount of effort it takes to get work from someplace else is very low, because you're stealing only when you need to. The problem with a strategy where you instead fan work out across the worker threads is that it's just much higher overhead than this. Now, back to contention, which is a big deal again, because now we're sharing queues between threads. One thing we can do to make things a little more efficient is to let the worker only push and pop on its own local queue.
So the owner only uses the left end, and then we make the threads that steal use only the right end, separating the responsibilities a bit: the owner pushes and pops from the left, and other threads steal from the right. And there's actually a very, very optimized data structure for exactly this use case, used in almost every work-stealing runtime out there: it's called the Chase-Lev deque, and it's backed by an array.
So it's pretty fast and easy to manipulate, and the other notable feature is that it's lock-free. You hear lots of people talking about lock-free code like it's this fancy new thing, and it is pretty great; it's not a magic bullet, but it is pretty great in a lot of situations, like this one. The basic idea behind it is that, instead of using locks to manage your concurrent memory accesses, you use atomic memory operations, which are a little bit quicker.
The basic pattern for many lock-free data structures, like the Chase-Lev deque, is to use a compare-and-swap operation. What you do is read a word; in this case it would be the index of the top element of the deque, because you're trying to steal something. Then you do some stuff, say, take the element at that top index and grab a pointer to it. And then you try to modify the original top index, decrementing it because you removed an element from the deque.
But if someone else is doing this at the same time, you can have multiple thieves dequeue the same element and decrement the index at the same time, and bad things happen. That's what the compare-and-swap prevents: it says, only decrement that index if no one else has done it first. So by being the first worker to succeed at this, you guarantee that you are the only one who dequeues that task. This compare-and-swap style of coding makes many concurrent data structures very quick, and a lot more complicated in the code.
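A drastically simplified, steal-only sketch of that compare-and-swap step (the `StealQueue` type is made up for illustration; a real Chase-Lev deque also manages the owner's end, buffer growth, and weaker memory orderings, and its thieves move the index in the opposite direction from this sketch):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// A steal-only toy queue: a fixed array of task ids plus an atomic
/// index marking the next element a thief may take.
struct StealQueue {
    tasks: Vec<usize>,
    top: AtomicUsize,
}

impl StealQueue {
    /// Try to steal one task. Returns None once everything is taken.
    fn steal(&self) -> Option<usize> {
        loop {
            // 1. Read the current index.
            let t = self.top.load(Ordering::SeqCst);
            if t >= self.tasks.len() {
                return None; // nothing left to steal
            }
            // 2. Read the element at that index.
            let task = self.tasks[t];
            // 3. Publish the steal only if no other thief raced us:
            //    atomically move `top` from t to t + 1.
            if self
                .top
                .compare_exchange(t, t + 1, Ordering::SeqCst, Ordering::SeqCst)
                .is_ok()
            {
                return Some(task); // we alone dequeued this task
            }
            // Lost the race; reread the index and retry.
        }
    }
}
```

The `compare_exchange` is the "only decrement it if no one else has done it first" step: exactly one thief's update succeeds, so exactly one thief owns the task.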
So now we've got all the bits; let's assemble them together. We've got N worker threads, one per CPU core, and we have a Chase-Lev deque for every single worker thread. Whenever you make new work, you push it onto your local Chase-Lev deque; whenever you need to do work, you dequeue it locally; and if there's no work to do locally, you steal from the other side of a random other scheduler's Chase-Lev deque. So: results. That's pretty much it.
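That whole policy, push and pop locally at one end, steal from the other end when empty, can be simulated single-threaded in a few lines (illustrative only: real workers run concurrently on a Chase-Lev deque and pick steal victims at random, while here we take the first non-empty victim so the result is deterministic):

```rust
use std::collections::VecDeque;

/// Single-threaded simulation of the scheduling policy: worker `me`
/// pops local work from the front of its own deque, and when that is
/// empty it steals from the back of another worker's deque. Returns
/// the tasks in the order worker `me` would run them.
fn schedule(mut deques: Vec<VecDeque<u32>>, me: usize) -> Vec<u32> {
    let mut ran = Vec::new();
    loop {
        // 1. Prefer local work: pop from our own end of our own deque.
        if let Some(task) = deques[me].pop_front() {
            ran.push(task);
            continue;
        }
        // 2. Out of local work: steal from the other end of a victim.
        match deques.iter().position(|d| !d.is_empty()) {
            Some(victim) => {
                let task = deques[victim].pop_back().unwrap();
                ran.push(task);
            }
            None => return ran, // no work anywhere: we're done
        }
    }
}
```

Because the thief takes from the opposite end to the owner, the two only touch the same slot when a deque is nearly empty, which is what keeps contention low.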
So, the Rust implementation. The first fact about it is that it's not actually done yet. The work-stealing part of it is there, and the rest of the scheduler is pretty much all working in Rust; we turned it on by default yesterday, which was pretty exciting. But the Chase-Lev deque is currently not in use, so we're missing that big part; it should be a big speed-up when we implement it, and there are a lot of other really simple optimizations we can do to make this go a lot faster.
But surprisingly, despite all these low-hanging optimizations still to do, we are pretty quick. So here are some benchmark graphs comparing the old runtime, which was written in C++, and the new runtime, which is written in Rust and has work stealing enabled. On one side we've got the message-passing benchmark, where the two are pretty equivalent.
The new runtime is a little slower than the old runtime in many cases, just because we haven't optimized it that much, but where the new runtime does pretty well is in spawning new tasks. The spawn benchmark spawns a million tasks from a single thread; it's sort of a worst-case scenario for straining the spawn system, and the new runtime handles it pretty well, while the old runtime handles it very poorly.
Not having shared mutable state is actually really useful, even though you do need shared mutable state in the implementation of a runtime, because it forces you to really think about when you're doing it and to make it very explicit. We're hoping the new runtime will catch up on message passing pretty soon and be pretty fast. And of course, I'll thank my mentor, Brian Anderson, the rest of the runtime interns, and the rest of the Rust and Servo teams at Mozilla. Any questions?
[Audience question]
To repeat that for everyone else: one of the things we do is that every pass through the scheduler, every iteration, actually runs by processing an event in an event loop, and right now we just need to make sure the scheduler processes an event whenever there's work to do, so we send it events all the time, saying "go do this work, go do this work." And every time we ping it with a new event, we're creating a new callback object, which is another allocation and a bunch of work to make sure it handles that event. We're doing this a few times per message pass, when in reality we should be doing only one allocation and then sending many fewer of these callbacks, so that'll be a big speed difference.

[Audience question]
Yeah, I've got two more weeks; I should be able to finish this pretty easily. As an example of an easy optimization: we fixed the debugging print macro so it doesn't format the strings when it isn't going to print them, and we saw a 50% performance increase on the benchmarks for the new runtime. That was a one-line pull request.
[Audience question]
Yes. As an example of how hard it is to introduce these races: when I was preparing this presentation, dealing with races wasn't actually something I thought of, because I didn't have to deal with them very often; there were just that few of them. There are other segments of the runtime that do have races, but the scheduling code had very, very, very few, if any, and that's all thanks to Rust.