From YouTube: Bay Area Rust Meetup January 2016
Description
Bay Area Rust meetup for January 2016. Topics TBD.
A: Hello, everybody, welcome again. Oh, sorry, my computer's... hi everybody, welcome again to another Rust meetup. It's wonderful to see you all. It's been a really long time coming, and I'm glad that we've had so many people show up here. So tonight's event is all about some of the really neat concurrency and parallelism libraries that have started to come out of the community. But, you know, we're back, yay; sorry that we had such a long break. I've already got things lined up for the next couple of months.

Niko will be talking about his library Rayon for doing very simple threading, Aaron will be talking about Crossbeam for writing concurrent data structures, and then Huon will close out the evening with simple_parallel. I also just want to mention that I've already announced the next Rust meetup in February. You should sign up; please sign up for the wait list. We had 50 people signed up for the meetup before I even announced it.
B: Hello, everybody. You can see I'm calling in from Boston; it's a little bit chillier here than I think it probably is over there. So, right, I'm going to talk to you about Rayon, so I guess we might as well bring up the slides. Hopefully you can see these and they look good; yell loudly if something's wrong.

Right, so I'm going to talk about Rayon, which is a library for doing data parallelism. Basically, what data parallelism means, at least to me, is that you have a lot of data and you'd like to process it in parallel. This is a little bit disjoint from task parallelism, which is more like: I have a lot of different things I want to do, and I want to do them in parallel, but they're not operating on the same data. Data parallelism is more: I have a big array I want to subdivide, or something like that.

So the goal of Rayon really is to be able to take sequential code that you wrote and make it run in parallel very easily, with sort of minimal effort and minimal risk that it's going to do the wrong thing. So here's an example of a piece of sequential code that iterates over a list (a slice or a vector or whatever) of stores, and for each store it calls compute_price, which is going to do some computation, and then I'm going to sum up the results, written as the conventional iteration.

So it seems pretty clear what it's going to do. But if I happen to know that all these computations for computing the price can be done in parallel, independently of one another, I'd like it if all it took to make this run in parallel was to change this .iter() to something like .par_iter().

All right, so I'd like to be able to just write par_iter, and now I would have the same semantics, but it would run in parallel. Well, let's say it would potentially run in parallel: the work would divide, and the reason I say potentially is that it'll depend a lot on the computer you're using.
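To make the shape concrete, here is a minimal sketch of the example from the slides. `Store` and `compute_price` are hypothetical stand-ins (the talk doesn't show their definitions); `par_iter` itself is Rayon's actual parallel-iterator entry point.

```rust
use rayon::prelude::*;

struct Store;

// Stand-in for the per-store computation from the slide; assumed to be
// independent and side-effect free, so the calls can run in parallel.
fn compute_price(_store: &Store) -> f64 {
    0.0
}

fn total_price(stores: &[Store]) -> f64 {
    // Sequential version: stores.iter().map(compute_price).sum()
    // Parallel version: only the iterator constructor changes.
    stores.par_iter().map(compute_price).sum()
}
```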
The idea, I guess, is that it should really enable parallelism very easily, with minimal work, but give you some ability to customize and control it later when you need it, if you want to get the maximum performance. Another important point is that you should be able to do parallelism as a kind of implementation detail, and what I mean by that is:

If you have some function that's doing sequential work, I'd like to be able to convert it to do the work in parallel, and none of the callers of that function ever have to know; the signature of the function doesn't change, it's just something that happens inside. It just gets done faster. And finally, it's really important to guarantee data-race freedom. What I mean by this is: I said we want to make parallelism easy, and I showed you an example where it was very ergonomic.

You didn't have to type a lot. But that's not the only factor in making things easy. You also want to know that when you do this, your program is not going to suddenly do the wrong thing, or hit undefined behavior, or compute crazy results. So I'd like it that if you say that something is parallelizable and you're wrong, you get a compilation error. That would be the goal.

So basically, if you put all that together, what it means is you should be able to sprinkle on parallelism hints and suggestions willy-nilly, without a lot of fear that something is going to go amiss and your program is going to start misbehaving. As for where Rayon's current status is, I would say it's still very experimental; I don't think people should build production systems on it. However, it's pretty promising, and I think it delivers on these goals pretty well.

Basically, there are some caveats and things, but it's able to do all these things that I'm talking about, as we'll see in this talk. So I showed you this parallel iterator API, which is something that's in the Rayon library. But it's not the foundation of Rayon; it's actually just a kind of layer built on top of a more primitive API, which is what I'm going to spend most of my time talking about, and that API is called join.

So the goal is, you should be able to use join; this is kind of the thing that you sprinkle around wherever parallelism is possible, and then you let the library and runtime essentially decide when it's profitable to use it. Join is really well suited to divide-and-conquer kinds of algorithms; you're probably familiar with this term.

But basically the idea is, you have some big problem you have to solve, and it's kind of intimidating and you're not sure how to solve it, but you can break it up into two smaller problems, and then you can recursively try to solve those. And if you keep doing this, eventually you get down to some really tiny problem that's really easy. The idea is that when you're doing this kind of divide-and-conquer step, each time that you divide up into two problems, you can call join to process those two problems in parallel.

So let me show you an example. This is quicksort, my favorite sort, probably everybody's favorite sort. I imagine most of you know quicksort, or maybe more accurately, once knew quicksort; since then I've just used it as a library. That's kind of how I was for a long time, until I wrote the blog post that I'm basing this on. So let me just briefly review how quicksort works real fast. It's a classic divide-and-conquer kind of algorithm.

If we pick three as the pivot, we would rearrange everything, so we have one, three, two on the left and then twenty-two and six on the right, and we have this midpoint that gets returned, which is basically where the dividing line was, such that everything to the left was less and everything to the right was greater. And now this is what we can use to divide up our problem. So we can call split_at_mut, which is a method that takes one slice and gives you back two, split around some midpoint, and we'll end up here with basically two slices: one pointing at the left half, everything that was less than or equal to the pivot, and one pointing at the right half, everything that was greater than the pivot. And now I can recursively process them.

So when I'm done, first I'll have sorted the left half and then I'll sort the right half. Okay, pretty easy. So if I wanted to parallelize this, well, I can just do what I said and insert a call to rayon::join exactly at the point where I have two subproblems. So this is the parallel quicksort. It looks pretty similar, except here I'm calling join, and so you see that it took very minimal work to do this.
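As a rough sketch (assuming a conventional Lomuto-style `partition` helper, which the talk doesn't show), the parallel version differs from the sequential one in just the last line:

```rust
fn quick_sort<T: PartialOrd + Send>(v: &mut [T]) {
    if v.len() <= 1 {
        return;
    }
    let mid = partition(v);
    let (lo, rest) = v.split_at_mut(mid);
    let hi = &mut rest[1..]; // skip the pivot, already in its final place
    // Sequential version: quick_sort(lo); quick_sort(hi);
    rayon::join(|| quick_sort(lo), || quick_sort(hi));
}

// Lomuto partition around the last element; returns the pivot's index.
fn partition<T: PartialOrd>(v: &mut [T]) -> usize {
    let pivot = v.len() - 1;
    let mut i = 0;
    for j in 0..pivot {
        if v[j] <= v[pivot] {
            v.swap(i, j);
            i += 1;
        }
    }
    v.swap(i, pivot);
    i
}
```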
But what's even more exciting, and I think very cool about this, is that we didn't change the signature at all. And we're actually doing a not completely trivial thing to parallelize here, because we have mutation going on, and we're also potentially working with stack-allocated data. These things are usually a little bit tricky to get right, because if you start up a thread, it runs asynchronously

with respect to the stack frame that started it. So working with stack-allocated data can be a little risky: if you return or do something wrong, you might pop that data while the other thread is still active. But the reason that all this works with Rayon is basically that, because join always waits until both closures have finished before it returns, you know that it can have safe access to stack-allocated resources just fine; the stack frame isn't going to return until both threads have done so.

The library kind of handles that aspect for you, and that makes sense: if you're taking a sequential algorithm, you were going to do both those things anyway before you returned, so you might as well wait for the threads to finish. It's the same algorithm from the caller's point of view. And in terms of mutation,

this works out very well because of Rust's basic type system, which guarantees that if you have a mutable reference to something, you're the only one that has access to that data. So it's okay to give it to another thread and let the other thread mutate it; no one else can be reading it. We'll see a little more on that in a bit.

I want to start out, though, by focusing instead on how Rayon decides when to run things in parallel at all. The technique that I'm using is something called work stealing, which I did not invent at all. It's been around a long time, and I think it was invented, at least the first time I know of it, by Cilk.

Cilk was a project at MIT from the 90s, and that Cilk is actually now available as part of Intel's C compiler, and there have been a number of projects that have used the technique in the meantime. So it's a really time-honored, well-respected technique, and the basic idea is like this. Traditionally, when you parallelize, you might take a bunch of work and assign it to different threads up front, but work stealing is more of a dynamic load-balancing idea. So I have a bunch of threads in a pool.

In this case, let's say I have two threads, and they're going to find work for themselves. Let me show you; it's easier to illustrate it than to describe it. So we start out with two threads here, and thread B is basically idle initially; it has no work to do. But thread A starts out with the job of the big problem I started with. In this case that's quicksort; that's what QS stands for, calling quicksort on a slice.

Let's say it's 22 elements long to start. Then we saw what quicksort does: once it starts working on a particular problem, it subdivides the problem into two subproblems and then it calls join on both of them. And what join is going to do, basically, is take these two closures and start executing the first closure; but just before it does that, it's going to post a little note into a thread-local queue with the second closure.

So this queue is basically tracking work that we plan to do later but haven't gotten to yet. So we're starting in on 0 to 15, and then we've put 15 to 22 in the queue so we can pick it up later. And when we process 0 to 15, we'll again subdivide it into two tasks, and we'll put one in the queue (that's the 1 to 15) and we'll start in on 0 to 1 right away.

This is the one-length array, so it's going to finish, and when it finishes, what it's going to do is go over to this queue and take off the most recent thing that it put in. So it's actually a double-ended queue, or a deque; you can use it like a stack or a queue. It's a minor point, but that means we're going to say: okay, well, we finished 0 to 1.

So what thread B is going to do: whenever a thread is idle, it tries to find work for itself, and it does that by looking at all the queues from all the other threads. So now thread B might come over and say, you know, I've got nothing to do, and thread A has this 15-to-22 task just hanging out in its queue; maybe I'll be helpful and I'll

take that work on. And we call that stealing. So thread B is going to steal this task and execute it. It's called stealing, but it's kind of a weird sort of stealing; it's like if a thief came into your house and saw that you haven't finished your dishes and started doing them for you. It's pretty nice; it's a nice kind of stealing. So thread B is going to execute 15 to 22, and it'll do the same thing.

It will subdivide this up into two problems, post one of them into its local queue, and start in on the first one, and it'll do that again; it basically works exactly the same way. So now, when it finishes a problem, it'll take something off of its queue, just like thread A did. And so let's say now it's processing there. Now, if we come back to thread A for a second:

That means thread A is idle, and when threads are idle, they go look for other work to do. So now we can go look at thread B's queue, and we can say: hey, here's some work that thread B had started but hasn't finished yet, 18 to 22; I could steal that back. And so now thread A is going to do 18 to 22 while thread B is over there, and you can kind of imagine this proceeds like this until all the work is done. So everybody kind of just does the work they have.

They create new work and throw it on their queue, and hopefully somebody else takes it before they can even get to it. So it's kind of like, if that thief came in and was washing my dishes, and I see that they haven't gotten to a particular pan, I could take it back and start working on it; everyone's working together to finish all the problems. What's nice about this is you get a kind of dynamic load-balancing effect.

Quicksort is not the best example of that, but even here you can see that the tasks don't all take the same time. Some of them are for big pieces of the array; some of them are for smaller pieces of the array. And if you have tasks that take a variable amount of time, which is often the case, then it's really hard to pre-allocate those to two threads, because you might end up giving one thread a lot of really big, expensive tasks and another thread

a bunch of really easy tasks, and then the other thread will finish really fast, and you just won't get a very efficient use of your resources. But with work stealing, that's not a big problem, because if one thread finishes fast, it'll just go take work from the other threads and try to take care of that

instead. So here's a big wall of code. This is kind of the header of join, in a little bit of pseudocode, and the reason I'm showing you all this code is that I want to highlight some of the Rust features that let you write something like join as a library in Rust and have it work pretty efficiently, which I think is pretty cool. So if you remember, join takes two closures, and here are the arguments to join: oper_a and oper_b; those are the two operations.

Closures in Rust are just generic types that implement a callable trait. So here A and B are the types corresponding to the closures, and here is where we say that they must implement the FnOnce trait, which means that they can be called at most once.
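That signature looks roughly like this (this matches the shape of Rayon's actual `join`; the `Send` bounds are the ones discussed a little later in the talk):

```rust
// Two distinct generic closure types, each callable at most once
// (FnOnce), each returning its own result type.
pub fn join<A, B, RA, RB>(oper_a: A, oper_b: B) -> (RA, RB)
where
    A: FnOnce() -> RA + Send,
    B: FnOnce() -> RB + Send,
    RA: Send,
    RB: Send,
{
    // body sketched below
    unimplemented!()
}
```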
The reason that these are actually two distinct types, and why that's really nice, is that every time you call join, you will get a custom version of join: the compiler will monomorphize it, meaning it will create a custom version of join specialized to those particular closures. That means that LLVM can basically inline and do optimization and hopefully, in general, reduce the overhead of join in the sequential case. There's no virtual dispatch, there's only static dispatch, and so forth.

Now, this is the kind of pseudocode for what I just described. So, if you remember, we always push something onto the queue and then execute the other task. So here's the push-onto-the-queue part, and the thing I want to highlight here is that we don't do any allocation when we push onto the queue, at least not in the common case. For the oper_b closure that we're pushing onto the queue, we can just push a reference to the closure that lives on the stack.

So we don't even have to allocate any memory to do that; it's just a few writes. And then we can call the first closure and then execute the second closure, and each of these calls, as I said, is statically dispatched, which means they can be inlined very easily by LLVM and analyzed very easily. And then finally, here, this is just a little loop that says: well, if somebody stole my work, then I'll help out until it's ready.
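In (heavily simplified) pseudocode, the body does something like the following. The queue and job types here are hypothetical stand-ins for Rayon's internals, not its real API, and this is not compilable as-is:

```rust
// Pseudocode mirroring the steps just described.
fn join_body<A, B, RA, RB>(oper_a: A, oper_b: B) -> (RA, RB)
where
    A: FnOnce() -> RA + Send,
    B: FnOnce() -> RB + Send,
{
    // 1. Post a *reference* to oper_b into this thread's deque: just a
    //    few writes, no heap allocation in the common case, because the
    //    closure itself lives in this stack frame.
    let job_b = StackJob::new(oper_b);
    local_deque().push(job_b.as_ref());

    // 2. Run the first closure inline; statically dispatched, so LLVM
    //    can inline and optimize it.
    let ra = oper_a();

    // 3. Pop oper_b back off our own deque and run it inline too; if
    //    another thread stole it, help execute other work until the
    //    thief is done.
    let rb = match local_deque().pop() {
        Some(job) => job.run(),
        None => {
            while !job_b.is_complete() {
                steal_and_run_other_work();
            }
            job_b.into_result()
        }
    };
    (ra, rb)
}
```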
Okay, the last thing on this slide that I just sort of glossed over is this Send trait. If you are familiar with Rust, you'll know it; or if you've used multiple threads in Rust, you've probably encountered it. Send is basically the trait that identifies data types which are safe to send to another thread.

So it's things which can be safely transported between threads and won't cause any data races. And when I say that the closures have to be Send, what I mean is that all the data that you use from your closures (so in the case of quicksort, that would be the array that we're sorting, and so forth) has to be sendable to another thread. That makes sense, because join might cause those two closures to execute in parallel, or it might not.

But in the case that it does, the data had better be safe to go to another thread. Now, as it happens, in Rust most types actually are Send, and almost everything is sendable by default. The only exception is basically these three types, and things that use these three types. If you use the Rc type, reference-counted data, that's not thread-safe, because the reference count doesn't use the correct atomic operations to increment it, which means if two different threads increment the ref count at the same time,

they might mess it up, basically. But you can use the Arc version, which is the thread-safe version, and it will work just the same. And the next two, RefCell and Cell, are basically the ways to introduce mutation into aliased data; those are a little bit tricky.
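A minimal illustration of the Rc/Arc difference (the first function is rejected by the compiler, which is the point; the error text is approximate):

```rust
use std::rc::Rc;
use std::sync::Arc;

fn not_thread_safe() {
    let data = Rc::new(vec![1, 2, 3]);
    let data2 = data.clone();
    // error: `Rc<Vec<i32>>` cannot be sent between threads safely
    rayon::join(move || data.len(), move || data2.len());
}

fn thread_safe() {
    // Arc bumps its count with atomic operations, so it is Send.
    let data = Arc::new(vec![1, 2, 3]);
    let data2 = data.clone();
    rayon::join(move || data.len(), move || data2.len());
}
```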
Suffice to say, there are thread-safe equivalents like Mutex and the atomic types, but you have to be a little bit careful. So if you find that you're getting compilation errors due to the fact that you have Cell and RefCell, what that's kind of telling you is that this code is not going to be trivial to parallelize. This gets back to the safety guarantee I was talking about in the first place: if you're just adding parallelization and you wind up getting errors, you're actually getting pretty useful information.

It's saying that there's some shared state that gets mutated, and you want to think about it. You could add a Mutex or some other mechanism, but at that point it's no longer guaranteed that your parallel version is going to behave the same as your sequential version, so you have to think about it a little bit. So that's one aspect of safety, basically: we want to make sure that you don't transport types that are not thread-safe across threads. And then the other is:

You want to make sure that you don't transport the same data, especially mutable data, across threads. So this is a version of quicksort which is almost the same as the correct version, except that instead of sorting the low half and the high half on the two different threads, it actually sorts the low half on both threads.

It's kind of funny: I've been using this exact thing as an example for a while, and I always thought, well, I don't know if people really make such a simple mistake. But then, when developing Rayon, I actually made several mistakes exactly like this, several times, and the compiler caught them for me, and I was very happy. So what will happen, basically, if you do something like this, where you accidentally use the same state on both threads, is that the compiler will complain, and that's the message you get. What it's telling

you is basically that if you want to mutate state in Rust, you have to have unique access to it. So here I have two closures; they both have access to the same data, so neither one has unique access. And what's cool is that this rule wasn't designed with parallelism in mind, but it fits parallelism very well. The compiler doesn't necessarily know that these closures are running in parallel, but it knows that it needs to keep mutable access unique, and so it just kind of falls out from that.
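Sketched with the same hypothetical `partition` helper as before, the buggy version and the error it produces look roughly like this:

```rust
fn quick_sort_buggy<T: PartialOrd + Send>(v: &mut [T]) {
    if v.len() <= 1 {
        return;
    }
    let mid = partition(v);
    let (lo, _rest) = v.split_at_mut(mid);
    // Oops: both closures capture `lo` mutably. Rust rejects this:
    // error[E0524]: two closures require unique access to `lo`
    //               at the same time
    rayon::join(|| quick_sort_buggy(lo), || quick_sort_buggy(lo));
}
```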
So that's how join works, and I've kind of delved into it in some detail. Now, this API that I showed you in the beginning was a parallel iterator, and I'm not going to go into it in this talk because I don't want to take a long time. But basically, you can build up the parallel iterator in almost entirely safe code using join; and in fact, it was while building up the parallel iterators that I made that mistake of passing the same half to both closures, at least when building Rayon.

So that's pretty cool: join is actually flexible enough to be used to build up abstractions on top that are even easier to use and apply to other situations. And I have written up the details of how this works, at least at a certain level, in a blog post that I'll show a link to later.

So if you're interested, you can go check it out, but the basic idea is that it will divide up the work that the iterator is going to do recursively, in the same fashion. Right, so in conclusion, I guess there's a couple of points I wanted to kind of bring across, and the point that I really don't want to get across here is "you should go use Rayon."

What I do want to get across is that I think there are some goals and lessons to be learned from Rayon that I think we should try to incorporate, and I hope we will end up with a package much like Rayon which is very widely used. It doesn't have to be Rayon; I'd be perfectly happy if somebody else made a better one, but I think these are some really important things about it.

Rust gives you the kind of guarantees you want, but it's also great that when you're writing sequential code, you're kind of writing parallel-ready code without even realizing it, so to speak, just because that's the way that you normally work with code in Rust. And the fact that we have these lightweight closures that can be stack-allocated, that have static linking and easy inlining, also means you can do this kind of thing just with the compiler, as a library.

These links are more for people who come and look at the video later, but this is where the Rayon project is on GitHub, and this really long URL that I had to put in a small font is actually the blog post that I was talking about, that kind of goes over everything I said here but in somewhat more detail. So that's all I have to say. Thank you very much. If you have some questions, I would love to answer them, I guess.
C: [question not transcribed]

B: Well, you start out with kind of: the queue has the really big problems at the top, and they get smaller and smaller as they go down. And so when a thread goes to its own queue, it always pulls from the bottom. So it takes the smallest task, basically the most recently pushed thing, which means it's the most likely to have the same cache behavior as the thing you just did; you get kind of maximum locality that way.

But when you steal, you steal from the other side, which means you're stealing the work that is most remote from what the processor is doing right now, and thus the one that kind of represents a big chunk, like the right half of the array; that'll be kind of a whole different set of cache lines. You would get the minimum benefit, basically, from being executed by the original processor. However, I am not sure; most of the kind of papers and things that I saw that did measurements were, I think, pretty old, so I'm not sure. Modern architectures are different, and I'm not sure if anyone's done any recent studies, but I know that the ones that I saw showed quite good cache behavior.

D: [question not transcribed]

B: By default, yes. I have some code paths in there, not very well tested, that let you create additional thread pools in a kind of dynamically scoped fashion, so you can kind of enter into a thread pool, and then the work that executes within there will go in that separate thread pool. But in general, it's probably not a good idea: unless you have a specific reason that there should be more than one thread pool, it's usually a good idea to share them, because you've only got so many cores, essentially.

E: So you were talking a bit about the safety of Rayon, or the part having to do with data races, but you mentioned in passing that if you use something like a mutex, you can accidentally reveal the fact that these things are running in parallel. Have you thought about a way to rule that out, similar to Send?

B: I haven't, mostly because I think it's really useful. There are a lot of algorithms where you actually want to communicate between the parallel workers. A simple example is a kind of branch-and-bound search, or something like this, where you're kind of searching through a space and you want to know what's the best result that anyone has found so far. If you're running sequentially, that's just going to be, you know, the current best found so far; but if you do have the opportunity to parallelize, you'd like to know if someone else found a better result somewhere else, so that you can stop searching unproductive avenues. This is used to solve the Traveling Salesman Problem and stuff like that. So I'm not sure it's a problem, basically, that you can reveal the parallel execution.

If you're using a mutex, you're already in a parallel mindset; you've already delimited your transactional boundaries where you acquire the lock and unlock it. So I'd be surprised if it actually led to bugs in code. But it does seem like we'd want something like Send, except... I don't know, I haven't thought about it that hard, but it could be.

E: [follow-up not transcribed]

F: [question not transcribed]

B: This kind of comes back, to some extent, to the shared thread pool question that was raised earlier. But if you have multiple threads with different priorities and they're each starting up parallel tasks, we don't give one thread pool higher priority; the tasks in the thread pool all have equal priority, wherever they came from. It's an interesting idea, potentially.

The challenge is that they can get mixed: there's one thread pool, which is basically a totally different set of threads than the ones that are normally used in your application. So you can have tasks intermingled from different application threads, which may have different priorities. I guess they could set and unset the priority as they go.
E: Can everyone hear me? Cool. Hey everybody, I'm Aaron Turon, another member of the Rust team here at Mozilla, and I'm excited tonight to tell you about a library I've been working on called Crossbeam. So Crossbeam is targeted at sort of a different layer of the stack than the library that we just saw. Niko's library, Rayon, is all about introducing parallelism into your program, and it relies on some underlying data structures, like work-stealing queues. Crossbeam is targeted at those kinds of data structures.

Basically, it's a library for high-performance concurrent data structures, and in particular the sort of big-ticket item is lock-free data structures, which I'll also spend a lot of time telling you about tonight. But I'm also interested in userspace synchronization, which is closely related: things like semaphores, barriers, latches and so on. And then there's a sort of central topic that ties all this together, which is concurrent memory management; a lot of work on lock-free data structures ends up there.

This is the amount of time, sort of per operation, in a multi-threaded test case. So this is basically just showing that the overhead of the memory management scheme I'm going to show you today is quite good, and in fact it's pretty easy to do better than Java. The Java ConcurrentLinkedQueue here is highly tuned; the stuff in Crossbeam is, like, textbook algorithms with very little effort put into them. Okay, oh yeah.

Yeah, I mean, well, we'll get into more detail on this. I should say, feel free to ask questions throughout. This is definitely going to be a little bit more technical than the previous talk. Okay, so I realize that probably a lot of you don't know a lot about this low-level area of concurrency, so I want to start by talking about some of the important trade-offs in that space. And so, of course, we have to talk about the cache.

That's always a huge determiner of performance, and the cache story gets a lot more complicated once you start talking about having multiple cores. So this diagram shows a fairly typical architecture where you have a number of different cores on the same die. Each of them has a dedicated L1 cache, and then they have a shared L2 cache.

There's a protocol for keeping these caches coherent. A typical one is the so-called MESI protocol for cache coherence, and the idea is pretty simple. Basically, each cache line has a state at any given time. Modified just means that it's dirty, basically; it needs to be written back to RAM. Exclusive means it hasn't yet been changed, but this core is claiming it is the only one with a valid cache line for this address. Shared means perhaps multiple cores have this cache line out, but none of them is in a position to write. (And Invalid means the line no longer holds valid data.)

So what you want is to be able to access shared data structures where, if you're accessing different parts of the data structure (say you have a shared hash map, and one core is looking up one key and another core is looking up a different key, and those are in different buckets), those cores should not have to talk to each other. They should not be invalidating each other's cache lines.

Sometimes, when you're working in this space, you actually don't want cache locality, because it's easy to have packed into a data structure some data that is relevant to one core and some data that's relevant to another. They're sitting on the same cache line, and now the two cores are basically ping-ponging the data back and forth every time they want to write. This is called false sharing. Okay, so that's like some sense of the kind of space.
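Here is a sketch of what false sharing looks like in Rust, and the usual fix of padding hot fields onto their own cache lines. (Modern crossbeam ships a `CachePadded` type for exactly this purpose; the 64-byte line size is an assumption that holds on typical x86 parts.)

```rust
use std::sync::atomic::AtomicUsize;

// Two counters, each written by a different core. Packed like this,
// they likely share one cache line, so the cores ping-pong it.
struct Counters {
    a: AtomicUsize,
    b: AtomicUsize,
}

// The fix: force each hot field onto its own 64-byte cache line.
#[repr(align(64))]
struct CachePadded<T>(T);

struct PaddedCounters {
    a: CachePadded<AtomicUsize>,
    b: CachePadded<AtomicUsize>,
}
```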
With locks, we still end up invalidating these cache lines. Now, there is a sort of finer-grained approach where, instead of having a global lock around the hash table, maybe you just have locks around each bucket. But that's still not as good as you might hope for, because again, when you're reading from the hash table, you still have to acquire those locks; that still involves a write, and that can still lead to cache misses. Okay, so does that roughly make sense? The details here aren't super important.

I just want to give you a flavor of the kinds of things you're thinking about. Okay, so all of those concerns lead you in the direction of something called lock-free data structures. I mean, if you think about it, locks fundamentally can't achieve the goals that we set out in the beginning: they can't let you avoid doing a write when all you want to do is read. So we need some other way to control concurrent access to a data structure that doesn't involve locks.

So I want to teach you briefly how to write a lock-free data structure, and I'm going to use the sort of "hello world" of lock-free data structures, which is a stack. And it's not nearly as scary as you might think; it's actually pretty simple. So the representation we'll use for this stack is just a simple linked list: the stack has a head pointer pointing to the current head, and then the nodes form a singly linked list all the way to the end.

So, can everybody read that? It's kind of small; my apologies if you can't. So here's the code for pushing onto such a stack. We allocate a new node; that's not real interesting. We turn that into a raw pointer (whatever those are; just some details in Rust), and then we enter into a loop, and the idea with this loop is we're going to try to optimistically sort of install this node onto the stack, and we're doing so in a way that might be racing with other threads.
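Here is a sketch of the push code being described, using std's `AtomicPtr` and the era-appropriate `compare_and_swap` (which returns the previous value, as the talk describes). The names are mine, not necessarily the slide's:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    data: T,
    next: *mut Node<T>,
}

pub struct Stack<T> {
    head: AtomicPtr<Node<T>>,
}

impl<T> Stack<T> {
    pub fn push(&self, data: T) {
        // Allocate the new node and turn it into a raw pointer.
        let n = Box::into_raw(Box::new(Node { data, next: ptr::null_mut() }));
        loop {
            // Snapshot the current head and link our node in front of it.
            let head = self.head.load(Ordering::Relaxed);
            unsafe { (*n).next = head; }
            // Try to install the node atomically. compare_and_swap
            // returns the value head actually had, so equality means
            // we won the race.
            if self.head.compare_and_swap(head, n, Ordering::Release) == head {
                return;
            }
        }
    }
}
```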
We might lose that race, and so then we have to come back around the loop and try again. And what I mean by optimistic here is: we're going to take a snapshot of the current head. So we load the value of the current head of the stack (that's some pointer), and then we go ahead and take our node and say its next pointer is whatever that snapshot was, and then we try to install that node at the front of the stack.

Basically, we use an operation called compare-and-swap. Compare-and-swap is like the basic building block of all atomicity on most processors, and what it says in this case is: I think the current value of self.head is head, and I want to update it to n, and that update should take place in a single atomic step,

as far as all other cores are concerned. So I'm not grabbing any lock here; I'm just doing the change in place. But if, atomically, head has some other value than the one I expected, then this just won't do anything, and the result I get back is essentially what value head actually had. So I ask: is it the value I thought; is it equal to head? If so, good, I succeeded; I can leave the loop; I've actually installed my pointer. If not, then I lost the race with some other thread.

They got there first; I have to try again. Okay. So let me show that to you more graphically. Say this is the current state of the stack: we have a head pointer, we've got two nodes. A thread comes in and it wants to push this node with the value seven. So it allocates that node, it reads the current value of head and links it into the allocated node, but it hasn't yet changed the head pointer. Meanwhile, some other thread comes along and pushes a different node.

So now we have our local node that has seven, but the actual head pointer has changed. At this point, if we do our compare-and-swap, it's going to fail, because we think that we're pointing to the node that contains three, but actually that's changed. That's good, because if we actually succeeded, we would have just dropped the node with five on the floor; that's what we're trying to avoid. So then we have to retry: we get a fresh snapshot, we re-point our next pointer, we do another CAS, and this time we succeed.

J: [question not transcribed]

E: Right, so if we look back at the representation, basically everything you need to declare is in this atomic pointer aspect. That's enough to tell Rust and LLVM: you know, I'm going to be modifying this in a sort of concurrent way, in a racy way, and that's what makes what we're doing here not a data race, but actually a synchronization race, which is a good kind of race.

So let's look real quick, then, at pop, which is pretty similar. So in this case we don't need to do any allocation up front; we're just trying to pop a node off. So we enter right into our retry loop. We grab a current snapshot. If we see that the head is actually pointing to null, then we have nothing to do: we have observed a moment in time where the stack is empty, and so we just return that the stack was empty.

There is nothing to pop. Otherwise, we take a look at the node we just snapshotted, what we think is the current head; we look at what its next pointer is, and then we go back around and try to change head from the one we read to its next pointer. And the same story happens as before.
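Continuing the sketch above, the pop side looks roughly like this, with the leak the next paragraph talks about left in deliberately:

```rust
impl<T> Stack<T> {
    pub fn pop(&self) -> Option<T> {
        loop {
            // Snapshot the current head.
            let head = self.head.load(Ordering::Acquire);
            if head.is_null() {
                // Observed a moment where the stack was empty.
                return None;
            }
            // The dangerous dereference discussed below: nothing stops
            // another thread from popping and freeing `head` first.
            let next = unsafe { (*head).next };
            if self.head.compare_and_swap(head, next, Ordering::Release) == head {
                // We won the race: read the data out of the node...
                // ...and never free the node itself. This leaks.
                return Some(unsafe { ptr::read(&(*head).data) });
            }
        }
    }
}
```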
If some other thread pushed or popped a node in the meantime, this compare-and-swap will fail and we'll try the whole thing again. And for people who know Rust deeply, you understand what the pointer-read stuff is doing here; for people who don't, don't worry about it. The key point, though, is: this implementation leaks nodes. Okay, you have to understand some of the subtleties of Rust to actually be able to see that this is happening, but effectively we're using these raw pointers, these unsafe pointers in Rust, and those don't have any automatic dropping going on.

We have to actually get those into some owned value in order for them to be deallocated. So here we're just reading the data out of the node and then not doing anything. Okay, we successfully unlink the node, but we never actually free its memory; we have a leak. Okay. Now, if this were Java, we'd be done, because the garbage collector has our backs, and that's great. But things are not so nice in Rust. So, you know, you might think: well, okay, let's just deallocate it here.

It's too good to be true, right? So the problem is, and this is quite subtle, but if we go back to this pop implementation, there's something very interesting going on here. So we take this snapshot of the current head. Now we have a node in our hand, and the moment after that snapshot, that node may or may not be in the stack. We have no idea. This is totally concurrent; we haven't acquired any locks.

No other thread even knows we're here. So we have this node in our hand, but a moment after, we really have no clue what's going on with this node; yet we're about to follow its pointer and get a field out of it. If that node was popped in the meantime and someone freed it: segfault. So basically the problem is there are these aliases to the nodes floating around, because other threads are taking snapshots and following those pointers, and we have no clue

this is happening. So we don't know when we can actually safely deallocate memory, and that's the fundamental memory management problem here. Now, in the GC world, again, everything is great, because, you know, sort of concurrently, at some point the GC will run and will clean up everything; it will realize there are no snapshots left, so it's okay to actually deallocate this. But we have to find some other way to actually achieve this goal in Rust. Makes sense? Any questions? Cool, okay. So how do we do it?

Okay, so there are basically two main observations to make at this point. So the first I've already sort of mentioned: the basic problem here is, yeah, we know that we're the only thread that actually unlinked this node from the data structure, but other threads could have stale snapshots of that node, and they might want to actually dereference those snapshots. So that's problem number one, or observation number one. But there's another observation, which is that once we've removed the node from the data structure, no new snapshots are going to be made.

The only way to get a snapshot is to start from the head of the data structure and actually follow a pointer that reaches this piece of data, but we just removed the key pointer from head. So there's no way another thread could create a new snapshot; it's just a matter of waiting until all the snapshots that are in flight right now have gone away. We just need some way to figure that out. Okay, but the problem is: the whole point of this exercise,

the whole reason we're doing lock-free programming in the first place, is to avoid coordination between threads. But now we need to coordinate between threads to figure out when it's okay to actually deallocate these snapshotted nodes. So that's like the nut that we really need to get at. Okay. Here comes our salvation. So there's this awesome idea called epoch-based memory reclamation, and this is the memory management strategy that Crossbeam implements. I'm just going to give you the high-level intuition about how this works.

I won't go super into detail, but basically the idea is that we're going to have these grand epochs of time that all threads more or less agree on, and as the epochs progress, snapshots in old epochs, we know, are stale and are sort of no longer a concern. Basically, we have some way of telling that all threads have moved their epoch forward, and now it's safe to remove this node, because no new snapshots could have been made. That's like the very vague intuition.

So the way we do this is: there are three epochs, ever. Three is a very interesting number; you'll see why three a little bit later, but just trust me for now. We have a global counter that says what the epoch is (zero, one, or two), and then each thread has a local snapshot of which epoch it is in, and a flag that basically says: is the thread doing something with the data structure right now, like is it in the middle of a push or a pop? And for each epoch we have a sort of purgatory.

So this is like a place to put nodes that are no longer reachable from the data structure in this epoch, but might still have some outstanding snapshots, so we can't actually free them yet. We know we're going to free them later, but we need to wait until the epochs have progressed far enough before we can do that.
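A sketch of the bookkeeping just described, with illustrative field names that are not Crossbeam's actual internals:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize};
use std::sync::Mutex;

struct GlobalEpoch {
    // The global epoch: 0, 1, or 2, advancing modulo 3.
    epoch: AtomicUsize,
    // One "purgatory" per epoch: nodes unlinked in that epoch that
    // may still have outstanding snapshots, so can't be freed yet.
    garbage: [Mutex<Vec<*mut u8>>; 3],
}

struct ThreadEpoch {
    // Is this thread in the middle of an operation right now?
    active: AtomicBool,
    // This thread's snapshot of the global epoch, taken when pinning.
    epoch: AtomicUsize,
}
```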
Okay. If we want to do an operation like push or pop, we do this pinning operation that sort of enters the epoch system. So we set our thread active; we say, hey, we're doing an operation on the data structure. We take a look at the current global epoch and we copy it into our thread-local snapshot. At that point, we get a very strong invariant that says: any data that is reachable right now from the data structure will remain allocated until this thread becomes inactive.

On the other hand, if we get into a state where we've popped a node and we actually want to deallocate it, we don't deallocate it right now; we put it into the purgatory for this epoch, and then it'll get cleaned up later. Okay, and that's basically the story, from the local thread's perspective, for doing an operation. Of course, eventually you want to actually free some data; that is the point. And the way you do that is you look at all of the threads that are involved with this data structure.

For any of the threads that are active, you check their current epoch counter. If all of them are at the current global epoch, then you attempt to move that global epoch forward by one, using a compare-and-swap to make sure, you know, you settle races to collect; and if you're successful, then you can reclaim data from two epochs ago.

Okay, and this is where the three epochs come in. So the tricky thing is, you can end up with active threads at any time in one of two epochs: basically, the current one or the one just behind. And the reason is, once we do this compare-and-swap that's being described on this slide, we still have a bunch of active threads that are in the old epoch, but some new threads could come in and be in the new epoch

we just moved to. But we know that no active threads will be in the epoch from three steps ago, and we can sort of clean out their data. Okay, like I said, I'm giving you a high-level description; don't worry if you don't 100% follow it. It's kind of subtle stuff, but at the end of the day, it's actually not that hard to implement. Okay, so the algorithm I just described has some really neat properties with respect to the cache concerns

I was telling you about at the beginning. So I was saying, like, the whole game here is we want to coordinate without a lot of cost; we want to coordinate in a lightweight way. And so the thing is, we don't have to collect data at every operation; we can batch it up and deallocate it occasionally, just like we would do a GC when we're sort of running low in the nursery or something.

Moreover, because we're doing this infrequently, the global epoch is usually staying the same, which means all those different L1 caches can have the cache line with that global epoch in the shared mode; they're not invalidating each other. When you do that read of the global epoch, you're getting a cache hit most of the time; it's pretty cheap. Your thread-local epoch is usually in the exclusive mode, because other threads aren't reading it most of the time.

So that's cheap to write to; that's not a cache miss. So the only time you incur real coordination costs, like cache misses, is when you actually do a garbage collection, basically, and those are infrequent. We win. Okay, and moreover, you can apply a single epoch management scheme to any number of lock-free data structures, so you don't even have to do this in a per-data-structure way; you can sort of amortize it over everybody, which is how Crossbeam is set up.

Alright, any questions at this point? [question not transcribed] So there are lots of ways you could do that. Crossbeam tweaks a few details from what I showed you: in Crossbeam, each thread actually has a local purgatory, and basically, when that purgatory gets too big, it says, oh, I should probably try to free this data. But, like, you can imagine many, many schemes for doing this, and it's basically orthogonal to the main other questions. Yeah.

[question not transcribed]

Yeah, you don't have to say ahead of time. Basically, there's a thread-local variable for enrollment, and the Crossbeam API just takes care of the rest. So it's pretty straightforward: basically, once a thread has enrolled once, then it's sort of visible for epoch management forever thereafter, and the enrollment is something that the clients don't have to worry about at all; like, you just use the data structure as normal and everything's taken care of for you.

Okay, let's talk some Rust. So the other interesting thing about this library, or about the memory management in particular, is: how can we take all those ideas and wrap them up in a nice Rust API, ideally with as little unsafe code as possible? And this is one of the parts that ended up being sort of most exciting to me.

The first piece is a pinning operation: when you pin the epoch, you get back a Guard value. Okay, and as we'll see later, various other APIs take a guard as an argument, as kind of evidence that you've correctly entered the epoch system, so you can't get it wrong: you can't interact with the data structure unless you've advertised that you're interacting with the data structure in the right way. Okay, so that's part one. Part two is: we have some pointer types that are sort of akin to the ones in the standard library, but they're meant for this style of memory management. So we have Owned<T>; that's basically like Box<T>.

It's just a piece of heap-allocated, owned data. We have Shared, which looks just like a shared reference, except its memory is managed by this epoch scheme. And we have Atomic<T>, which is just like the AtomicPtr that we saw earlier. But the really cool thing (this is the most exciting thing to me) is: Owned and Shared implement Deref safely. So they're just totally normal pointers.

If you have a Shared pointer with a lifetime and you're in that lifetime, you can dereference it without writing unsafe code, even though that memory is undergoing all this concurrent memory management. So here's how it works. The Atomic type, so this is the type that you use for doing these kinds of atomic updates, has a bunch of operations: load, store, compare-and-swap, and these interact with the guards that I mentioned before in a sort of interesting way.

So if you load a piece of atomic data... oh, I didn't mention this before: there's this whole business about ordering, like you saw Relaxed and Release before. Ignore all of that; it's a whole other dimension that you just shouldn't care about right now. So, putting that aside: if you want to load a piece of data out of a pointer that's managed by this system, you have to have a guard.

You have to have proof that you're in the epoch scheme; that's this borrowed Guard argument. And what you get back when you load is a Shared pointer to the data that you were trying to load, but that Shared pointer has a lifetime that's tied to the borrowed guard that you passed in. So this is basically saying: I'll give you a pointer back, like a snapshot that you might want to dereference later, and that snapshot is valid for as long as the guard is. And that's exactly what we wanted.

Because, as I said before, the key invariant is: once you've entered an epoch, once you've pinned it, nothing you can see could be deallocated until you exit. And that's exactly what this captures. It's saying: for as long as that guard is valid, until the point you've exited, it is safe to read this snapshot that you've just taken out. Okay, and then for storing, you don't actually need this guard, because you're not getting back something that you are going to dereference.
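Putting those pieces together, usage looks something like this. This is modeled on the 2016-era `crossbeam::mem::epoch` module described in Aaron's blog post; treat the exact paths and signatures as approximate, since the modern crossbeam-epoch API differs:

```rust
use crossbeam::mem::epoch::{self, Atomic};
use std::sync::atomic::Ordering::Acquire;

struct Node<T> {
    data: T,
    next: Atomic<Node<T>>,
}

fn peek_front<T>(head: &Atomic<Node<T>>) -> bool {
    // Pin the current epoch: the Guard is our proof of participation.
    let guard = epoch::pin();
    // load() borrows the guard, so the Shared<'a, Node<T>> it returns
    // can be dereferenced safely for as long as the guard lives.
    head.load(Acquire, &guard).is_some()
}
```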
G: [question not transcribed]

A: You might get into this later, and I'm not sure if you'll cover this or not, but does having an outstanding loan prevent epochs from advancing?

E: Yes. As long as you are holding on to a guard, no memory is being freed. But the thing is, lock-free data structures are non-blocking data structures; they're data structures where you expect to have a quick interaction and then you're out, you're done. There are these tight little retry loops, so most of the time you're sort of going in and out of epochs very quickly, and so you expect not to hold up the process of collection very often.

It shouldn't take that long for all active threads to cycle out of the operation they're doing. You could get it wrong, obviously, but if you do, you have bigger problems, basically. Okay, so there's one more operation, and this is the one place that you get unsafety in Crossbeam, which is: at some point you have to say, I believe that this piece of data is ripe for deallocation.

The thing is, you know, when you're using these pointers, you can create a big nest of aliased data. There's no way the library can know when you've actually unlinked a piece of data from the data structure and it should go into the purgatory. You have to say when that is, and you can get that wrong. The name here, unlinked, is meant to imply that you're making an assertion when you call this thing. This is safe to call when you have a shared pointer

that you know is no longer reachable from the data structure. But that's generally not very hard to figure out: there's almost always an obvious point where you've now successfully popped, and that's when you unlink, as we'll see on the next slide. Okay, so here is the pop code I showed you before, but now using Crossbeam. It looks pretty similar, except the first thing we do, before we enter the retry loop, is we actually pin the epoch, and we get a guard back.

[question not transcribed]

E: Yeah, and I haven't exhaustively tested to see which one is faster. It probably depends on exactly what you're doing, but either is valid; this is sort of the simplest thing to do. So here's the interesting thing: now we're coming in, we're taking this same snapshot of head that we took in the original algorithm, so either we'll get that there was some head, or it was empty. The empty case is the same.

If there was some head, now we have a snapshot in our hands, and just as before, we can dereference right through that snapshot, totally safely, no problem. So we pull the next pointer out of that snapshot; that's fine. We try to do our compare-and-swap, or cas_shared here; and by the way, the API is slightly different from the one I showed you before: here it just returns a boolean, was it successful or not. So if it's successful, then we enter a bit of unsafety, which is just to mark

the node as unlinked and then extract the data, and that's it. Okay, so a key observation here is: I walked you through this complicated description of this algorithm, but as a user of the library, you don't have to know any of that stuff. All you need is this epoch pin, the guard, and unlinked, and that's basically it.
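For reference, the Crossbeam version of pop looks roughly like this, adapted from the "Lock-freedom without garbage collection" blog post this part of the talk is based on, with the same hedge about exact API details as above:

```rust
use std::ptr;
use crossbeam::mem::epoch::{self, Atomic};
use std::sync::atomic::Ordering::{Acquire, Relaxed, Release};

pub struct Stack<T> {
    head: Atomic<Node<T>>,
}

struct Node<T> {
    data: T,
    next: Atomic<Node<T>>,
}

impl<T> Stack<T> {
    pub fn pop(&self) -> Option<T> {
        // Pin once, before the retry loop.
        let guard = epoch::pin();
        loop {
            match self.head.load(Acquire, &guard) {
                None => return None, // observed an empty stack
                Some(head) => {
                    // Safe dereference through the snapshot.
                    let next = head.next.load(Relaxed, &guard);
                    // cas_shared returns a bool: did we win the race?
                    if self.head.cas_shared(Some(head), next, Release) {
                        unsafe {
                            // Assert the node is unreachable, then move
                            // the data out of it.
                            guard.unlinked(head);
                            return Some(ptr::read(&(*head).data));
                        }
                    }
                }
            }
        }
    }
}
```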
E
The
code
that
you're
writing
looks
as
if
you
were
programming
in
a
garbage
collected
language
right,
you
can
take
an
algorithm
from
Java,
poured
it
over
to
rust,
sprinkle
in
a
few
on
links
at
key
points
and
you're
good
to
go,
and
you
get
performance,
that's
better
than
Java
when
you
do
so
right.
So
that's
that's
the
sort
of
essence
of
the
memory
management
story
in
and
cross
beam.
Okay,
I'm
going
to
wrap
up
here.
E
So
let
me
just
say
a
few
things
about
what
else
is
going
on
in
crossbeam,
so
the
bulk
of
the
implementation
right
now
is
the
the
epoch
algorithm,
an
API
that
I
showed
you
there's
also
two
multi
producer,
multi
consumer
queues.
One
of
them
has
a
blocking
mode
as
you.
If
you
want
to
block
when
no
data
on
the
queue
is
available.
There's
also
work
stealing
duck
thanks
to
Alex
crane
and
there's
a
scope
thread,
abstraction
which
is
kind
of
random
thing.
It's
a
little
different
than
the
rest
of
these
okay.
E
So
that's
that's
what's
in
it
today,
but
what
I
would
like
to
do
over
time
is
like
I,
said
at
the
beginning,
really
make
crossbeam.
You
know
a
strong
utility
library
that
offers
a
number
of
concurrent
data
structures
in
this
vein,
so
the
most
important
one
to
tackle,
of
course,
is
concurrent
maps
in
current
hash
maps.
In
particular
that's
a
massive
implementation
challenge,
but
in
rust
it's
also
very
interesting,
API
challenge
for
a
fun
exercise,
think
about
taking
the
current
sequential
API
for
hash
map
and
think
about
how
you'd
have
to
change
it.
E
If
you
want
to
concurrent
access,
it's
less
obvious
than
you
might
think,
there's
lots
of
subtlety
there,
so
I'm
really
eager
to
experiment
in
this
space.
But
there
are
lots
of
other
kinds
of
collections
sets
in
bags
that
you'd
like
to
have
as
well
synchronizers
as
I
mentioned
before
and
then
sort
of
looking
far
into
the
future
going
higher
level
and
actually
offering
ways
of
like
doing
composable
concurrency.
E
So
in
my
PhD
work
I
did
something
called
reagents,
which
is
basically
these
kinds
of
block
free
data
structures,
but
with
a
way
that
you
can
compose
their
operations
together,
so
say:
atomically
either
do
this
or
that-
and
I
would
love
to
do
that
someday
and
cross
beam,
but
there's
a
lot
of
work
before
we
get
to
that
stage.
Yeah.
So
that's
that's!
Basically
it
I'm
happy
to
take
more
questions.
E
E
F
D
E
Same
way,
garbage
collection
does
it's
beautiful,
so
basically
so
the
way
that
right.
So
let
me,
let
me
make
sure
everybody
understands
the
the
problem
here
right.
So
the
issue
is,
if
you
take
a
snapshot,
you
have.
You
have
a
pointer
in
hand
now
you're
going
to
do
a
compare
and
swap
based
on
that
point
or
later,
but
it
might
be
in
the
meantime
that
it's
been
deallocated
reused
for
a
different
purpose
and
relinked
back
into
the
data
structure.
Now
your
operation
succeeds
and
it
should
have
failed
and
you've
ended
up
with
something
focus
right.
E
This
can't
happen
with
the
epoch
scheme,
because
that
you
just
read
that
pointer
out
of
the
data
structure
and
you
have
an
epoch
pinned
and
the
basic
invariant
says
that
data
cannot
be
freed
until
you
exit
the
epoch.
So
it's
guaranteed
that
that
address,
won't,
be
reused
or
recycled
or
put
back
into
the
data
structure
until
you've
exited,
at
which
point
it
doesn't
matter,
you're,
not
you're,
no
longer
about
to
do
this.
Cass
and
the
same
thing
happens
in
garbage
collected
languages
like
Java,
basically
they're.
E
Yes,
correct
and
I
yeah
I
think
there's
actually
no
way
to
safely
do
so
and
I,
don't
think
you
really
want
to
it
doesn't
fit
what
you,
what
you're
generally
doing
with
these
data
structures.
J
A
E
That's a great question. I think the cost of fiddling with the epochs is basically trivial; in that case you're just incrementing a counter (yeah, yeah, exactly), and an uncontended CAS is, you know, pretty fast on modern Intel chips. So that's one potential source of overhead. The other is that you have this purgatory: instead of just freeing memory directly, you're actually accumulating memory to free later, and there's some allocation involved there.
E
But generally, if you're following this scheme I mentioned earlier, where you do the freeing every time you get up to a certain size, you just keep reusing that same purgatory list over and over, so the allocation is amortized. So I think it's not too bad, but it depends; it basically comes down to how fast an uncontended CAS is on your architecture.
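Here's a hypothetical sketch of that amortized scheme (the threshold, names and layout are made up for illustration; Crossbeam's real internals differ):

```rust
// A per-thread purgatory: nodes that have been unlinked from a data
// structure but may still be visible through someone's snapshot.
struct Purgatory {
    garbage: Vec<*mut u8>,
}

const COLLECT_THRESHOLD: usize = 64; // assumed tuning constant

impl Purgatory {
    fn defer_free(&mut self, node: *mut u8) {
        self.garbage.push(node);
        if self.garbage.len() >= COLLECT_THRESHOLD {
            self.collect();
        }
    }

    fn collect(&mut self) {
        // Epoch check elided: this only runs once no thread is still pinned
        // in the epoch where these nodes were unlinked. drain() keeps the
        // Vec's capacity, so the allocation cost is amortized over many frees.
        for _node in self.garbage.drain(..) {
            // unsafe { deallocate(_node) } (elided)
        }
    }
}
```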
E
So where was that? Ah yeah, it was way earlier. Okay, yeah, so phasers are another one where I'm very inspired by java.util.concurrent; I basically spent my PhD worshiping java.util.concurrent. It's an abstraction there that is sort of similar to a barrier, but it allows a much greater range of operations.
E
You can talk about threads coming and going: basically choosing to participate in the barrier and then changing their mind, so that their count doesn't matter anymore. This kind of thing is actually used to implement some kinds of work-stealing schemes and other kinds of parallelism support. So it's a building block; it's not that important.
E
Neither; so in Crossbeam there's a per-thread purgatory. Well, there's primarily a per-thread purgatory, which is shared amongst all data structures for that thread. There also has to be a global purgatory as well, because a thread can be in the process of shutting down and basically fail to advance the epoch far enough to be able to actually reclaim that data; to avoid leaking, it has to flush to the global purgatory.
D
E
J
I
F
H
H
E
So funnily enough, it's called Magic Move, in Keynote. Basically you have two slides next to each other that are mostly the same, and some things are different, and it automatically animates between them. It's wonderful.
H
Like, there's just a method that's called compare_and_swap, and you said in the beginning that your intention was to create a performant set of primitives for, you know, parallel operations and data structures. So to what extent does the compare_and_swap method actually use the underlying instruction set, like cmpxchg for x86 or something like that, and just build on top of that? Because I think x86 has things like cmpxchg, and all these others are hand-over-hand locking or skip lists and so on.
E
Okay, so for the first part of that question: compare_and_swap here is just a direct map to the basic hardware instruction, so there's nothing really interesting to say there, I think. For more specialized operations (I mean, fetch-and-increment is also a really important one) there's no fundamental reason that they can't be incorporated. I haven't really thought about incorporating blocking into this space; I don't think there's anything too difficult there.
E
Basically, I don't think the Crossbeam API would have much interesting to say there. The key thing at the end of the day is to know what memory potentially has these concurrent snapshots that you have to worry about, and to know when you can mark it unlinked, when you know for sure that it's no longer reachable from the data structure. However you get yourself into that situation, via partial locks, some mixture of compare-and-swaps, or other operations, it doesn't really matter; the algorithm should still work. I see.
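For reference, this is what that direct mapping looks like with std's atomics; compare_exchange (compare_and_swap in the Rust of this talk's era) compiles down to the hardware's CAS instruction, e.g. LOCK CMPXCHG on x86:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// A CAS retry loop: the usual shape for lock-free updates.
fn increment(counter: &AtomicUsize) {
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        match counter.compare_exchange(
            current,
            current + 1,
            Ordering::SeqCst,  // ordering on success
            Ordering::Relaxed, // ordering on failure
        ) {
            Ok(_) => return,
            // Another thread won the race; retry with the value it wrote.
            Err(actual) => current = actual,
        }
    }
}
```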
H
There's a book on parallel programming by, I think, some guy in Israel, and it has a bit saying that on sequentially consistent architectures like x86, versus ARM or weakly consistent architectures, it's important to use the underlying microcode, or rather the instruction that's responsible for handling parallel access or locking, even in a case where you'd normally use a lock-free data structure for performance reasons. But I read it a long time ago, so I don't really remember the details.
E
E
E
Let me summarize it by saying: in the C and C++ world, with C11 and C++11, there was an attempt to create this memory model that has these kinds of memory orderings, like relaxed, release, acquire, sequentially consistent and so on. The point of this was to not do what you're saying; basically, to let you program at a higher level of abstraction, where you can say: I'm doing a read or a write, and I need this amount of information about what's going to be published to other cores.
E
Basically: is there going to be a global order over all reads and writes for this, or is it okay if there's something more subtle going on? Then the compiler will choose the best underlying operation on that piece of hardware to do it. On x86, for example, sequentially consistent operations require fencing (a memory barrier), but any of the other orderings don't. So it's actually really to your advantage to think carefully about when you can get away with something more relaxed. Right, but whatever you do...
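As a small illustration of picking an ordering, here's the classic publish pattern with std's atomics: the Release store pairs with the Acquire load, and on x86 neither one needs a fence, unlike a SeqCst store:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let data = Arc::new(AtomicUsize::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let (d, r) = (data.clone(), ready.clone());
    let producer = thread::spawn(move || {
        d.store(42, Ordering::Relaxed);   // plain write to the payload
        r.store(true, Ordering::Release); // publish: writes above become visible...
    });

    // ...to any thread that Acquire-loads the flag and sees `true`.
    while !ready.load(Ordering::Acquire) {}
    assert_eq!(data.load(Ordering::Relaxed), 42);
    producer.join().unwrap();
}
```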
E
K
A
K
Okay, hi, I'm Huon. I'm a volunteer on the Rust team, and I'll be talking about my library, simple_parallel. There's a link at the bottom (if you want to type it in, type it in quickly) which has got various links and the slides and so on.
K
Alright. So what is simple_parallel? It's designed to sort of be a little bit like Rayon, except worse. It's designed to be a really, really easy way to add a whole pile of parallelism into your code without having to think about it at all. So the main API has got these three functions: for, map and unordered_map, and I guess the best way to understand what they do and how the library works is to look at an example.
K
So for this example I've got a program here. It's kind of long, but the main point is: it takes a list of files as its command-line arguments, which are images, and then it will resize those images to be 400 by 400 pixels and write them out. But of course the image-resizing bit isn't particularly interesting, so let's just focus on the loop in the main function, which is just this bit. So, as I said, it's got this files iterator, which is the list of command-line arguments.
K
It will run through it and resize each of those images. This is very sequential; it's only using one core, and, like, my laptop has eight cores, and even most phones nowadays have more than one core. So this is a waste of resources, and instead we should use parallelism: threading, for example. So I'm using the great scoped_threadpool library here, which is good: it gives you precise control over exactly what you want to run where, but it's kind of annoying to use.
K
So first you have to import the crate, and then you have to spawn a pool, and then you have to avoid the problems with destructors by using this scoped function, and then, finally, you can run over the images and spawn a job to resize each of them. So it's kind of annoying; it's a lot more work than just writing a for loop.
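Those steps look roughly like this with the scoped_threadpool crate; resize_image here is a stand-in for the example's resizing code:

```rust
extern crate scoped_threadpool;

use scoped_threadpool::Pool;

fn resize_all(files: &[String]) {
    // Spawn a pool of worker threads.
    let mut pool = Pool::new(8);
    // The scoped call joins every job before returning, which is what
    // makes borrowing `files` from the caller's stack safe.
    pool.scoped(|scope| {
        for file in files {
            scope.execute(move || resize_image(file));
        }
    });
}

fn resize_image(_file: &str) {
    // resizing elided
}
```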
K
So what if, instead of just using normal threading, you could use magic threading? So here's simple_parallel: it's got this for function, and it's basically literally just a for loop, with an iterator and then a closure. That closure will be run on each of the elements of the iterator and do its work, preferably in parallel if you've got multiple cores. Obviously I keep saying it's fast, but is it actually fast? So I ran it on a directory of images.
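The magic-threading version then looks roughly like this. In code the function is spelled for_, since for is a keyword; treat the exact signature as an assumption based on the talk:

```rust
extern crate simple_parallel;

fn resize_all(files: &[String]) {
    // Basically a for loop: each closure call may run on a pool thread.
    simple_parallel::for_(files.iter(), |file| {
        resize_image(file);
    });
}

fn resize_image(_file: &str) {
    // resizing elided
}
```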
K
The sequential version took 41 seconds; scoped_threadpool and simple_parallel both took about 10 seconds. So we're seeing it's more than four times faster on a machine with four physical cores, or eight with hyperthreading. That's about what you'd expect for a computationally intense task like this.
K
So how does it work? This library is designed to be safe and easy to use, and so it tries to imitate a for loop as much as possible, which is why it takes an IntoIterator instead of just the Iterator trait. This means you can just pass in a vector, and it will automatically be converted into an iterator.
K
These iterators will then yield this Item type, which has to be Send, which makes sense, since it gets passed across threads; that's how the parallelism works. Then the function closure is a function that takes the item from the iterator, and it has to be Sync, which means it can be called concurrently on multiple threads. So that signature basically describes exactly what is going on in terms of parallelism and what safety constraints you need to satisfy.
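Here's a sketch of what a signature with those bounds could look like; it encodes the constraints just described, but it isn't the crate's verbatim declaration:

```rust
// Sketch only: the bounds tell the parallelism story.
fn for_<I, F>(iter: I, f: F)
where
    I: IntoIterator,
    I::Item: Send,         // items get handed off to pool threads
    F: Fn(I::Item) + Sync, // one closure, callable concurrently from many threads
{
    // Distribution over a thread pool elided; a sequential stand-in:
    for item in iter {
        f(item);
    }
}
```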
K
Of course, this makes more sense with a diagram. So here we've got the iterator values, which are the red squares, some long sequence of things; then at the bottom we've got a thread pool, in this case just two threads, and each of those has a reference to our closure, our Fn closure.
K
So, to start off, first we'll take the first two elements of the iterator and pass them off to the thread pool; the functions will then be called concurrently, running in parallel. When one finishes, the next element will be pushed onto the thread pool, and so on and so forth until the iterator has been exhausted. At this point the for function will return and we'll have finished our for loop, having hopefully run the jobs very efficiently and quickly. And so now: sharing and safety.
K
Like Rayon, and like Crossbeam's scoped threads, this is designed to offer flexible sharing, so you can have references that point into your main thread's stack and manipulate them freely without worrying about problems. So in this case we've got our array, data, of ten elements, and we run a for loop across those ten elements, adding this outside variable to each of them. This would result in data having one, two, three, four, five, six, seven, eight, nine, ten in it after the for loop returns. So this is perfectly safe.
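In code, the example described here looks something like this (simple_parallel-style API, with the same for_ spelling assumed as before):

```rust
extern crate simple_parallel;

fn main() {
    let outside = 1;
    let mut data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
    // Each closure call receives a unique &mut element, so nothing aliases.
    simple_parallel::for_(data.iter_mut(), |elem| *elem += outside);
    assert_eq!(data, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]);
}
```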
K
The fact that we're using an iterator, and that simple_parallel is fairly careful, means that each of those elements won't alias: there are no data races, and there's no mutation of variables that shouldn't be mutated. This particular example wouldn't be very efficient because of false sharing and so on, but you know it's safe, so that's nice!
K
What about if we made a mistake, if we'd flipped around outside and elem? So instead of adding outside to each of the elements of the array, we'd be incrementing outside each time. If it ran properly, this would theoretically result in outside having the value 46, or 36, something like that.
K
But it shouldn't run, because it's a data race: there's no synchronization in the mutation of outside. So if we're running this in parallel, we've got multiple threads modifying this one memory location with no synchronization, which is literally the definition of a data race. This is undefined behavior in Rust and can lead to arbitrary problems, let alone not giving the expected answer.
K
Fortunately, simple_parallel protects against this problem. This is the error the compiler emits: it's saying that you can't assign to the captured data, because we've got an Fn closure, which is telling us that we've got a problem here. We can't mutate things we've captured; we can only mutate the things that come in through the closure argument, except if we've got a mutex or, like, an atomic variable. Of course, if you add your own synchronization, then it's fine.
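And one way to make the flipped version legal, as mentioned, is to add your own synchronization; here's a sketch with an atomic (a Mutex would work too), again assuming the for_ spelling:

```rust
extern crate simple_parallel;

use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    let outside = AtomicUsize::new(1);
    let data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
    // Synchronized mutation of captured state is allowed.
    simple_parallel::for_(data.iter(), |elem| {
        outside.fetch_add(*elem, Ordering::SeqCst);
    });
    assert_eq!(outside.load(Ordering::SeqCst), 46);
}
```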
K
But iterators can also do more than just a for loop. So instead of giving a useful error message about any failures to resize an image, how about we just tell the user how many errors occurred? In this case we map the resizing over the iterator, then we filter out the ones that are errors and count them. This will again run sequentially, and it takes about 40 seconds.
K
It's not particularly different to the for-loop version, but with simple_parallel we can parallelize the map. We do the same setup with Crossbeam, but then in the middle we've got the map: it takes files and a closure, and returns an iterator that we can then filter and count, and we'll get the same answer, except faster.
K
The reason we have to have the Crossbeam scope is for exactly the same reason that std::thread::scoped had to be deprecated and eventually removed: we can't rely on destructors to run, which would allow us to sort of leak references and have them still accessible even after the data they point to has been destroyed. It's somewhat annoying, and it makes the syntax here obviously much worse, but those are the pains we take for safety.
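Put together, the parallel version has roughly this shape. A Pool::map that takes a Crossbeam scope matches the talk's description, but treat the exact signatures here as assumptions:

```rust
extern crate crossbeam;
extern crate simple_parallel;

fn count_errors(files: &[String]) -> usize {
    let mut pool = simple_parallel::Pool::new(8);
    // The crossbeam scope guarantees the pool's borrows can't outlive it,
    // which is the promise std::thread::scoped could no longer make.
    crossbeam::scope(|scope| {
        pool.map(scope, files, |file| resize_image(file))
            .filter(|result| result.is_err())
            .count()
    })
}

fn resize_image(_file: &str) -> Result<(), String> {
    // resizing elided
    Ok(())
}
```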
K
One subtlety with this function is that it will yield the elements in the order of the input iterator. So if we have files A, B and C and we map over them, we'll get the resize results (the result of the resizing) for A, then B, then C. But it might be that, say, A is much larger than B and C, so its resizing takes a lot longer, and it might be more useful to get the errors for B and C before A finishes. That's what unordered_map does.
K
K
This is fine for many examples, such as this one where we're just counting, and so in this case using unordered_map might be slightly nicer than using map, although both of them take about 10 seconds. So I've sort of given a high-level overview of the API, and I'll now summarize the ugly bits, the bad bits and the good bits of the API, so I can end on a positive note. The ugly bits: really, the internals are pretty ugly.
K
You don't want to look at them at all; there are channels going everywhere and threads being spawned when possibly they shouldn't be. The API it exposes is nice, but it results in quite a lot of performance overhead. And then there's confusion with panics: if a job panics, if one of the closures you pass to for or map panics, then you'll usually end up with double panics and your application will abort.
K
K
You've got about half the work going to one thread and half going to the other. But if you're using just an Iterator bound, you can only steal elements off the front, and so this is much slower, with much, much more contention, if you're trying to do it concurrently. However, specialization should allow us to pull back some of that overhead for iterators and types where you can take work off more efficiently.
K
So if you've got a tree of work, where one job will generate two, and each of those will generate two more, and so on and so forth, like the quicksort example that Nico was using, then simple_parallel probably isn't the right library: there's far too much overhead for this, and it doesn't use fancy work-stealing data structures internally and so on. However, Rayon exists, so you don't need simple_parallel for these things, which means it's not such a downside.
K
However, if you've got a flat, embarrassingly parallel workload, then simple_parallel is great, especially when those work pieces are quite large. As I said, there's quite a lot of overhead, but if you're doing something like resizing an image or downloading something off the internet, then you'll easily be able to parallelize it with simple_parallel and get big benefits from it.
K
And, of course, as the name suggests, it's simple: you can almost use it by doing text substitution. Almost; well, adding a few parentheses as well. But that's the high-level overview of simple_parallel: easy parallelization without much control, designed for sort of big blocks, big chunks of work and so on, not fine-grained work. I think that's all the stuff I've got. Are there any questions? There's the link as well to the slides and the code and so on. All right.
A
Any questions? I don't see any. Thank you, Huon, thank you very much, and thank you all for coming tonight. It was great having you all, and I will see you in February. I think we have the room for a bit longer, so come and mingle. Sorry, internet people.