Description
Tom comes from a background of maths & statistics, getting his start in programming in Python and R for ML and statistics workloads.
He found Rust a few years ago whilst looking for a new language to learn and now uses it at work and for side-projects.
Alrighty, cool: this is a little talk I like to call a foray into thread per core programming, or architecture, something I discovered semi-recently and thought was just exceedingly cool, and when my friends volunteered me for a talk I was like, yeah, sure, I'll talk about that, that seems neat. Cool, yeah. So this is a foray into thread per core programming. I wanted to put a Zoolander meme in here, but I wasn't sure about copyright. Alrighty, let's just set the scene a little bit, a little bit of background.
Lots of applications, probably lots of ones you're already writing or have already used, do quite a lot of stuff. There are probably a lot of tasks that have to be done, and a reasonable degree of parallelism: do some work here, and when that's not being done, drop back to something else. Something like serving requests, which is latency sensitive, but at the same time you want to run some background script to clean things up, or vacuum some table, etc. Lots of reasonable degrees of concurrency, all familiar to us. Also, core counts on CPUs, or in the case of something like Lambda, go up; the number goes up every year. It's like CPU manufacturers keep saying, and here's something with even more cores in it.
And you're like, cool, yep, that's great. And Rust is nice and speedy, and it's got some cool parallelism things: sharing XOR mutability, Rayon, etc. Work stealing, that's cool! Could we do more with that? Probably, like get more out of these cores?
Okay, oh, and I missed the one about NVMe devices. A few years ago, obviously, storage would have been slow, and the advice was don't serialize that to disk, because disk is slow, blah blah blah. That gap is now somewhat smaller: with things like NVMe drives you can get commodity hardware doing something like seven gigabytes a second of reads. It's nuts. Anyway, what does this lead to? This is going towards kind of unlimited power.
If you've got an NVMe drive, and you've got a modern Linux, and you've got like 32 cores on your Threadripper, you can do a lot, right, if you can harness all of those. Cool, so, introducing the concept of thread per core programming. Pretty straightforward: it's basically, you have a thread per core. Terribly uninteresting in and of itself, but that's really only the start of it.
The underlying theme here is that we want to divvy up our application. Yeah, we want to divvy up our applications so that we separate the incoming work into as many independent shards as we possibly can, and we don't really want them to communicate too much, because that would mean waiting in queues, or waiting for locks and mutexes and stuff, and, as evidenced by the next slide, welcome to the inconvenience queue, because waiting for things is boring and bad.
So let's not do any of that. So we can do some things: we can shard our incoming data, and then we can hand that off to a thread, and we can pin the thread to a CPU, so your operating system won't then punt your thread and shuffle it around. That's great for your instruction cache locality and your data cache locality, because the next time your thread runs it won't be, oh, and by the way, here, let me reload all of your data back into the cache. If you don't need to spend time waiting on that, that's time you can spend doing productive work, right? That's good. And not shuffling stuff around, which I already covered: good for throughput and latency.
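Roughly, the shape of that is one pinned OS thread per core, each owning its own shard. Here is a minimal sketch of the idea, assuming the `core_affinity` crate for the pinning (the talk itself uses a framework for this, introduced below, and `do_shard_work` is a made-up placeholder):

```rust
use std::thread;

fn main() {
    // Ask the OS which cores we can run on.
    let cores = core_affinity::get_core_ids().expect("couldn't query core ids");

    let handles: Vec<_> = cores
        .into_iter()
        .map(|core| {
            thread::spawn(move || {
                // Pin this worker to a single physical core so the scheduler
                // won't shuffle it around and evict our caches.
                core_affinity::set_for_current(core);

                // Each worker owns its own shard of the data and never needs
                // locks to talk to its siblings.
                do_shard_work(core.id);
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}

// Placeholder for whatever per-shard work the application does.
fn do_shard_work(shard: usize) {
    println!("worker for shard {shard} running");
}
```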
If you follow this architecture, you can cut your tail latencies. I think it was Microsoft who did a study showing, basically, that if you can cut the worst of your tail latencies, you can end up reducing your main application's serving latency, which is cool because it's kind of counterintuitive: you'd think, oh, make the fast parts go fast, and that's how I get my application to go faster. Not quite. What else? ScyllaDB, the Cassandra drop-in, uses thread per core via a library called Seastar, and Redpanda, which is a Kafka drop-in, is also written in C++. I know,
this is a Rust talk and those are C++, but bear with me: it also uses thread per core for reducing some of that tail latency and getting more out of their machines. Cool, that sounds great. I've really sold it, lots and lots of marketing talk. How do we get started? How do we do anything with this? Cool: introducing thread per core programming, and the way to start is, you go to your local Spotlight and you find the thread of your choice.
No? Okay, yeah, moving on. Seriously, oh God. Seriously, though, there is a very cool framework called Glommio (I don't know how to pronounce it) that gives you thread per core functionality and some async functionality for your threads, along with, if your Linux kernel is recent enough, a thing called io_uring for async I/O, which is really cool, and Direct I/O. If you have an NVMe drive, Direct I/O lets you skip the file system cache on the way in and out, which means you can write more or less directly to your NVMe drive and read directly from your NVMe drive, which means noisy applications don't slow you down: you don't have to wait for the page cache to flush, and you don't get held up by other things going on. Yeah, very handy, very cool, very, very modern, and it also has some really cool scheduling features.
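For a flavour of what that looks like, here is a rough sketch of a single Glommio executor pinned to a core doing a Direct I/O read. The names (`LocalExecutorBuilder`, `Placement::Fixed`, `DmaFile`) follow Glommio's documentation, but exact signatures vary between versions (older releases pinned via a `pin_to_cpu` builder method), so treat this as an illustration rather than the project's code:

```rust
use glommio::io::DmaFile;
use glommio::{LocalExecutorBuilder, Placement};

fn main() {
    // One executor, pinned to physical core 0, driven by io_uring.
    let handle = LocalExecutorBuilder::new(Placement::Fixed(0))
        .name("io-worker")
        .spawn(|| async move {
            // O_DIRECT-backed file: reads and writes bypass the page cache
            // and go (more or less) straight to the NVMe drive.
            let file = DmaFile::open("segment.bin").await.unwrap();
            // Offsets and sizes should be aligned to the device's block size.
            let chunk = file.read_at(0, 4096).await.unwrap();
            println!("read {} bytes via Direct I/O", chunk.len());
            file.close().await.unwrap();
        })
        .unwrap();

    handle.join().unwrap();
}
```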
Is anyone familiar with control theory in engineering? Yes? Yeah. Each thread gets its own scheduler, obviously, because each thread has a local async executor, and these controllers that are attached to your thread are powered by control theory. So you can go, oh, that's cool, I would now like separate async task queues, and maybe for one task queue the latency doesn't matter and for another task queue the latency does matter. The tasks can specify their latency, and the scheduler on your thread will be like, oh, cool, there is no work in my latency-sensitive task queues, I can just, you know, continue ticking through my non-latency-critical work; and then, as work comes in from the latency sensitive task queues, it will shunt your latency insensitive tasks out of the way and be like, sorry, these things have to complete first.
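In Glommio terms that means per-thread task queues with a latency requirement attached. A hedged sketch from the documented API (`create_task_queue`, `Latency`, `Shares`, `spawn_local_into`); the shares, durations, and queue names here are made up for illustration, and the exact calls may differ between versions:

```rust
use std::time::Duration;

use glommio::{executor, Latency, LocalExecutor, Shares};

fn main() {
    let ex = LocalExecutor::default();
    ex.run(async {
        // Latency-sensitive queue: the scheduler preempts other queues to
        // keep this one's tasks inside roughly 10ms.
        let serving = executor().create_task_queue(
            Shares::Static(1000),
            Latency::Matters(Duration::from_millis(10)),
            "serving",
        );
        // Background queue: latency doesn't matter, it soaks up whatever
        // CPU time is left over.
        let background = executor().create_task_queue(
            Shares::Static(100),
            Latency::NotImportant,
            "background",
        );

        let fast = glommio::spawn_local_into(async { /* answer a request */ }, serving).unwrap();
        let slow = glommio::spawn_local_into(async { /* compact a segment */ }, background).unwrap();

        fast.await;
        slow.await;
    });
}
```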
It all gets nice and logical, very easy to get your head around, because you know exactly what's going on. The other neat thing is that, because these async executors are all thread local, your futures and what you await no longer have to be Send and Sync, because they never leave the thread. You can have thread-unsafe things now, because it's like, no, this is fine, it's not going anywhere, your ownership's all good. Cool.
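A tiny sketch of what that buys you: a future holding an `Rc<RefCell<...>>` (neither Send nor Sync) across an await point, spawned onto a thread-local executor. Glommio's `spawn_local` is assumed here; any single-threaded executor behaves the same way:

```rust
use std::cell::RefCell;
use std::rc::Rc;

use glommio::LocalExecutor;

fn main() {
    let ex = LocalExecutor::default();
    ex.run(async {
        // Rc and RefCell are !Send and !Sync, and that's fine here: this
        // state never leaves the thread it was created on.
        let hits = Rc::new(RefCell::new(0u64));

        let task = glommio::spawn_local({
            let hits = hits.clone();
            async move {
                *hits.borrow_mut() += 1;
            }
        });
        task.await;

        println!("hits: {}", hits.borrow());
    });
}
```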
So, putting it together, I've started a little project called Tarkhein, because it's a forest and I thought it was cool. It's a little reverse text search application.
A
It
uses
these
things
called
percolate
style
queries
traditionally
in
a
full
text
search.
You
would
persist
your
documents,
you
would
index
them
and
then,
when
you
make
a
query,
the
query
is
ephemeral
and
you
kind
of
you
look
through
all
your
documents
and
you
find
things
that
matches
and
you
come
back
at
that
point
in
time
with
the
said:
results,
percolate
style,
text
search,
works,
the
opposite.
A
You
store
your
queries
and
you
stream
the
documents
through
it
and
you
then
sort
of
you
build
up
a
persistent
set
of
results
and
you
can
like
you,
can
notify
on
change
or
you
can
notify
on.
You
know,
you've
got
more
hits
or
the
the
search
order,
change,
etc,
etc.
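In toy form (plain Rust, no search library, purely to illustrate the inversion, not the project's implementation): the queries live for the whole run and the documents stream past them:

```rust
use std::collections::HashMap;

/// A stored query: here just a set of terms that must all appear.
struct StoredQuery {
    name: &'static str,
    terms: Vec<&'static str>,
}

fn main() {
    // Register queries up front; they persist for the whole run.
    let queries = vec![
        StoredQuery { name: "rust-io", terms: vec!["rust", "io_uring"] },
        StoredQuery { name: "storage", terms: vec!["nvme"] },
    ];

    // Persistent result sets, one per stored query.
    let mut results: HashMap<&str, Vec<usize>> = HashMap::new();

    // Stream documents through the stored queries.
    let docs = [
        "rust and io_uring are a nice pair",
        "nvme drives read at seven gigabytes a second",
    ];
    for (doc_id, doc) in docs.iter().enumerate() {
        for q in &queries {
            if q.terms.iter().all(|t| doc.contains(t)) {
                // A hit: append to that query's result set (this is where
                // you'd notify subscribers about the change).
                results.entry(q.name).or_default().push(doc_id);
            }
        }
    }

    println!("{results:?}");
}
```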
It's useful if you want to search a lot, a lot, a lot of data and you have a reasonable idea of what you're looking for, or you would like to rerun the query lots and lots and lots of times and you don't want to sit there waiting for it to trawl through, you know, a 150 gig index every time you search. So I thought that made for a fun problem and would make a reasonably good fit for this. Cool.
Yes: threads, and very little communication between them. We can shard up that data quite easily. It's an unsolved problem at the moment (bearing in mind this project is all of two weeks old) whether the best fit is that I shard the queries and pass all the data through all the threads, or I shard the data and pass the queries through; whether I do it one way or the other way around.
But that's why it's a foray into thread per core programming, and not an explanation of its virtues and maxims. We also care about maximum utilization of all those resources, because the throughput of the documents is kind of paramount, right? You've got, you know, hundreds of terabytes of stuff you want to trawl through; you don't want to be waiting an unnecessarily long time just for each single one to complete. I think there's a search engine called Manticore;
they have this as a search functionality, they have this as a feature. Elasticsearch has it as a feature, and in their documentation they had a sort of shootout, and so my goal is to be able to beat the Elasticsearch throughput for percolate queries, which I think is several thousand documents a second. So fingers crossed I can get there.
You want to keep the write and read queues for those drives more or less as full as possible, so they're always doing nice page-size chunks of work, and this is a good fit for that, right, because if you're streaming something through, you've probably got hits occurring, and so you want to feed those off to your drive as quickly as possible, without that blocking the time spent searching new docs, yeah.
That's the other advantage: if you're just waiting for the drive to complete, that's dead thread time, and we don't really want that. We could spin out a new thread to do it, but if you do that too much you risk CPU oversubscription, because you have too much thread contention, and your OS scheduler is like, hey man, you've got a lot of threads, so I'm just going to start shunting things off, because this thing completed in the middle of this other thing, and you're like, no, that one was doing work! Which is obviously bad if you want to focus on throughput and you're optimizing your utilization.
Cool. I have some code samples here; my code's not great, fair warning. It occurred to me quite late that, oh, it's a Rust meetup, people would probably be interested in code samples, because crazy programming language, who knows. Sorry. Oh, there's a laser pointer, neat. Cool, yeah, it's pretty straightforward.
In my case, I spin out almost as many workers as I have cores: I think I have 16 cores in my computer, so I spin out most of them for the main indexing workers, I reserve a couple for a Tokio runtime to run my network server, and then, you know, leave some spare. The `Placement::Fixed` there is basically you telling the operating system and the CPU, hey, this thread's attached to this physical core, you can't bump it off, it has to run on there, which is what gives us the instruction cache affinity and the data cache affinity that's so useful.
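As a hedged reconstruction of that setup (Glommio names are from its docs; the exact core split and the use of the `num_cpus` crate are my illustration, not the slide's code):

```rust
use glommio::{LocalExecutorBuilder, Placement};

fn main() {
    let total_cores = num_cpus::get(); // assumption: counting cores via the num_cpus crate
    let tokio_cores = 2;
    let worker_cores = total_cores.saturating_sub(tokio_cores + 1); // leave one spare

    // Pinned indexing workers on cores 0..worker_cores.
    let workers: Vec<_> = (0..worker_cores)
        .map(|core| {
            LocalExecutorBuilder::new(Placement::Fixed(core))
                .name(&format!("indexer-{core}"))
                .spawn(move || async move {
                    // the per-shard indexing loop lives here
                })
                .expect("failed to spawn pinned executor")
        })
        .collect();

    // A small Tokio runtime for the network-facing side.
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(tokio_cores)
        .enable_all()
        .build()
        .unwrap();
    rt.block_on(async {
        // accept connections and push documents onto the workers' queues
    });

    for w in workers {
        w.join().unwrap();
    }
}
```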
Yeah, this one's also a little light, just a demo of the core indexing loop. It's kind of ugly, it's kind of blocking, it's not very good, but fundamentally it's quite a simple thing; there's a bit of everything there. These threads get their work off a lockless queue, a lock-free queue rather, coming in from Tokio, try to process it, and then spawn an immutable file builder, which is just a nice high-level interface over some direct I/O, as an async task. So whenever it gets those matches, it scatter-gathers them out over your drive, and then they just sort of complete in the background, and then the OS comes back and it's like, hey, all your stuff's there, and I'm like, cool, and then on we go.
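The simplified shape of that loop, as I read it (a fragment meant to run inside one of the pinned executors from the earlier sketch; the channel, types, and the `persist_matches` helper are illustrative stand-ins, not the project's actual code):

```rust
use std::sync::mpsc::Receiver;

struct Document { id: u64, body: String }
struct Match { query: &'static str, doc_id: u64 }

// Each pinned worker pulls jobs off a queue fed from the Tokio side,
// matches stored queries against them, and hands results to a detached
// async write task so the drive stays busy while matching continues.
async fn indexing_loop(jobs: Receiver<Document>) {
    // Blocking recv, mirroring the "kind of ugly, kind of blocking" loop in
    // the talk; a real version would use an async-aware queue.
    while let Ok(doc) = jobs.recv() {
        let matches = run_stored_queries(&doc);
        if !matches.is_empty() {
            // Fire off the write and keep matching; io_uring reports later
            // that the data landed.
            glommio::spawn_local(persist_matches(matches)).detach();
        }
    }
}

// Stand-in for the percolate matching step.
fn run_stored_queries(doc: &Document) -> Vec<Match> {
    if doc.body.contains("rust") {
        vec![Match { query: "rust", doc_id: doc.id }]
    } else {
        Vec::new()
    }
}

// In the real project this is a high-level builder over Direct I/O
// (scatter-gather writes to the NVMe drive); elided here.
async fn persist_matches(matches: Vec<Match>) {
    let _ = matches;
}
```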
Oh God. Yes, that is my talk. Are there any questions?
Oh, I haven't, but that sounds cool. Oh, sorry, yeah: have I explored using thread-local memory allocators yet? I haven't; at the moment my focus has been on, can I stitch this framework into my code and get it going. I believe there is some discussion on their Zulip chat about a sharding and thread-per-core aware allocator, and what the best option is to use for that, so we'll probably get there, but yeah.