Description
Stefan (https://twitter.com/dns2utf8) speaks about Atomic Counters and Cache Lines
How do atomic counters and cache lines interact with each other and why do they affect the performance of my programs?
About Stefan Schindler
Stefan has been working with Rust since 2015, is a member of the RustFest project, and is the maintainer of the threadpool and sudo crates, among many others.
Rust Linz at https://rust-linz.at
Twitter: https://twitter.com/rustlinz
Speak at Rust Linz: https://sessionize.com/rust-linz
Okay, hello, I'm Stefan. Sorry for the delay, I'm not used to this kind of presentation yet, so we'll see how that goes. Quick introduction, then we'll go over integer types, so this will be more of an advanced talk. Our MC kindly has an eye on the chat, so please enter your questions there. Then we'll go through some examples of, like, why we care about the nitty-gritty details in an actual application.
So that's one thing that gives you a speed boost of at least factor two, usually factor 40, sometimes even more. And that's important, because you don't have to optimize, right, you just tell the compiler: do the optimizing for me. Another thing I like to run is cargo watch; you can see it on the second line.
Cargo watch watches all the files in your project. It watches src and static and all the other folders, and whenever you change one, it waits a little bit, I think it's 0.5 seconds or something, then it runs these -x commands, and those are equivalent to you typing in cargo check, cargo fmt for format, and cargo run. And of course you can add quotes around the run, so it's cargo watch -x check -x fmt -x "run --release", and then it will run in release mode. Why is that important? Link time is really bad in Rust, still.
And then, after all that, there's your game logic; that's like 300 crates into the build process. So cargo check is really fast because it does not link your project: it checks the syntax, it checks most of the types, and then it tells you if you're okay or not, and that's great. Format keeps your code clean in the sense that it has one kind of format.
That's really great for working together with people. There's nothing worse than looking at someone's code and just not having a feel for where the control flow is going, just because they have some uncommon indentation and whatnot.
If you're coming from C-like languages or JavaScript, where brackets are optional around the if or the for loops or the while loops, then you know what I mean. It's horrible, right, because someone can write stuff like an if, and then one statement, semicolon, and another one on the same line, and then you have to remember that the last one is not included in the if or the for loop. That leads to really, really dirty bugs. Moving on.
So. Why do we care about atomics? I've given you here the lstopo of my machine.
It has 24 gigs of memory, four cores with two compute units each, and you see it's pretty powerful, and that's a laptop. So, to make problems solvable, we need a lot of compute power, and Python is great for teaching algorithms and whatnot, but when it comes down to raw compute performance, we want to be able to leverage all these eight cores or whatever we have. I mean, I've worked on systems with 128 cores.
How many times do we need integer data? Most people think, like: yeah, I will work with strings or tables and databases. But you can do a lot with integer data. One of the big things is counting stuff or indexing stuff.
Index has type usize, because that's what the data index brackets force the type into, and because we operate on the index level, it actually lives in RSP; it's a register, so that symbol does not have a memory address at runtime. And because LLVM is pretty cool, the data will get analyzed... I mean, the program code will be analyzed, and then the compiler decides: oh, it's just for loops, right. It can determine that it's just for loops from the control flow, and it will say: too bad.
We will unroll that, so you don't see this loop in the assembler anymore, and that's pretty powerful. We don't have to care if we run this on a really small ARM machine; this will give us a huge improvement, because it doesn't matter what our machine is: there are no more index calculations, it's just jumping into the fields and reading them out.
This is a sign that you have to be really, really careful, and you should add a note why this unsafe is safe; and, as we will see, this one is not. Moving on: the watcher thread waits a little bit at the start, it waits for 500 milliseconds, just to give the operating system a chance to, like, set everything up, and the other one is like a clock, right: it goes round and round and round.
Another thing you may have noticed: if we didn't copy it into threshold_local, we would have to fetch it from memory every single time we access threshold_local. So this is another hint that this is probably unsafe and probably not as stable as we have it here. We have no comments; not great in production.
But this is what we get: we get 88, and we have observed it once; and then we get 93, we have observed it... one thing... 98, observed it just once. So this zero is how many repetitions we have... we wanted to have, I think it was a hundred thousand samples, and we only recorded 99,000 transitions. So this is like all over the place. But now we think: yeah, maybe it's just too slow, right? Maybe it misses it because it's debug mode, and debug mode is slow. So let's put it in release mode.
This is... this is even worse. Now we don't see any transitions anymore.
I don't see anyone proposing any answers anyhow, so: the main problem is that the compiler looks at this from a single-threaded point of view and says: this thing here is a symbol that we see out here.
And yes, as a compiler optimization, this is correct. So this one is a local symbol; this one is always the same, so we don't need this update anymore; and then this value will always be the same, so this is always true, so we can just increase the count. This branch here gets discarded completely, and then we have max_test, which just contains a count increment. The compiler is smart enough to transform this loop into just assigning the max value to count itself, and then it notices:
no, count is never read, actually, so it doesn't need count anymore. And then we see: last isn't read either, so we don't need that anymore either. History? Yeah, we still need that: we need to allocate that thing, because the user forced me to, and then we have to give it back, and that's the end of the story. So the only thing that survives this is a huge allocation and a return statement. Not very obvious in this small code, and it gets even worse in bigger code. So.
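A hedged reconstruction of that kind of function; the names (max_test, count, last, history) follow the talk, but the body is an assumption:

```rust
const MAX: u64 = 100_000;

// In release mode the optimizer proves `count` ends at MAX, sees that
// `count` and `last` are never read afterwards, deletes the loop, and
// keeps only the allocation that the caller receives.
fn max_test() -> Vec<u64> {
    let history = vec![0u64; 1024]; // survives: it escapes via the return
    let mut count = 0u64;
    let mut last = 0u64;
    while count < MAX {
        last = count; // dead store, never read: removed
        count += 1;   // loop folded to `count = MAX`, then discarded
    }
    history
}
```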
So let's fix that. So we have the counter function, and we say: hey, compiler, don't optimize this away. Yeah, Vladimir just wrote in the chat that Rust needs the volatile keyword, and we don't... we really don't want the volatile keyword; it's badly designed. It means different things in different C versions, and it means different things in C++ and in C,
even though the same compiler compiled that code. If we actually need volatile, then we have these read_volatile and write_volatile pointer functions, and we can actually do stuff like that. And in this case we don't get optimized away, because the compiler knows: yeah, we have to touch the memory, therefore we have a side effect, and this has to be attributed to something. So this will actually write to the memory, always, and then we get somewhat correct numbers.
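A minimal sketch of that pattern, assuming a raw pointer to the shared counter (the actual setup code isn't on screen here):

```rust
use std::ptr;

// Volatile accesses count as side effects, so the compiler must keep
// every read and write; note this is still NOT thread-safe.
fn bump(counter: *mut usize) {
    unsafe {
        let current = ptr::read_volatile(counter); // always loads memory
        ptr::write_volatile(counter, current + 1); // always stores back
    }
}
```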
I'm really sorry, I have to read you the numbers from the source code, because there is an entire segment missing here.
Okay, let's trust the slides. So now we need atomic operations. Really sorry for messing up the light text. So atomic access is stuff that appears as one indivisible operation to the user of the CPU.
There are some optimizations for that. That means we have to pay attention to caches. We can do misaligned access, but it will be very slow. This is especially true if we have bit flags that we have to check, because we have to fetch and load the whole block, and then access the word, and then perform AND or OR operations on the thing, and then we get the bit we want, which is terribly inefficient.
Another thing that most people are not aware of is that atomic operations also have effects on memory reordering and instruction reordering, and making sure that this works properly on every CPU is very, very difficult, because it's different on AMD64.
It's mostly the same on Intel x86-64, but it's different on 32-bit, and it's definitely different on ARM; depending on what ARM it is, it's very different. Sometimes we have reordering in the CPU, sometimes we don't. All of these things we would have to keep in our head all the time we compile the program. So why not use an abstraction?
This is the same with the LOCK prefix. So the most important thing here is actually the bold keywords: these are the assembler instructions that will behave differently with the LOCK prefix, and the LOCK prefix is what our abstractions with atomic variables and pointers and counters use to actually work, for compare, increment, add, sub, whatever. For all of these operations we need CPU support; the hardware has to guarantee atomic execution, and on AMD64 it's called LOCK, but on other ISAs, on other instruction sets, it's different. So, oh yeah.
Almost forgot about this one: we have very different guarantees. So, on Intel... Intel is very optimistic, and everybody has heard of Spectre, I assume, and other side-channel attacks, and it's buried in this number: how many cycles does it take? One cycle, for instance: if you increment one number single-threaded, it's one cycle, pretty fast. Atomic operations: 20 to 120 cycles. That's not great, but it's still...
Sorry. Intel CPUs are not running instruction sets in hardware like AMD's are; they are actually virtual machines that emulate the CPU and behave the same most of the time. That's why we need microcode updates all the time: because Intel makes mistakes and they have to be fixed. AMD Opteron, as the last platform I was able to get solid numbers from: it's 40 cycles for every operation, very deterministic, very nice for everybody that relies on real-time systems. For the Ryzen and Epyc...
So, oh, for those of you that are new to Rust: this is more functional style. I'm really sorry if you're confused by the thing, so let me go quickly through it.
We can also have an inclusive range: then it's one number, dot dot equals, and then the other number. Brilliant. Why do we need brackets? Because if we put this .map here, it would go to the second number. We don't want that: we want to have the range, and then, from this range on, we want to map. So ranges are iterable; this is a for loop.
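In code, the point about the brackets looks roughly like this (the numbers are made up):

```rust
fn main() {
    // The parentheses make `.map` apply to the whole range; without
    // them the call would attach to the second number, not the range.
    let doubled: Vec<u64> = (1..=4).map(|n| n * 2).collect();
    assert_eq!(doubled, vec![2, 4, 6, 8]);
}
```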
This is the closure that we spawn. Inside the closure we get the counter pointer, because we need that, and then we use read_volatile... sorry, this will read_volatile to get the current value, increment it by one locally, and then write_volatile back into the thing, because we know where it is and we can do that, and it's unsafe because you shouldn't do that.
So this is our setup, these are our spawn handles, and then we use collect. Collect means: everything that has been generated by the previous iterators, so these handles of these functions, we now collect into a vector. And because we cannot actually... well, in this case we can, this one is just laziness: I don't want to spell out what exact type this is. This would be a JoinHandle of a function that returns something, but I don't have to; I can just say underscore: figure it out, compiler.
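That pattern, as a minimal sketch (thread count and body are placeholders):

```rust
use std::thread;

fn main() {
    // `Vec<_>` lets the compiler infer Vec<thread::JoinHandle<()>>.
    let handles: Vec<_> = (0..8)
        .map(|i| thread::spawn(move || {
            // per-thread work would go here
            let _ = i;
        }))
        .collect();

    // join each handle: wait for the first, then the second, and so on
    for handle in handles {
        handle.join().unwrap();
    }
}
```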
The reason is that we need to buffer all of the jobs, because if we don't buffer them in this vector, then we will just spawn one at a time and process them in order, so they don't run concurrently. We don't want that, so: start all of them. At this point, all of them are started, and then, once everything is started, we'll loop over what we have, and then we say: okay, for each,
we wait for the result. So we'll just wait for the first one, the second, and so on, and then, yeah, that's it, because we operate on the global thing. And then we can ask ourselves: how big is it? So we have to calculate the pointer again with this, and then we can just read it... read the pointer, read_volatile, and say: okay, what we expected is how many parties we had times how many increments we wanted to do, and the result is pretty bad.
It's way off, and the reason for that is sometimes, actually most of the time, because we're pretty close to the lower end: all four threads read the same value, do the same calculation, and write back the same value. If you're a little bit lucky, they get a little bit out of sync, and then the value starts to increase a little more than one thread would have done alone.
Oh, moving on: ordering. If you come from C++: they introduced the concept of ordering, because there are different trade-offs when you write to stuff. For instance, there is... you see this at the bottom: sequentially consistent. This means we don't want memory reordering and we don't want instruction reordering. So at that point, when we load this thing, all the code that came before it has to be finished; every memory read and write has to be finished at that point.
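A small sketch of what that guarantee buys, as a hypothetical publish/consume pair (my example, not the talk's code):

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

// writer: the store to DATA must be finished before READY turns true
fn publish() {
    DATA.store(42, Ordering::SeqCst);
    READY.store(true, Ordering::SeqCst);
}

// reader: once READY reads true, the 42 is guaranteed to be visible
fn consume() -> usize {
    while !READY.load(Ordering::SeqCst) {}
    DATA.load(Ordering::SeqCst)
}
```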
These are details. If you get into that: hire me. I mean, I have time for a project; I'm just doing a master's degree. All right. And now, with fetch_add and load, we get what we wanted: we have the exact store. The runtime is comparable, if I remember correctly, but I don't have the numbers here. So that's what we should do.
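The fixed counter, sketched under the same assumptions as before (thread and iteration counts are placeholders):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let handles: Vec<_> = (0..8)
        .map(|_| thread::spawn(|| {
            for _ in 0..100_000 {
                // one indivisible read-modify-write: no lost updates
                COUNTER.fetch_add(1, Ordering::SeqCst);
            }
        }))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    assert_eq!(COUNTER.load(Ordering::SeqCst), 8 * 100_000);
}
```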
Also, as you may have noticed, we don't have unsafe anymore, so it just works. By the way, if you're using a more recent compiler: ATOMIC_USIZE_INIT, this constant, has been replaced by const generics code, so you can specify a certain number at compile time. So we don't have to have an init function that runs before our main and then sets the atomic value to a certain thing if this value is not zero; we can just write it into the code and put it to 42 or whatever we need. So: cache lines. So why was this so...?
This lock is a really bad one; the real mutex is a little more complex. Also, this is relatively slow, but if we don't have an operating system, that's how we are going to operate: we have to have an atomic linked list. Linked lists in Rust: usually really bad, but this time, yeah, we need it as a queue. And then we have this atomic queue, where everybody that wants to acquire the mutex has to operate on this atomic queue, and then we have to create these data structures,
add ourselves to the list, see that no one else has manipulated the list in a way that we don't like, and then check if we're at the head of the list; and if we are, we can access the value and manipulate it. And if we are done with the list, we can remove ourselves from the list and point the thing to something else, and then we have to actually wait for a bit until the hardware flushes out the cache information, and then we can free the thing.
If you think this is insane... who would write such a complicated thing? That's actually how the Linux driver stack works: just writing stuff around in memory and waiting a little bit, in the hope that no one else has access to it. And then there is a value in the documentation:
if you acquire any structure from the USB kernel driver stack inside the kernel, you're allowed to hold it for, I think, four seconds or something like that, and then you have to let go. Maybe it's less nowadays. So yeah, the point being: hotplug is hard, and doing that stuff in C is harder, and that's the solution they came up with. So, oh yeah, before we move on: there's another cool crate called parking_lot, and that leverages operating system support.
So that's the thing: we get the virtual interrupt from the CPU, and then we can react to that, and that technique can be used to build new mutexes.
They're called futexes, and they're faster than the traditional mutexes, because we leverage the hardware support; but it doesn't work on all platforms and usually requires operating system support. But it's really cool and it's really fast; look into parking_lot if you're interested.
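For reference, a minimal usage sketch of parking_lot's mutex (my example, not from the slides); its lock() returns the guard directly, with no poisoning to unwrap:

```rust
use parking_lot::Mutex; // external crate: parking_lot

fn main() {
    let counter = Mutex::new(0u64);
    {
        let mut guard = counter.lock(); // no .unwrap() needed
        *guard += 1;
    } // unlocked here when the guard drops
    assert_eq!(*counter.lock(), 1);
}
```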
So: cache lines. Taking another sip.
Our CPU is great, let's be honest: it's a marvel of technology. Whenever you think computers are shite, think again: think about how many things a CPU can do, how fast it is, and with what dumb instructions we have to feed it, and it still makes stuff fast, right?
I spoke to a professor a couple of years back, and he said: yeah, it's no fun with these modern CPUs, with the instruction reordering and the memory reordering, because even if you have the worst assembler code, these new... back then it was kind of the i7, this new i7, so this was like 2010-ish.
These i7s are bad, because they reorder your really bad assembler code and make it performant, and you don't even see why, so there's no point in optimizing assembler anymore, and they were just like: yay. So they actually cancelled that class, because the last time they had hardware that was old enough that it would not do instruction reordering and memory reordering was with these old 8086 32-bit CPUs: single core, really old, like, they don't need a cooler, they are that old.
They're like this big, like a AAA battery, and they run at, I think, 80 megahertz, but that's already overclocked, so the default clock is 40.
So, focusing back, I'm sorry, because otherwise this gets too long. So we have one NUMA node in my machine; our bigger machines have multiple. We have L3 cache: this is the cache the latest Ryzen generation has a lot of, like, I think, 16 megabytes instead of four, for a laptop CPU, which is great: it improves performance. Then we have L2.
They feed a lot of instructions, while the data stays mostly the same. On the other hand, if we have a program that is fast but runs in a loop, like a graphics calculation, the instructions are usually the same, but the data gets fed through really fast. So we need to have these two caches that are independent, and the data cache is smaller in my CPU. These caches grow with every generation, because usually it's the easiest way to improve performance by ten-ish percent: you double these caches. And that's what manufacturers will keep doing, because clock speeds are at the limit: we cannot go above five gigahertz without super special cooling.
Moving on: this is a more colorful diagram; this one's from AMD. So we see the instruction cache here, and this one gets fed by branch prediction. That's what I said about instruction reordering... instruction reordering is before that, then this branch prediction: the CPU actually calculates both branches of your if, and then does operation caching, and then the micro-op queue.
Even if we lose performance of 40 or 50 percent, we don't care: we don't want anything weird to happen, until there's this other flag that comes on and says: yeah, from now on it's application code, it's not security relevant anymore, you may optimize and decode in any fashion you want. Also, what's not shown on this diagram is the decoder and the op cache; they instruct the memory prefetcher.
So the prefetcher will actually fetch memory that's not used by your program yet, but the algorithm is like: yeah, there's a good chance it will be used soon, so let's fetch it from the memory, because the memory is really slow. Another fun fact: for the CPU, the part that we see on this diagram, the cache is mostly transparent.
So it does not see the difference between L1, L2, L3 and memory; it just operates on the thing, and the memory prefetcher will fetch stuff from the memory over L3 into L2 into L1, and then the actual compute units, the orange and the red ones here in the middle, will do their work. There are other compute units, and it would be another talk on what you can and cannot do with all of that.
Who came up with measuring this? It's insane, right: so many things, just between two instructions, and you usually never have to think about it. But one common thing is: if you have performance counters, you may use them with... what's its name... Prometheus, right. There's a Prometheus crate; you can say: okay, here is a bunch of counters.
I want to measure, I don't know, how many requests I get in my web server per, I don't know, second, or in total, maybe per route, right: you have the main page, some API call, some other API call. Then you maybe want to measure how much time it takes to fulfill all these requests, how many are authenticated...
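Such a counter could look like this (a sketch; the metric name, and the Relaxed ordering that the conclusion below recommends, are my choices):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// One global performance counter; Relaxed is enough, because we only
// want the count and impose no ordering on surrounding memory traffic.
static HTTP_REQUESTS: AtomicUsize = AtomicUsize::new(0);

fn handle_request() {
    HTTP_REQUESTS.fetch_add(1, Ordering::Relaxed);
    // ... actual request handling ...
}
```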
So we have our project here. Oh, by the way, this is xr, for the people that don't use it: it's very colorful. And then we say, in this rust-toolchain file: nightly. So that means all the cargo commands that we issue will automatically get redirected to nightly.
If we exit this folder and use cargo there, it's whatever else we have; in my case this is stable. So I can recommend that for measurement projects like this. Just as a side note: black_box is a really handy function, because it tells the compiler: don't optimize this away, don't touch it, just treat it as random input and output.
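A small sketch of black_box in a benchmark loop (today it is stable as std::hint::black_box; at the time of the talk this needed nightly):

```rust
use std::hint::black_box;

fn main() {
    let mut sum = 0u64;
    for i in 0..1_000_000u64 {
        // black_box hides the value from the optimizer, so the loop
        // cannot be folded away or precomputed at compile time
        sum = sum.wrapping_add(black_box(i));
    }
    black_box(sum); // keep the result "used"
}
```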
Let me see if I can show you the code real quick... src/main. So this is the trait, these are the counters, and then these are the tests; they're called bench. And I have this type argument that then gets matched by a constructor, and then the work gets done. So, moving on: each structure has eight fields, because I designed this test for up to eight CPUs; I only have eight. And, yeah, the fields are just named along the alphabet, and the order is important to us: so a is zero, b is 1, and so on. Normal is just atomic
usizes all the way. Then we have alignment: so we align the whole structure by 64 bits... bytes, pardon me. That means we will start at the cache line, right. So when the compiler allocates space for that thing in memory, it will start at a cache line boundary. This means it will not lap over cache lines, because a smaller structure may fit with the first half in the first cache line and the second half in the second, and this will cost us performance again.
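In Rust, that layout trick looks roughly like this (a sketch; the a..h field names follow the talk, the wrapper type is my own):

```rust
use std::sync::atomic::AtomicUsize;

// 64-byte alignment gives every counter its own cache line, so eight
// of them take 8 * 64 = 512 bytes instead of 64.
#[repr(align(64))]
struct Padded(AtomicUsize);

struct Counters {
    a: Padded, b: Padded, c: Padded, d: Padded,
    e: Padded, f: Padded, g: Padded, h: Padded,
}
```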
So if we just run one thread, it's like eight milliseconds per execution; that's not bad. And then with two it gets worse, right, and then it gets really bad. So, okay, let's just make sure that it's aligned... okay, let's align it so it starts the same. But apart from that, there's not much difference, right: the small numbers change, but the big ones are basically the same.
What about if we have cache-line awareness? Interesting, isn't it, right? We hardly lose any time anymore, because the CPU is actually able to access different parts, different caches, and... I'm really sad that... that's like pure fireworks. That's it! If we look at the numbers combined: that's it! In this case we save a factor of 10 with eight cores, and that's great.
Yeah, if we order differently, if we have Ordering::Relaxed, I hinted at it before: see these numbers, they come down a lot, like a lot, a lot, because now we can tell the CPU: yeah, just whenever, right; we don't need you to interrupt your memory prefetching. And this test is backed by, I think, 100k entries, so we allocate an array, one block of memory with 100k entries, and then, depending on the kind, we will access them overlapped or in different cache lines, and we actually gain speed.
A lot. Like, remember this one, the first cache-line-aware one: it's like 8.7 always, unless it's normal. With relaxed, we are faster even with eight cores working on these numbers.
So the takeaway here is: if you have performance counters, always, always use Ordering::Relaxed, because otherwise you will interrupt the... I'm sorry, I want to head to the conclusion... you will interrupt the program flow and the memory optimization so badly with your measurements that you will distort whatever you're measuring too much to have any, like, relevant data. Yeah, other conclusions:
we need atomic operations for many things. There are lots of great crates, the standard library being one of the big ones, that abstract away all of these things.
It frees our heads from the burden of thinking about all of this stuff. Like, I think most of you heard about this stuff for the first time: just be happy that someone else took care of it with an abstraction, and you can manage other problems than, like, running around and handling individual threads. Just use a pool, or think about which hardware capabilities your program has to use to be able to run or not; the compiler will just tell you. And with that: opening up to questions.
Actually, I should be a good citizen and show you the release mode. So this is the size, how big everything is. AtomicUsize: eight bytes, right, eight times eight, sixty-four bits; that's what we expect. If you align the structure, like eight fields with eight things, then it's 512 bytes long, so it's half a kilobyte for one data structure, and the others, that are normally aligned, are just 64 bytes.
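Those sizes can be checked directly (a sketch, assuming a 64-bit target and the Padded wrapper from above):

```rust
use std::mem::size_of;
use std::sync::atomic::AtomicUsize;

#[repr(align(64))]
struct Padded(AtomicUsize);

fn main() {
    assert_eq!(size_of::<AtomicUsize>(), 8);       // 8 bytes = 64 bits
    assert_eq!(size_of::<[AtomicUsize; 8]>(), 64); // packed variant
    assert_eq!(size_of::<[Padded; 8]>(), 512);     // half a kilobyte
}
```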
Right, yeah, that makes sense. Yes, it's pretty cool. I really loved your talk, it was amazing; I'm really into performance talks, and I loved it.