From YouTube: Rust Cologne, March 2018: Florian Zeitz - Caches and You
Description
One of the great things about Rust is that it gives you a lot of control if you need it. Amongst other things, it gives you control over memory: how big is your data structure really, and where should it be allocated? This talk will look at caching in modern CPUs in conjunction with Rust data types and data structures. We will see how efficient code can be written to best utilize the cache.
More on the website: http://rust.cologne/2018/03/12/what-are-you-plotting.html
So, yeah, I'm going to talk about caches. Quick poll, I guess: who thinks they're somewhat intimately familiar with CPU caches? I think that's one and a half to two and a half people, right. Okay. And then, actually, because we haven't really spoken about that: who feels like they're still more of a Rust beginner? Just so I know — okay. That gives me some pointers on how much to talk about this.

Okay, so to lead with: like everybody knows, there are two hard problems in computer science, right. The first one is naming things, the second one is cache invalidation, and the third one is off-by-one errors, obviously. What I actually want to talk about today is, like, three sections. So first I'll give you a quick caches 101: just what is the CPU cache, what does it do, why is it there. And then I'm going to talk about caches and performance.
So, a quick caches 101. Modern computers, right — there's a problem: access to RAM is relatively slow, and in particular it's relatively slow compared to how fast modern CPUs actually are. Modern CPUs run at about 4 gigahertz, and accessing just one piece of data from the RAM is a lot slower than that. So a lot of the time, if we're just reading from random access memory, we're waiting for memory; we can't compute with that memory. So the solution that's been deployed since forever is caches: small memories close to the core that are a lot faster.

Usually caching is done in two to three levels, so each level is further away from the core, and the level closer to the core is filled from the level above it. Some schematic pictures for that, just so you have seen it once: you usually have cores — let's say four per processor in this case. There are registers in those cores, which are basically the fastest thing the processor can access; you can usually read from them in a single cycle.
Then there is the L1 cache, which is a bit bigger, and outside of that the L2 cache, which is significantly bigger already than your L1 cache. So you get more data in it, but it's a bit slower. So if data is missing from your L1 cache, you might be lucky and it's still in the L2 cache: you get it from there instead of going back all the way to the RAM. And then again one layer out from that is your L3 cache which, if it exists — it doesn't always, but if it exists — usually spans all cores, so you have sort of a shared cache between cores. That's also kind of useful if you have threads that actually migrate between the cores, so they don't, after migration, have to access RAM; they can read from the L3. And then beyond that you obviously have RAM itself, which is the slowest thing in this picture; after that, in theory, there's swap space, if you have your RAM full and all that.
A
So
in
terms
of
how
caches
actually
work
it's
not
as
if
they
read
each
byte
individually.
If
you
get
data
from
the
RAM,
it
will
always
or
typically
always
end
up
in
caches
and
will
end
up
in
all
caches
up
to
the
CPU.
So
if
you
read
but
later,
it
is
an
l1,
l2
and
l3
cache,
and
it's
done
at
the
granularity
of
what's
called
the
cache
line.
So
you
can't
read
a
single
byte
or
four
byte
at
a
time,
but
always
would
call
the
cache
line.
Cache
is
working
that
granularities.
If you have to throw away data from the cache, all the data in that cache line will go away. The sizes differ between CPUs, obviously; a typical size today is something like 64 bytes. Right — so, cache invalidation, the thing from our initial joke. Modern caches are what's called coherent. You've seen that we have separate caches for separate cores: each core has its own L1 cache. But the thing is, if you read data from multiple cores, and also write to the same data from multiple cores, you obviously get problems with coherence if you're not doing anything special. Because then you'd write on one core, and it would only end up in the L1 cache of that core; it's not necessarily going through to memory. Another core might have the same data in its cache, and it would read the old data, because it never actually saw the write. Even if the write did go through to RAM, the other core would still have the old data cached from its earlier read and would still read the old value.
So what you have to do in order to get rid of that is implement some sort of what's called a cache coherency protocol and invalidate the data that's in other cores' caches. So whenever anyone writes, you have to invalidate that same data in everyone else's caches. And in addition to that, there are two other reasons that caches might drop data. With inclusive caches, if a line gets evicted from the shared L3 — say because of what another core is doing — then that data also has to be dropped from the L2 and L1 caches of, let's say, core 0. That's just an implementation detail and a property of inclusive caching, but also something to keep in mind: even if you're working on something that fits comfortably in the L1 cache, something another core is doing could potentially invalidate that data. So, some numbers for that, just to give you an idea. This is some info for a Skylake processor, so a relatively recent i7, I think.
The L1 cache is typically something like 32 kilobytes in size, the L2 cache is around 256 kilobytes, and the L3 cache is a whole 8 megabytes. But on the flip side, the further away from the CPU you get, the slower the whole thing gets: the L1 cache is already at about one nanosecond, the L2 cache is at about 3 nanoseconds, and the L3 is slower still.

Yeah, one quick last thing in the 101 section: there's prefetching. If you're reading linearly through data, either forwards or backwards, data is not fetched only when you actually access it; the processor also recognizes "hey, someone is iterating forwards through data — let me already fetch the things they'll probably access in the future". So while you're still doing calculations and not necessarily thinking about accessing the data, the processor will already have gotten it into the cache, so you can access it quickly, which is a kind of useful property.
So, on to caches and performance. I think this is the super classical example for this: matrix multiplication. The usual way you do matrix multiplication is, let's say, this: you have a function and you give it two slices. What I've done in this case is that these are two-dimensional — arrays of arrays — so each slice contains a set of arrays; the arrays all have length DIMENSION, and the slices, as you can see from this, also all have length DIMENSION. That's the theory about that.
And then the output is a vector. What we're basically doing is allocating a new result vector, then iterating over the rows and cells of the whole result vector with indices, and then we have that variable k to do your usual matrix multiplication. If you don't remember: matrix multiplication is "multiply the row by the column and add that up", and that's what this does — a is indexed with k as its second index, and b is indexed with k as its first index.
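Reconstructed from that description — this is not the actual slide code — the function might look roughly like this. DIMENSION, the element type (u32, matching the four-byte elements implied by "sixteen elements per cache line" below), and the exact index order are my assumptions:

```rust
const DIMENSION: usize = 1024; // assumed size; the talk doesn't state it

// Multiply two DIMENSION x DIMENSION matrices stored as slices of rows:
// result[row][cell] = sum over k of a[row][k] * b[k][cell]
fn matrix_multiply(
    a: &[[u32; DIMENSION]],
    b: &[[u32; DIMENSION]],
) -> Vec<[u32; DIMENSION]> {
    let mut result = vec![[0u32; DIMENSION]; DIMENSION];
    for row in 0..DIMENSION {
        for cell in 0..DIMENSION {
            for k in 0..DIMENSION {
                // a is walked linearly (stride 1); b is walked with a stride
                // of DIMENSION elements, i.e. one whole row per step of k.
                result[row][cell] += a[row][k] * b[k][cell];
            }
        }
    }
    result
}
```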
What that means in practice is that it goes linearly through a — each element in a is accessed one after the other — but it goes through b with huge jumps, where huge means DIMENSION elements, because it walks down a column instead of along a row. So the second element it touches in b will be one whole row further along in memory. A different approach to that: store your data differently.
So what we want to take from that: CPUs, as alluded to, are very good at linear reads — due to prefetching, but also due to the fact that, if you fetch one element, you get a whole cache line. In our case we have u32s, so if you fetch one element, you get sixteen elements read into the cache and don't have to do RAM reads for those sixteen elements; and while you work through those elements, there is time to do prefetching again and, in effect, fetch your next cache line, and things like that. The way you store data is important — you can always think about that: try to have data laid out linearly in memory instead of with big gaps, and try to keep it in large contiguous chunks. Also note that what I've deliberately done here is use those arrays.
If this were a nested vector instead, I'd have a heap allocation which basically contains pointers to further heap allocations, which means I'd still have DIMENSION elements directly adjacent, but then I might have a huge gap before I get the next DIMENSION elements. So using arrays in cases like this is often helpful. The reason I specifically chose a Vec at all — you obviously could do a two-dimensional array, just arrays, and put DIMENSION in twice — is that I needed it on the heap, because the size at this point is so big that it doesn't actually fit on the stack, and we don't have a good way yet to allocate an actual array directly on the heap. So that's what this does; it's the easiest way to get away with it. And the other thing is that a vector whose inner element is an array gets a nice data layout in memory.
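For comparison, a minimal sketch of the layout difference being described, with an assumed DIMENSION:

```rust
fn main() {
    const DIMENSION: usize = 1024; // assumed size

    // One contiguous heap allocation: DIMENSION rows of DIMENSION u32s back
    // to back, so iterating the whole matrix is a single linear read.
    let flat: Vec<[u32; DIMENSION]> = vec![[0; DIMENSION]; DIMENSION];

    // A Vec of Vecs: the outer Vec stores pointers, and each row is its own
    // heap allocation, so consecutive rows can be far apart in memory.
    let nested: Vec<Vec<u32>> = vec![vec![0; DIMENSION]; DIMENSION];

    let _ = (flat, nested);
}
```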
OK, a different example, a bit different. Let's say you're, I don't know, writing a physics simulation, writing a game. You have a huge pool of objects, and you're reusing those objects sometimes, but a lot of the time they are not alive, so to speak. So you always store them, but you don't always need to do computation on them. So you put an is_alive tag as the first element of the object and then, as you can see on the right, you iterate through that list of objects and just check for each one whether it's alive.

The object is slightly larger than a cache line on this particular system, and that's typical on many of them these days. So what we're doing is fetching a cache line, or maybe even two, for the object, checking — let's assume — the very first byte of it, and then, if it isn't alive, we don't care about it anymore; and we do that for each object. So basically every object access, even if you're doing no computation on it at all, gives you a cache miss, and you have to load a cache line.
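A sketch of the pattern being described — the struct layout and field names are my assumptions, chosen so one object is a bit larger than a 64-byte cache line:

```rust
// Assumed object layout: the liveness flag plus enough payload that one
// object is around a cache line or more in size.
struct Object {
    is_alive: bool,
    position: [f32; 3],
    velocity: [f32; 3],
    payload: [u8; 48],
}

fn update(objects: &mut [Object]) {
    for obj in objects.iter_mut() {
        // Even when the object is dead and we do nothing else with it,
        // reading `is_alive` pulls in a whole cache line for the object.
        if obj.is_alive {
            for (p, v) in obj.position.iter_mut().zip(obj.velocity.iter()) {
                *p += *v;
            }
        }
    }
}
```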
So what do we do instead? There are a couple of things you can typically do. One of them is a separate vector just for that flag — and make sure the indices line up. That way you can iterate over the first vector to check liveness, or whatever other information you might need — information you iterate over often, as opposed to data that you only read rarely — and that will be cache efficient, because everything is just one after the other, and it gives you nice prefetching.
The other way to do it is a conversion that's not too typical, I think, but seen relatively often in things like physics and high-performance-computing stuff, which is turning an array of structures into a structure of arrays. So where you now have a vector of objects, you would instead have, for each field of the object type, a vector containing that field for all the elements.
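A minimal sketch of the two alternatives — a separate, densely packed liveness vector with matching indices, or a full structure of arrays; the type and field names are mine:

```rust
struct Object { /* payload fields, as in the sketch above */ }

// Option 1: keep the frequently-scanned flag in its own dense vector,
// with indices lined up with the object vector.
struct World {
    alive: Vec<bool>,     // scanned every frame, very cache friendly
    objects: Vec<Object>, // only touched for objects that are alive
}

// Option 2: structure of arrays - one vector per field instead of one
// vector of structs.
struct Objects {
    alive: Vec<bool>,
    position_x: Vec<f32>,
    position_y: Vec<f32>,
    velocity_x: Vec<f32>,
    velocity_y: Vec<f32>,
}
```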
In theory, compilers can do that automatically, but it's, let's say, rarely successful, and the main reason is basically that, in order to be able to do that optimization, you have to be sure that nobody else is looking at that type — because if you're changing the type, the external interface obviously becomes different. So if the data is sufficiently private, compilers will actually, in some cases, do that optimization for you. Okay — is everyone done reading? Then I'm doing a decent job. Okay.
So, this one I don't have a cute code example for, but it's a general point: code size matters. If you have some hot code, ideally that code would fit into your L1 instruction cache — which, in this case, I'd say is also 32 kilobytes. Usually people say that inlining is always cheaper, or whatever, because function calls have overhead and everything; but particularly for code that's being repeated, it might be nicer to actually outline that code into a function, if that makes it fit into the cache, so you don't get a cache miss every time that code runs. That can happen with large amounts of code overall.
So if that's what's happening to you, you can try to annotate functions with #[inline(never)], if you think that gets you a speedup. Also interesting: for generic functions you get, for each type argument you put into the generic function, a copy of that function, so that also blows up the size of the code that's actually being executed at a time. You can try to get rid of that to some extent by passing a trait object — I don't know if that's a trait object or a reference to a trait; I'm never quite sure whether we say "trait object" just for Box<dyn Trait> or also for &dyn Trait — but basically: take something that just implements the trait, as a reference, and do it that way, which won't give you the code duplication. But for all of this advice, obviously: measure. If you're optimizing hot code, as a rule of thumb you never know in advance what's actually the problem. You can try those things, you can look at how big the code actually is, and you can try some of this advice, but always measure, always check whether it helps or doesn't help, and only keep it if it does — because in particular things like #[inline(never)] on very small functions can also significantly make things worse.
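As a hedged illustration of the two knobs mentioned — #[inline(never)] to keep cold, repeated code out of line, and a trait object instead of a generic parameter so only one copy of a function is compiled — something like:

```rust
use std::fmt::Display;

// Forcing a cold, repeated chunk of code out of line can keep the hot loop
// small enough to stay in the instruction cache.
#[inline(never)]
fn handle_rare_error(code: u32) {
    eprintln!("something went wrong: {}", code);
}

// Monomorphized: one copy of this function is generated per concrete T.
fn print_all_generic<T: Display>(items: &[T]) {
    for item in items {
        println!("{}", item);
    }
}

// Trait object: a single compiled copy, at the cost of dynamic dispatch.
fn print_all_dyn(items: &[&dyn Display]) {
    for item in items {
        println!("{}", item);
    }
}
```

Whether either of these actually helps in a given program is exactly what the "always measure" advice is about.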
So that's something to look out for. Okay — this is the code that I was referring to before. It's super synthetic, not the kind of thing you will ever find in actual code, but it's something that shows the phenomenon really nicely. Just for reference, each element is 8 bytes, because this is a 64-bit system. What I'm doing is getting another reference — a reference-counted handle — to the x array and passing it to a new thread, and within the thread I'm, 100 million times I think, incrementing an element; and outside the thread, so on my own thread, I'm also incrementing another element that many times. The particular elements which I'm incrementing are passed into the function as i and j. So is it sufficiently clear to everyone what that function actually does?
Maybe — okay. So basically it increments two different elements of an array, a lot. And it turns out that incrementing adjacent elements in that array is a lot slower than incrementing elements that are, in this case, seven or eight apart. So — any takers for why that might be happening? Any ideas?
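The demo code isn't shown in the transcript, but reconstructed from the description it is roughly the following. The real code presumably uses plain integers behind an Arc, while this sketch uses atomics to stay within safe Rust; the iteration count is the "100 million, I think" from above:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

const ITERS: usize = 100_000_000; // "100 million, I think"

// Increment element i on a new thread and element j on the current thread.
// When i and j land on the same 64-byte cache line, the run is much slower:
// every write invalidates that line in the other core's cache.
fn bench(i: usize, j: usize) {
    let x: Arc<Vec<AtomicUsize>> =
        Arc::new((0..16).map(|_| AtomicUsize::new(0)).collect());
    let x2 = Arc::clone(&x);

    let start = Instant::now();
    let handle = thread::spawn(move || {
        for _ in 0..ITERS {
            x2[i].fetch_add(1, Ordering::Relaxed);
        }
    });
    for _ in 0..ITERS {
        x[j].fetch_add(1, Ordering::Relaxed);
    }
    handle.join().unwrap();
    println!("i={}, j={}: {:?}", i, j, start.elapsed());
}

fn main() {
    bench(0, 1); // adjacent: false sharing, slow
    bench(0, 8); // eight usizes (64 bytes) apart: separate lines, fast
}
```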
Yeah — so the answer was that the two elements share a cache line. Precisely, that's actually the problem. What's happening here is that, when we have exactly adjacent elements, both threads are writing to the same cache line: one thread is writing to the first eight bytes of the cache line, the other thread is writing to the next eight bytes of the same cache line. And, as we said before, if writes to a cache line happen on a different core, you have to invalidate it on all the other cores.
So
typically,
this
is
a
cruise
if
values
are
on
the
same
cache
line,
obviously
because
otherwise
you
won't
have
this
false.
Sharing,
like
will
be
different
cache
lines
and
each
one
can
really
add
independently
on
their
own
cache
line.
Access
needs
to
be
by
different
cores.
Access
needs
to
be
somewhat
frequent,
let's
say,
because
only
only
then
will
that
actually
make
a
difference.
A
So
if
cache
is
invalidated
anyway
between
your
writes
like
but
something
else,
but
just
using
data,
this
won't
matter
right,
but
if
it's
somewhat
frequent
and
you
actually
invalidated,
while
someone
else
cool
I've
wanted
to
access
it,
that's
a
problem,
and
there
has
a
piece
to
be
at
least
one
writer,
because
if
they're
only
reads
no
curtain
will
ever
be
invalidated,
I
think
this
code
actually
might
want
to
run
just
to
sew
something.
So,
oh
great.
A
A
This is basically how I benchmark that. The first thing is that I actually run the function once without measuring anything, and the reason you'd want to do that, if you're measuring anything related to caches, is that otherwise it usually messes things up: you get cases where you're measuring, in theory, different things and notice that the second one is always faster, even though you didn't expect it to be — and in that case the effect is usually that your cache was not warm. The first time it ran, everything got loaded into the cache; the second time it ran, everything was already in the cache, so it was a lot faster. And that also applies to different code, of course: if you run another algorithm entirely, but on the same data, the second algorithm will always get a positive effect from the first one having run, because the data already tends to be in the cache. So basically I'm always warming up the cache first, and then just using Rust's Instant interface — very, very basic — to get measurements. And, yeah, let's see whether it runs.
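The warm-up-then-measure pattern being described, as a minimal sketch using std::time::Instant:

```rust
use std::time::Instant;

fn measure<F: FnMut()>(name: &str, mut f: F) {
    // Warm-up run: get the data (and the code) into the caches first, so the
    // measured run isn't penalized just for starting with a cold cache.
    f();

    let start = Instant::now();
    f();
    println!("{}: {:?}", name, start.elapsed());
}
```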
So, it takes a while — but actually not too long, I hope. Right: immediately adjacent entries, in this case, more than five seconds — so, a lot of time. And when we set the entries eight apart, so not on the same cache line: less than a second. That's a very, very noticeable effect of false sharing. And as a way to actually prove that that's what's happening, I can force those threads to be on the same core, and if I do that...
So, this one is a different example, and credit for this one actually goes to — I forget the name. The basic thing you want to do is count the number of odd elements, in the matrix again, and we'll do it threaded. So basically, we get ourselves a result array, sufficiently large to be usable for different numbers of threads, and then we iterate over the matrix, but chunked into pieces — the number of lines divided by the number of threads.

The way we count elements in this case is that we write to one of the elements of our result vector: the first thread writes to element zero of the result vector, the second thread to element one. So, obviously, this should be a pretty classical case of false sharing right there: we write to adjacent elements, and the different threads are on different cores, supposedly, so it becomes a problem. The way you'd fix that is typically pretty obvious, actually. Can I get them both up on the screen? Well, sort of. Oops — sorry!
So the way you fix that is you introduce a local variable. Within the thread you just have your own local variable for the count — it's not in the array, not on the same cache line or anything — you increment that, and only at the very end do you write to the result vector, once. The problem with this code is, as I said, that it doesn't actually measure the right thing, because it turns out — and I've not looked at the assembly closely enough — it turns out that even with a single thread, where false sharing should not matter at all, the variant with the local variable is significantly faster. My guesstimate of why it is always significantly faster is just that the variant without the local variable always accesses memory — it always writes to the vector — but the one with the local variable can increment the variable in a register, so you're never accessing memory at all, not even the L1 cache, not anything. So, yeah, I think it's a bit unfortunate that it doesn't demonstrate false sharing like that.
But for me it's an interesting lesson, because what it shows is: the best thing you can do about caching and accessing memory is to not access memory. If you can do the work in a single register for some time, then only touch memory once, when you actually really need to — have a small local variable, just like this.
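Reconstructed from the description, the two variants of the per-thread counting loop look roughly like this; the function and parameter names are mine, and the thread setup is omitted:

```rust
// Variant 1: every increment writes to the shared result slot. Besides the
// false-sharing risk between threads, the count can never stay in a register.
fn count_odd_shared(chunk: &[u32], result: &mut usize) {
    for &value in chunk {
        if value % 2 == 1 {
            *result += 1;
        }
    }
}

// Variant 2: accumulate in a local variable (which the compiler can keep in
// a register) and touch the shared slot exactly once at the end.
fn count_odd_local(chunk: &[u32], result: &mut usize) {
    let mut count = 0usize;
    for &value in chunk {
        if value % 2 == 1 {
            count += 1;
        }
    }
    *result = count;
}
```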
Okay, so I thought I'd just do a quick run through a number of typical Rust data structures — not a lot of them, a few selected ones. Vectors, obviously: from a cache point of view, amazing, because a vector is one contiguous data structure in heap memory — no gaps, no nothing. You can iterate through it pretty fast; as I said, linear reads are what CPUs are great at. Whether you read through it forwards or backwards, you're good from the cache perspective, which is nice. And it turns out vectors are so good for caches that they often outperform other data structures at the very discipline those other data structures were specifically designed for, at least up to a certain number of elements, just because the cache often trumps everything.
If the other data structure for some reason has to use separate nodes that are spread apart in memory — sparse rather than tightly packed together — you get negative effects from that, and everything being in the L1 cache, as it usually is for the elements of a vector, is just nice.

So, linked lists. Linked lists are almost only in here because, if this talk were given in another language, someone would tell you about linked lists, because they show the effect so well. Who's ever used linked lists in Rust? Right — and it turns out I've never actually seen anyone use linked lists in Rust, but it exists, it's in the standard library, it always has been: std::collections::LinkedList is there. So: linked lists are not so great for caches, and that's usually why they're the example of "don't use this". Linked lists have one heap allocation per node — you have those separate nodes per element — and each node has one pointer to the previous element, one to the next, and the actual data.
Each of them is a separate heap allocation, and those heap allocations are not necessarily contiguous: unless you allocate them close together in time, they're very likely to be spread apart across your heap, and even if that's not the case, the allocator will usually look for bins that are pretty close to the size you're allocating, so they might still be spread out even if you allocate them back-to-back.

So iteration is pretty expensive: you're chasing, as I said, the next pointers through the linked list, and you are likely to get one cache miss per element, because you're jumping all across memory. And then the point I'd usually be making here is about insertion into linked lists — you might have been told, with great certainty, that linked lists are great for insertion, because insertion is very cheap.
Well, that insert function straight up does not exist, so, yeah, that's nice, I guess. What does exist is the append function, so you can link two linked lists together, which is super cheap — again, just changing two pointers — and actually, because we keep a pointer to the very back of the list, you don't have to walk across it or anything. So that's legitimately cheap, and legitimately a lot cheaper than chaining together two vectors.

What's also relatively cheap is pushing to the front or the back of the list — though, particularly for the back, it's again very likely to be slower than pushing to the back of a vector, because for the list you have to do the allocation for the new node, while for the vector you probably already have the heap space for that element, because vectors have a capacity, separate from their length, that grows in powers of two, so you only really rarely actually reallocate memory.
For vectors, push_front actually doesn't exist, interestingly enough. If you actually need push_front and push_back on a data structure, use VecDeque — that's the trick to basically get a push_front. The trick with VecDeque is that the underlying data structure is a vector — so, vector-great: one contiguous heap allocation and all that — used as a growable ring buffer. It wraps around past the last element of the vector back to the front, and whenever that ring buffer becomes too small, so the whole array is full, it will actually double in size, like a vector would with its memory allocations, and you get a larger ring buffer. That way you have super cheap push_front and push_back on that one too — and vector-cheap, not linked-list-cheap: you're not doing a heap allocation and storing a pointer to that heap allocation, you're actually just writing into a slot of the vector and you're done, apart from changing two indices, because it's a ring buffer.
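For reference, the VecDeque usage being described:

```rust
use std::collections::VecDeque;

fn main() {
    let mut deque: VecDeque<u32> = VecDeque::new();

    // Both ends are cheap: just a write into the ring buffer plus an index
    // update, with no per-element heap allocation like a linked-list node.
    deque.push_back(1);
    deque.push_front(0);

    assert_eq!(deque.pop_front(), Some(0));
    assert_eq!(deque.pop_back(), Some(1));
}
```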
So, HashMap — another data structure that's kind of interesting and used a lot — and hash maps in Rust actually happen to be pretty good. Interesting in the sense that people seem to be, let's say, a bit pissed off about C++'s... what is it, unique_map, I think? Is that the equivalent, for the C++ programmers? Yes — yeah, std::map is the one that people agree is shitty anyway. Oh yeah, unordered_map, right, the brain is working. Okay, so you have std::map and, as I now know, unordered_map, and unordered_map is pretty much what's equivalent to our HashMap, let's say, because std::map also has to keep up some ordering properties, I guess.
So it turns out the interface that C++ programmers — that everyone — gets is apparently specified in a way that only really works if your hash map is implemented in one specific way, and one specific way in this case means the implementation you probably heard about in university: you compute a hash, that hash gives you an entry in a list, and that entry in the list is itself a list of everything that hashes to the same value.

So if you care deeply about that, you might want to look at one of those. In any case: yes, it's cool, much better than in other languages; but, necessarily, just because it's a hash map it's not as good as a vector in terms of cache behaviour — then again, it also does a completely different thing: it does key-value lookup, and vectors just store your data.
There's also BTreeMap; the documentation will tell you nice things about how this is better than a regular binary tree, and there are probably also plenty of blog posts about it. The basic idea is: instead of the usual binary tree, where you allocate a node for each tree element — and again you'd get that pointer-chasing kind of thing — a B-tree keeps a number of elements together in each node, which is a lot friendlier to the cache.
So try to do that as often as you can: try to keep your working set small — the data that you're working on — both in terms of instructions and in terms of memory, if you need it to be fast; particularly in hot loops, obviously, because that's what you want to optimize. And if in doubt, use Vec; and second to that, maybe use VecDeque if you need a push_front for whatever reason.
Question about alignment — actually, I'm also not sure. I think there might now be an align attribute; I couldn't tell you for what exactly, though. There are technically multiple places where having an align attribute would make sense. It would make sense on actual data type definitions, so you could, let's say, define your vector as an SIMD vector type and make sure it's always aligned to 128 bits or whatever the requirement is.
There's probably something like #[repr(align(...))] or the like, and then, whenever the type is used, you get that alignment. And it would also make sense on individual variables, if you have just a very generic type, like an array, but a use case in mind that needs a certain alignment. I think at least the former has an attribute — it might be nightly-only, don't quote me on that — but there's certainly space to be filled there. Yes?
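On current stable Rust this exists as #[repr(align(N))] on type definitions. A common cache-related use is padding per-thread data out to a full cache line, so that neighbouring values can't falsely share one — a sketch:

```rust
// Force each counter onto its own 64-byte cache line, so two threads
// updating neighbouring counters no longer invalidate each other's line.
#[repr(align(64))]
struct PaddedCounter(u64);

fn main() {
    assert_eq!(std::mem::size_of::<PaddedCounter>(), 64);
    assert_eq!(std::mem::align_of::<PaddedCounter>(), 64);

    let counters = [PaddedCounter(0), PaddedCounter(1)];
    let _ = counters;
}
```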
I think part of the answer is: just do it. Coming back to the point where we'd usually talk about insertion and linked lists: as I said, they are not as good as you think they are anyway, because you have to walk the linked list; but also, they are surprisingly inefficient compared to vectors at things CPUs are apparently very good at — and, not surprisingly, given that a vector is straight linear memory, one of those things is shifting data around. I haven't done the benchmark myself, and I have forgotten the actual numbers, but people have compared inserting in the middle of a linked list with inserting in the middle of a vector — which obviously requires you to copy a whole bunch of trailing memory over — and the point at which linked lists actually became faster was surprisingly late: a few thousand, if not 100,000, elements or something like that, for, I think, typical 32-bit elements.

So, yeah: just do it, and don't expect it to be slow just because you feel like shifting things around should be slower. That's part of the answer. The other part is: just use the insert function where it exists — apparently it doesn't exist for data structures where it's inherently slow, interestingly enough. I feel the standard library guides you pretty well in terms of what you should be doing with a type.
If you tried to use push_front on a Vec, you'd see there is no push_front; and then, let's say, you go to the Rust IRC channel and ask why, or why it performs the way it does, and people will tell you: use VecDeque. It does basically the same thing, but it has a push_front, and that push_front will be a lot faster than it could ever be on the vector. So I feel like Rust is guiding you, to some extent, through which operations are actually available.
Honest answer: I don't know. The really, really broad answer is that the same tools that work for C++ often work for Rust, particularly in the Linux and open-source space. If a tool reads the DWARF debug info, there's no inherent reason it wouldn't work as well for Rust as it does for C++: you should be able to get the sample counts for a performance counter matched up to an address anyway — that's what these tools do — and then that address gets matched, based on the DWARF info, to a source line.
Audience: Some piece of advice that is often floating around is that you should keep enum variants of basically the same size. For example, you typically have a Result, and your Ok value is just a String, but your error type is a custom enum that has some variants that are very much larger than 24 bytes. So is this bad?
I think, in general, that would depend on the use case — on the point at which it makes sense to do something about it — but I think the observation is sound: you're obviously wasting a lot of potential cache space if you have got big variants in the enum, which may or may not matter, depending on which variants you actually use for the result type. The specific argument that I've heard people make is about the good case — the Ok case — that almost always happens:
A
Is
you
get
the
OK
variant
and
then,
if
you
do
so,
if
you,
of
course,
if
you
do
things
like
map
over
something
that
we
turns
a
lot
of
results
and
in
theory
all
of
them
are
like
you
ate,
and
one
bite
large
Patera
type
is
like
20
bytes
or
whatever.
That
means
you
have
to
access
20,
bytes
of
enum
and
theory.
Each
time
you
you
access
one
of
those
results
and
have
much
larger
gaps
in
between.
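A common way to act on that observation is to box the large error type so the Result itself stays small; a hedged sketch with a made-up error enum:

```rust
use std::mem::size_of;

// Hypothetical large error type, much bigger than the Ok payload.
enum BigError {
    Parse { line: u64, column: u64, context: [u8; 64] },
    Io(String),
}

fn main() {
    // The Result is as large as its largest variant...
    println!("Result<u8, BigError>:      {} bytes",
             size_of::<Result<u8, BigError>>());
    // ...but boxing the rare error variant shrinks it to pointer-ish size.
    println!("Result<u8, Box<BigError>>: {} bytes",
             size_of::<Result<u8, Box<BigError>>>());
}
```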
Audience: I'm a newbie to Rust, so my question is: is there some special trick that you can do in Rust that you cannot do in another language? For example — a good example is this inline case, where you say sometimes it's better to inline and another time it's better not to inline — could the language help you with this? Maybe you could say "inline only short stuff", or what kinds of things can a language do here?
Good question. I think just having the #[inline] attribute in itself is already something you could mention as something the language does for you — you can give it hints. Standard C and C++ certainly don't allow that; there are probably GCC extensions for it, though, and probably MSVC extensions as well. Other than that — good question, I...
Right — so, something I've mentioned in passing: what will help you keep data small is to just structure it appropriately. Well, that's the C advice for this, because in C your structs will always have the fields in the same order as you write them out. So if you do things like put an 8-bit integer first and then a 32-bit integer, you'll have three bytes of padding between them, which is space you're potentially wasting.
If you put the one-byte integer at the very end instead, it would not necessarily waste all three bytes. That's actually a terrible example, because it still pads for alignment reasons, but as soon as you have, say, two of those one-byte integers, it would help. That's something Rust does automatically: from the beginning we have said that a struct's memory representation is undefined in that sense — fields can be reordered by the language — and by now they actually are reordered; this has been implemented within, I think, the last year, which is amazing.
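A quick way to see the effect is to compare the default layout, which the compiler may reorder, against #[repr(C)], which keeps the declaration order the way C does:

```rust
use std::mem::size_of;

// Default repr(Rust): the compiler may reorder fields to minimize padding.
struct Reordered {
    a: u8,
    b: u32,
    c: u8,
}

// repr(C): fields stay in declaration order, so padding gets inserted after
// `a` (3 bytes) and after `c` (3 bytes) to satisfy u32 alignment.
#[repr(C)]
struct DeclarationOrder {
    a: u8,
    b: u32,
    c: u8,
}

fn main() {
    println!("repr(Rust): {} bytes", size_of::<Reordered>()); // 8 on current rustc
    println!("repr(C):    {} bytes", size_of::<DeclarationOrder>()); // 12
}
```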
So that helps, because it just keeps structs smaller than they would be in other languages, and you can still write the fields in the order that feels semantically meaningful to you. Good point — things that could help, that come to mind, not necessarily specific to the language: having things like actual development or debug output for how large functions got, in terms of code size, during compilation.
Audience: ...it can see this and basically optimize it down to a very tight loop, because it knows the type; but in other cases that's not possible, and the compiler will generate a huge bunch of code for each of these functions, and in that case it's much, much easier to reduce the code size by using a trait object. And this is quite easy in Rust — and it's not possible, or much harder, in other languages.