From YouTube: Ceph code walkthrough: BlueStore cache autotuning
If you're ready, you can jump into it.

Yeah, I was watching the recordings rather than attending the live sessions. So, okay. Close enough, yeah. Alright, alright!
So I've got this big, fancy-sounding title here, but what it really comes down to is this: I've got some slides that I want to go through as we talk about the code, but really I'm trying to make the OSDs smart. I want them to do smart things, and not force that work onto the user.
I came across a good analogy recently. I've been reading Ender's Game with my kids, and there's a really good quote there: every single one of our ships contains an intelligent human who's thinking on his own, and every one of us is capable of coming up with a brilliant solution to a problem. The enemy bugger fleet, who are kind of like giant insects, only come up with one brilliant solution at a time; they think fast, but they're not smart all over. And in terms of how Ceph works, in so many other ways we have distributed smarts, right? The whole concept behind CRUSH is that you're distributing the decision-making process. There are rules, very strict rules, about how it happens, and it's all deterministic. But rather than have the user come in and dictate every detail,
my thought is that it's much better if users define goals: this is the goal I'm trying to accomplish with this setting, and I don't want to think about it beyond that. I don't want to have to finagle or finesse very fine-grained behavior; I just want it to do the right thing. And here's how I think about what the right thing is.
Right now, the last time I counted, we've got somewhere between 200 and 300 options in options.cc, and at least a number of those (not all of them) are documented and controllable through the Ceph Manager. I assume there's some GUI that exposes at least a subset of these, if not all of them, for users to set.
The one that came up was memory management in the OSD. I actually observed some users struggling with it, especially with the way the caches in BlueStore are defined and how they affect memory usage. The thought in their head was: okay, I've got these nodes that have this much memory available; how do I set all this stuff so that my OSD stays within that and uses the memory effectively? This was about eight months ago.
So the very first thing (well, this isn't exactly true, but one of the early things I started looking at) was trying to just ask: can we make the OSD fit within a certain memory target? Can we make it fairly well behaved? Optimally, you'd actually pre-allocate all of the memory and stay within it that way, and you'd have a lot of guarantees, or at least you'd be better guaranteed to stay close to that memory range.
But we don't have the luxury of doing that right now; it'd be way too much work. So instead, what I started looking at was this: TCMalloc gives us all this information about heap memory usage, and some other useful statistics about what's mapped and what's not mapped, and we've already got code in place to grab all of that programmatically. So I started thinking.
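A minimal sketch of what grabbing those statistics from TCMalloc looks like through gperftools' MallocExtension; this shows the underlying calls, not the actual Ceph perfglue wrapper:

```cpp
#include <cstddef>
#include <gperftools/malloc_extension.h>

// Query the TCMalloc heap statistics the talk refers to, and release
// free memory back to the OS on the same schedule.
void sample_heap_stats() {
  size_t heap_size = 0, unmapped = 0, allocated = 0;
  MallocExtension::instance()->GetNumericProperty(
      "generic.heap_size", &heap_size);
  MallocExtension::instance()->GetNumericProperty(
      "tcmalloc.pageheap_unmapped_bytes", &unmapped);
  MallocExtension::instance()->GetNumericProperty(
      "generic.current_allocated_bytes", &allocated);
  size_t mapped = heap_size - unmapped;  // memory actually mapped by the process
  MallocExtension::instance()->ReleaseFreeMemory();
  (void)mapped;  // the autotuner compares this against the memory target
}
```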
A
Well,
okay,
you
know,
I
have
control
over
the
size
of
all
these
caches
and
in
fact
you
know
we
already
kind
of
set
a
target
value
for
the
caches
in
general
and
blue
store.
So
maybe
we
can
just
start
adjusting
this
up
and
down
to
have
react
to
how
much
memory
the
OSD
is
using
it's
it's
a
little
more
complicated
than
that,
but
it
sort
of
generally
works.
After thinking about this a lot, I came up with a very simplistic algorithm for deciding when the total amount of memory available for caches should increase and when it should decrease. The idea is that we have an OSD memory target: how much memory we want the OSD to use. We also know how much memory has been mapped by the process, and that gives us the ability to define a ratio of mapped memory to target memory.
The idea is that we slowly grow this memory, and as we get closer to cache max we grow even slower, so we're incrementally getting closer and closer to cache max, but that's the limit; we don't want to exceed it. But what if, for unknown reasons, other things come in and the amount of mapped memory becomes greater than the target memory?
A
Well,
in
that
case,
we
instead
look
at
that
ratio,
which
is
basically
just
kind
of
what
gives
you
have
the
ratio
of
overage,
and
then
we
remove
memory
based
on
that,
but
instead
now
we're
looking
at
the
kind
of
the
the
the
previous
cache
size
compared
to
the
min.
So
we
we
bounce
away
basically
and
we
bounce
away
with
kind
of
a
greater
force
than
we
do
when
we're
approaching
it.
So
as
we
approach
we
get
kind
of
weaker
and
weaker
right,
we
don't
want
to.
We
don't
want
to
end
up.
aggressively hitting that cache max; we just want to move closer to it more and more slowly. If we exceed our target memory, then we want to shoot away from it quickly and, again, start growing slowly as we approach it. I'll actually go into the code in a little bit, but this is basically the algorithm that's implemented there, and it actually took a little while to get to this.
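A hedged sketch of the grow/shrink step just described; the names and exact scaling are illustrative, not the real BlueStore identifiers, but it captures the asymmetry (grow gently as the cache approaches its max, shrink hard once mapped memory exceeds the target):

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative autotuning step: returns the new aggregate cache size.
uint64_t tune_cache_size(uint64_t cur_cache,  // current aggregate cache size
                         uint64_t cache_min, uint64_t cache_max,
                         uint64_t mapped,     // mapped heap memory from TCMalloc
                         uint64_t target)     // osd_memory_target-derived budget
{
  if (mapped <= target) {
    // Under target: grow, but more and more slowly as we near cache_max.
    double headroom = double(cache_max - cur_cache) / double(cache_max);
    uint64_t grow = uint64_t((target - mapped) * headroom);
    return std::min(cache_max, cur_cache + grow);
  }
  // Over target: bounce away with greater force, scaled by the overage
  // ratio and bounded below by cache_min.
  double overage = double(mapped) / double(target) - 1.0;  // fraction over
  uint64_t room = cur_cache > cache_min ? cur_cache - cache_min : 0;
  uint64_t shrink = std::min(room, uint64_t(room * overage));
  return cur_cache - shrink;
}
```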
We just talked about the OSD memory target, of course; there are also osd memory base and osd memory expected fragmentation. All these are really doing is lowering the maximum cache size to some ratio of the OSD memory target. We're basically saying: okay, we're estimating some base memory usage in the OSD once it's at a steady state, and then we're expecting some level of memory fragmentation.
But it's not really something that users need to worry about as we account for more things in this system. Right now it doesn't have any knowledge of the RocksDB write-ahead log or the memtables, the buffers that are there; it doesn't know anything about those, or about, say, the PG log right now. So this 768-megabyte number is just kind of made up to take that kind of stuff into account.
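To make that concrete with a back-of-the-envelope reading of these knobs (the subtraction here is my illustration, not a quote of the exact formula in the code): with osd_memory_target = 4 GiB, osd_memory_base = 768 MiB, and osd_memory_expected_fragmentation = 0.15, the aggregate cache ceiling comes out to roughly 4096 - 768 - 0.15 x 4096 ≈ 2714 MiB, which lines up with the roughly 2.7 GiB starting cache size mentioned later in the talk.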
But at some point we could actually factor those in properly, make it so the system knows how much memory is being used for all of these things, and then this expected fragmentation and memory base number could go away entirely. It would just have an idea of what other stuff is potentially using memory.
Okay, so osd memory cache min. That is the minimum total amount of memory that should be devoted to BlueStore caches. If, for example, we have a memory leak somewhere and the autotuner is attempting to fight it, it basically won't go below this amount of memory available for caches. At that point it'll just say: okay, I've set everything to the lowest values, the lowest aggregate amount of memory I'm going to assign to this stuff; I give up now.
I've done my best; we're done. Or, in a scenario where you've got the memory target set really low, say 256 megabytes or something, you can't do much with that. It'll say: okay, I've set the total aggregate cache space down to 128 megabytes; I can't do anything else; the memory usage is going to exceed our target, and that's just the way it is. The resize interval here is how many seconds we wait between basically doing this
analysis of the cache sizes and how much memory is being used. You don't want to set that too low. If you make it too small, it will really make TCMalloc thrash around; it's just not worth it. I've played around with it a fair amount, but yeah, it's just not really useful.
It's not totally perfect, not 100 percent guaranteed to be within our target range, but pretty close. All right, so what we have here is just changing log levels and things, nothing interesting. The CMakeLists change is also not particularly interesting. All right: here in BlueStore.cc we're adding in the perfglue heap profiler so that we can make our calls and find out what we're using.
The idea here is that we still want to allow users to set this stuff themselves if they want to. That's kind of the idea, right: it does things automatically by default, but if a user wants to override it, then they can.
Okay, this is more of the same, nothing interesting there. Here we go; now we're getting into the meat of it. In the cache-sizing code we're grabbing all these options. Basically, these things are being read in the BlueStore code, but then we've got this mempool thread, where we're actually just asking to grab them from BlueStore. So BlueStore is the thing that registers listeners for all this stuff and sets it all up, and then we just grab the values every time here.
Here we're grabbing the different heap statistics that we're interested in. Beforehand, every time we iterate through this, we're also releasing the free memory. There is a way you can do this through the admin socket, but I don't really think people use it. I would much rather do it here, on the same schedule as everything else, and we're not doing it that often, really, so this helps us keep everything well controlled.
We can't force the kernel to reclaim that memory; it does it on its own, as often as it feels like. On Ubuntu, in low-memory-pressure environments, we saw it not reclaiming memory to the point where the amount of mapped memory differed from RSS by like 20 percent, but on our CentOS machines it was very, very close. And I suspect that on the Ubuntu machine
the kernel would probably be more aggressive about reclaiming the unmapped memory if there were a lot of memory pressure, so I think it's probably okay. But the irritating thing is that when users look at the amount of RSS memory used by the process, they will see that it's maybe higher than what they set the target to be, even though the mapped memory is very, very similar. So it's kind of unfortunate, but it is what it is. I'll show some graphs later on.
Okay, so we're defining a new size, and we first set it to the old size, the auto-tuned cache size, and then we're basically just bounding it by the cache min and cache max. So if some of these settings have changed between iterations, this just sets everything so that we're within our boundaries nicely.
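In code, that bounding step is just a clamp; a minimal sketch with illustrative names:

```cpp
#include <algorithm>
#include <cstdint>

// Keep the autotuned size inside [cache_min, cache_max] in case the
// settings changed between iterations.
uint64_t bound_cache_size(uint64_t autotuned,
                          uint64_t cache_min, uint64_t cache_max) {
  return std::max(cache_min, std::min(autotuned, cache_max));
}
```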
There's some other stuff here for dumping stats out and everything. It's really surprisingly simple; there's not a whole lot going on. This is just setting these things; it's not interesting. So, okay, Josh, how much time do I have? Do I need to be done soon, or do we have more time?
Fantastic, okay, then I will keep talking. So that is the PR that introduced the memory autotuning, but we did more than that. Simultaneously, while this was going on, I was trying to think: okay, it's great that we've now got the overall target for the memory usage of the OSD, and people don't have to think about that much anymore, but we still have this issue where no one has any idea what any of the BlueStore caches should be set to.
It's super confusing, confusing even for me sometimes, and that has to do with how our caches used to work and how the RocksDB block cache works, because it's not intuitive unless you actually read the code. They have this concept of a high-priority and a low-priority cache, just in their LRU cache, not in the clock cache, and what is and isn't in the high-priority cache isn't super well documented, and there are a lot of options that govern it.
So anyway, I was thinking about all of this and trying to think: okay, maybe we can make this automatic, because there's no real good default for these things anyway. If you have an RGW workload, you might want a lot of onodes to stay... or sorry, omap entries to stay in cache, but if you're doing RBD on BlueStore, you might want a bunch of onodes to have really, really high priority in cache. But onodes can also be cached in the RocksDB block cache, which
is hierarchical, and theoretically they might be a little smaller there because we're doing varint encoding, but that encoding means there's more CPU overhead to do the encode and decode. So there are tons and tons of variables, and it's impossible for anyone to really figure this out. So what I wanted to start looking at is: can we make it automatic? We use the same idea, where we're periodically looking at different statistics and then adjusting the amounts.
The idea here is basically that you've got some kind of controlling thread that periodically goes through and asks each cache, for each priority: how much memory would you like at this particular priority? This controller asks those questions at each priority, and then, based on how much memory is available and a fair-share ratio for each cache (based on the initial ratios set in the ceph.conf file), it says: okay, you get memory, you get memory, you get memory, based on these things.
If you've asked for more than your fair-share ratio, it keeps you in mind, but says: here's how much you're going to get right now. And maybe you requested a lot while another cache has said, oh, I don't care, I don't have anything, I don't need any memory; okay, then it just gives that one the small amount it wants, or nothing.
Okay, so let's now go through and start dividing up the remaining memory and see if we've satisfied the requests, and we keep doing that at the given priority level until every cache is satisfied or we've run out of memory, one of the two. Let's say every cache is satisfied; great, okay, now we've still got some memory left, so let's move on and repeat this process at the next priority level. So the highest-priority
things first get memory assigned based on their fair share, but as we keep iterating through, they hopefully get as much memory as they requested at that priority level, for a very high priority level, and then we go down to a medium priority level or a low priority level and just repeat this process. So no low-priority items will get fulfilled unless there's memory available, because we give everything what it wants at higher priorities first. All right.
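A sketch of that fair-share loop at a single priority level; the structure is assumed and simplified, and the real PriorityCache interface differs:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct CacheShare {
  double ratio;       // fair-share ratio from the configured cache ratios
  uint64_t request;   // bytes wanted at this priority level
  uint64_t assigned;  // bytes granted so far
};

// Divide mem_avail among the caches at one priority level: each round,
// every unsatisfied cache gets up to its fair share of the remaining
// memory; satisfied caches drop out, and we loop until everyone is
// satisfied or the memory is gone.
void balance_priority(std::vector<CacheShare*> unsatisfied, uint64_t& mem_avail) {
  while (!unsatisfied.empty() && mem_avail > 0) {
    uint64_t round_start = mem_avail;
    for (auto it = unsatisfied.begin(); it != unsatisfied.end();) {
      CacheShare* c = *it;
      uint64_t fair = uint64_t(round_start * c->ratio);
      uint64_t want = c->request - c->assigned;
      uint64_t grant = std::min({fair, want, mem_avail});
      c->assigned += grant;
      mem_avail -= grant;
      if (c->assigned >= c->request)
        it = unsatisfied.erase(it);  // satisfied: out of later rounds
      else
        ++it;
    }
    if (mem_avail == round_start)
      break;  // no progress this round; stop to avoid spinning
  }
}
```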
Okay, so our first pass at this was looking at the RocksDB block cache. They already have this concept of high-priority items; specifically, the bloom filters and the indexes for the SST files are really high priority. You really want those to stay in cache, because even if an item is not in cache, having the bloom filter in cache lets you avoid a lot of reads as you iterate through all the levels of RocksDB.
If you don't have bloom filters, potentially you could look for something in every single level; you'd end up doing reads in every single level to find where exactly in RocksDB it is. But if you have those bloom filters in cache, and if you have a reasonably high number of bits devoted to them, you can say with fairly high certainty whether something is there, and you can avoid, I guess, situations where something wasn't there but you had to go look anyway. So you can avoid a lot of the reads, basically.
So those things are really important: indexes and filters. They also include recent reads, and I didn't actually like that; that was one of the things I didn't want there, because I want more control over what goes into the high-priority cache pool they've defined, and I want to adjust that cache pool, to change its size. They don't really do that in RocksDB typically; the capability is there, but they haven't exposed it as part of their public interface for the cache.
So our first attempt at this was to modify the cache interface to expose all of this, and it worked; that's how it first worked. We submitted PRs to the RocksDB guys, and they sat there for like two months. They were super nice, they were encouraging about all of this, but they were kind of like: well,
we really don't want to modify the interface. And that's totally understandable, right? That's a nice static interface that they've had for a long time, and here we are coming in with this weird custom stuff that most likely we will be the only ones to ever use. So the end result of this first pass was that we still had our own modified version of RocksDB with our own modified public interface, and we used their cache, their LRU cache, inside our implementation.
This is actually in the RocksDB store, the key-value store. We implemented the priority cache interface there (kind of a weird place to implement it, but that's what we did initially), and here you can see we are fulfilling the priority cache requests. The BlueStore mempool thread is the thing acting as the overall controller for all of this, and it goes in and requests from the caches: how much memory do you want at a given priority?
And here, this is exactly what we're doing. We're saying: okay, let's look at what the high-priority pool usage is for priority zero, that's the high priority here, and then we're going to chunk it, so we'll basically push it out to some larger boundary. That gives us some headroom for growth, and then we also potentially have some already-assigned memory here.
So we're only going to request more if what we need is greater than what's already assigned. And then we've got this last priority here. Those are the only two priorities we're implementing at first, and the last one is for everything else: the rest of the usage, not just the high-priority pool usage. Everything else basically gets shoved into the last priority.
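The chunked request logic amounts to rounding usage up to a chunk boundary for headroom and only asking for the part that isn't already assigned; an illustrative sketch:

```cpp
#include <cstdint>

// Round usage up to the next chunk boundary to leave headroom for growth.
uint64_t chunk_align(uint64_t usage, uint64_t chunk) {
  return ((usage + chunk - 1) / chunk) * chunk;
}

// Only request more memory at this priority if the chunk-aligned usage
// exceeds what has already been assigned.
uint64_t request_for_pri(uint64_t usage, uint64_t assigned, uint64_t chunk) {
  uint64_t want = chunk_align(usage, chunk);
  return want > assigned ? want - assigned : 0;
}
```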
There are interfaces like this for the other caches as well, for the BlueStore onode cache and the BlueStore buffer cache, but they don't really do anything yet. All they're doing here is sitting at the last priority, just like the RocksDB block cache. The only thing we're really prioritizing with this kind of prototype implementation is making sure that the indexes and filters are always being cached.
The one thing it does do, though, is that whole process I talked about, where we iterate through the different priority levels, ask things what they want, assign memory, and then give things out based on the ratios. If something doesn't request memory, if, say, the buffer cache isn't requesting memory, then it's still the case that the onode cache might get more of it. We had some kind of weird behavior like that already in BlueStore, but it didn't work quite right.
It actually had some bugs in it, and the ratios didn't really look the way they were supposed to in previous versions of Ceph. So we almost had this situation before where we told the user, you need to figure out what these ratios are, but then we almost overrode those: sort of used them, but sort of made our own decisions.
It was kind of like you had worker bees, but the worker bees were insubordinate and didn't do what they were told. So the end result of all of this stuff: if you look on the right here, that's the auto-tuning-disabled behavior (and actually the correct auto-tuning-disabled behavior, not the weird hybrid behavior we had before), where we've specified specific cache ratios for the different caches and we're not using the memory autotuning that we implemented and talked about earlier on. There the RSS memory usage of the process is quite high.
A static three-gigabyte BlueStore cache size has been set, and then all of these different things are using up to their ratios here. Right? You figure out what 40% of three gigabytes is, or 20% of three gigabytes, and they're sticking to that. In our auto-tuned case, though, we've specified that we want a four-gigabyte OSD memory target, so they're sort of comparable, right? You think: well, okay, I set a three-gigabyte BlueStore cache size;
maybe that makes my OSD process around four gigabytes when you count the other overhead and things. But in this case we've just set the target, and it's figuring out what the cache size should be. So instead it starts out around 2.7 gigabytes, based on our osd memory base and fragmentation values, but then it turns out that's too much, we've gone above the target, and so now it's dropping the amount of cache memory down and adjusting it to below 2.7 gigabytes.
It ends up maybe more around 2.5 or 2.4 or something; it's variable. You'll see that the mapped memory, in light blue, is very, very good: very close to around 4 gigabytes. It sometimes spikes over a little bit, but it's staying really close. The RSS memory usage is high; this was our Ubuntu node, where the kernel was not aggressively reclaiming unmapped memory, and so the RSS memory usage of the process ended up being higher.
I suspect that if we had memory pressure, the kernel would be more aggressive about reclaiming that, so those would probably be closer. On CentOS we see them much closer; at least in our test lab, the CentOS kernel settings appear to be far more aggressive about reclaiming unmapped memory, even when there isn't a significant amount of memory pressure. You can also see here that very initially, the amount of memory allocated for buffers grows really quickly.
That's the little green line; it spikes up because when we're doing writes like this, initially you end up with a lot of buffers that you could potentially cache, and that's getting the majority of the memory at first, through the iteration process I described. But as time goes on, we see that the number of onodes that have been written increases dramatically, and so they start to get a higher proportion of the cache, and over time
there's some value in caching that. At the same time, the amount of data in the buffer cache and the amount of data in the onode cache end up decreasing over time, but recover when we have these spikes. These spikes, I think, are maybe when RocksDB is doing compaction and storing the SST files in the block cache, so all of a sudden there's this huge request for more memory, and this is the autotuner,
basically this whole cache priority system, responding to that, saying: oh hey, look, all of a sudden we've got more data coming in, probably both at the high priority level, which is this kind of continual growth, and at the lower priority level, which is the spike here. But because we've said that we want a target of 40% for the meta cache,
sorry, the KV cache, because we've set that one high, it gets a higher percentage of the memory in the first round and the others drop down. But then, potentially, as those SST files end up being evicted from RocksDB's cache, it doesn't need as much; it says, I'm not requesting nearly as much, I don't want it, I don't care, and then the buffer cache and the onode/meta cache end up getting more of it.
That was an RGW case. This is kind of similar here for our RBD case. Here you see that the amount of data in the key-value store, in RocksDB, is much lower. This is kind of a simplistic view of it, because in this case we had enough memory that we were able to fit all of the BlueStore onodes in cache.
You can kind of see here that it stabilizes right around one gigabyte, and then there's plenty of memory for additional metadata and for everything, basically. It turns out that the data cache ends up basically just getting... oh, I'm sorry, I got this backwards: the onode cache is the yellow line that reaches the steady state.
So we start out with this kind of global (not global, but this mem-avail value), and then we go through for each priority; that's what we're doing here, running this balance-cache-priority method. Eventually, after we've gone through all the priorities, if there's any memory left after all of that, then we just divide it up based on the ratios that have been set.
So if nothing wants cache, that just ensures that everything gets some share of the memory. But then, in the balance-cache-priority method, at a given priority level, we've passed in the amount of memory that's available, and now we're looping through and doing that process I described earlier, asking each thing what it wants.
And then, if something wants more than its fair share, we give it its fair share but keep it in mind, and go through and do this for everything, right? So each thing just gets up to its fair share to start out with, and if that satisfies it, then we remove it from the list. Otherwise we keep it around, go back through, and do this process again with whatever memory is available, until all of them have been removed or until the memory runs out. So that's kind of it.
That's how this works right now, and this is more or less what's in master. There's one other commit in master that basically goes through and... well, I had mentioned that the RocksDB guys didn't like us changing the interface for the cache. So what we did instead is we switched back to upstream master, took the RocksDB LRU cache code, and implemented it in Ceph itself. We basically took it, imported it, and now we can use the interface the same way.
One thing that changes there is whether or not recent read hits are promoted into the high-priority cache, and that's not actually useful for us right now. We don't want that, specifically because what's coming in the future, in another PR, is a system of not just two priorities, not just low and high, but an arbitrary number of priorities. By default
it's set to 10, but we can stratify all of this and make decisions about whether something at priority level 1 in the onode cache is more important than something at priority level 2 in another cache. We do this through a system of age bins that get added to all the different caches, which we don't have time to talk about here. That code is not actually in master; it's in another PR that unfortunately failed QA due to a segfault, but it mostly works.
Actually, when it doesn't segfault, I guess, it's really neat, because then you can have the OSD dynamically reacting to the kind of I/O that's happening. Maybe onodes are more important; maybe omap is more important; maybe there's some trade-off between these things. Maybe you still always want the KV cache to absolutely get the indexes and filters cached, so that's always high priority, but then other things can shift around.
It's really neat to see it work. There is one problem with all of this, though: when you get aggressive about doing this, it really makes TCMalloc kind of unhappy. It wants to shove like-sized things into bins so that it can make good allocations with them, and all of a sudden we're saying this set of things is getting more memory, and now we're changing it over here and doing other things. So this has to be kind of a slow process.
We can't do it quickly. I think it's one second for changing the amount of memory that's available in general, and five seconds for this process of renegotiating how much memory should be devoted to each individual cache. So let me just check and see if there's anything else here to go over. Okay, yeah: we just talked about this, where we pulled the RocksDB cache in ourselves.
RocksDB came with their own custom small-vector-like implementation, and we can just standardize on ours now and use it in multiple places if we want, or actually use small vector itself, or whatever. They also did some kind of shady things with memory allocations. If you've ever heard of the struct hack in C: it lets you basically do a malloc for a larger amount of space than the struct takes,
and then at the end of your struct you have an array that you walk off the end of, but you can do it because you've got malloc'd space for it. It's really evil, but there's actually a legal way you can do this in C, though not in C++, and they're doing it the illegal way in C++. Basically, it works with GCC; with other compilers it might work, but it's definitely not valid C++.
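For readers who haven't seen it, this is the pattern being described. In C99 the legal spelling is a flexible array member; the same thing in C++ is only a compiler extension, which is the "illegal way" being referred to. A sketch for illustration, not the RocksDB source:

```cpp
#include <cstddef>
#include <cstdlib>

// The "struct hack": one allocation holds the header plus n trailing
// bytes, addressed through a trailing array.
struct Entry {
  size_t len;
  char data[];  // flexible array member: valid C99, non-standard in C++
};

Entry* make_entry(size_t n) {
  // Over-allocate so data[0..n-1] is backed by real storage.
  Entry* e = static_cast<Entry*>(std::malloc(sizeof(Entry) + n));
  e->len = n;
  return e;
}
```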
So we changed that. It might be a little slower, but it's definitely nicer to work with now, and it was necessary to be able to do the things we're doing where we're modifying the caches in the future for this age binning. So yeah, I think that's basically it. Oh, also, there are some other things coming: automatic chunk sizing for doing the chunked allocations.
That basically means that the more memory you have, the bigger the chunk sizes it makes, so you're not just allocating tiny little chunks if you've got tons of memory available. And then there's also a new PR that's moving the cache balancing code out of BlueStore into a kind of generic priority cache manager class.
I would really like, even if it's not that useful, to make this OSD-wide, for FileStore and BlueStore, and maybe potentially do something like this with the priority cache manager in the mon and other things, so that users get away from this idea of setting, okay, I'm going to have a 512-megabyte cache for RocksDB in the mon. Well, no, let's not do that; they don't know how much they should assign.
Instead, maybe we just say: okay, the mon should use four gigs of memory, or two gigs, or whatever, and we just handle it behind the scenes. Then the user knows: okay, this is the envelope the mon is going to try to fit into; the cache gets adjusted based on whatever else is happening, and we just go from there.
I think making it all universal and very straightforward for the users is kind of the end goal of all this. And if this all works, if this works really well, then maybe we can start thinking about things like doing similar stuff for disks. That's more complicated for sure, and the act of migrating data around is definitely very impactful, but I'd love it if we just handed OSDs, or daemons in general, a chunk of hardware (memory, disks, whatever, maybe cores) to operate on and said: you figure it out.
I mean, all it really does is affect how much you evict, right? When you're going through and doing cache eviction, you've maybe got less stuff that you need to evict, because you've got a little bit of headroom, so you don't evict until you hit that boundary. With the automatic chunk sizing, all it really does is say: okay, if you've got lots of memory, give it more headroom; that's what it comes down to. Or if you've got, like, nothing,
no memory available, then make a really small chunk size so that we can fit as much into our space as possible. So I guess at small memory values you're not going to have quite as much waste as with a static chunk size, and at large values it gives you a lot more headroom, so eviction doesn't start happening as early. Maybe you stay ahead of eviction if you've got room for growth, and you only start really evicting once you've gotten to the point where you're so saturated with everything that you have to. I don't know; it's kind of neat, it's useful, but I don't know that it necessarily makes a huge difference in the long run. Maybe it does, I don't know. Okay, does that make sense?
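A hedged sketch of what that automatic chunk sizing might look like: scale the chunk with available memory and clamp it, so small systems waste little and big systems get headroom. The fraction and bounds here are invented for illustration:

```cpp
#include <algorithm>
#include <cstdint>

// Pick a request-chunk size proportional to available memory, clamped
// to sane bounds. Small memory: tiny chunks, minimal waste. Lots of
// memory: big chunks, more headroom before eviction kicks in.
uint64_t autotune_chunk_size(uint64_t mem_avail) {
  uint64_t chunk = mem_avail / 256;            // illustrative fraction
  chunk = std::max(chunk, uint64_t(1) << 20);  // no smaller than 1 MiB
  chunk = std::min(chunk, uint64_t(1) << 28);  // no larger than 256 MiB
  return chunk;
}
```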
But yeah, one thing I'd like to do, too, once I get the next piece working right and into master: I think it'd be good to do another one of these, because the next piece is the really cool piece of work, where we're actually modifying the caches to do this age binning. That's kind of what makes everything really work. The stuff in master right now more or less works, but it's really simplistic; there are basically just two priority levels, and one of them is only used for the indexes and filters.
This stuff is all, you know, digging into the guts of how BlueStore works. I would look at how CRUSH works and at the overall design behind the monitors and the OSDs, probably. Josh, do you think Sage's paper would still be useful to read? Yeah.