From YouTube: 2018-May-17 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
A: All right, I'm moving on. So let's see, this week for pull requests there's not a ton of new stuff. The only thing that I noticed in my list this morning was the PR I had actually submitted as kind of a first step toward doing priority-based caching in BlueStore and, I'm hoping, maybe eventually the OSD. The idea here is... well, maybe I'll get to that later if no one else has anything they want to talk about.

The basic idea is that it's really difficult in master right now for users to set ratios of how memory should be used for different things in the OSD, and specifically in BlueStore. Right now we let you specify roughly how much memory should go to metadata in BlueStore, data in BlueStore, and RocksDB's block cache, and even inside RocksDB's block cache you potentially have the option of specifying high-priority and low-priority pools, meaning how much memory you want to devote to guaranteeing that indexes and filters remain in cache. So we allow you to set at least some of those ratios, and beyond that we also allow a minimum value to be set for the key-value block cache. What we found out was that, essentially, when users try to do this it's really not clear at all how all of this interacts.
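To make that interaction a bit more concrete, here is a minimal sketch of how a fixed total cache might get carved up by such ratios, with a floor on the key-value block cache. This is only an illustrative model of the arithmetic being described, not BlueStore's actual code; the 3 GiB total, the 40/40/20 split, and the 512 MiB minimum are made-up example values.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

int main() {
  // Example values only; in Ceph these come from config options.
  const uint64_t total_cache = 3ull << 30;    // 3 GiB per OSD
  const double   kv_ratio    = 0.40;          // RocksDB block cache
  const double   meta_ratio  = 0.40;          // BlueStore onode/metadata cache
  const double   data_ratio  = 0.20;          // BlueStore data (object) cache
  const uint64_t kv_min      = 512ull << 20;  // floor for the KV cache

  // Static carve-up: each cache gets its ratio, but the KV cache is
  // never allowed to drop below its configured minimum.
  uint64_t kv_bytes   = std::max<uint64_t>(total_cache * kv_ratio, kv_min);
  uint64_t meta_bytes = total_cache * meta_ratio;
  uint64_t data_bytes = total_cache * data_ratio;

  // If the KV minimum pushes the total over budget, the other caches
  // have to shrink proportionally. Interactions like this are part of
  // what makes the settings hard for users to reason about.
  uint64_t remaining = total_cache - kv_bytes;
  uint64_t others    = meta_bytes + data_bytes;
  if (others > remaining) {
    meta_bytes = remaining * (double(meta_bytes) / others);
    data_bytes = remaining - meta_bytes;
  }

  std::cout << "kv="   << (kv_bytes   >> 20) << " MiB, "
            << "meta=" << (meta_bytes >> 20) << " MiB, "
            << "data=" << (data_bytes >> 20) << " MiB\n";
}
```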
A
We
actually
had
have
currently
in
master
a
bug
where
we
tend
to
favor
the
metadata
cache
over
the
data
cache
kind
of
beyond
what
the
user
specifies.
It's
not
always
a
bad
thing,
because
actually
it
turns
out
that
favoring
may
take
a
cache
kind
of
helps
performance
in
a
lot
of
scenarios
and
usually
doesn't
hurt
it
even
when
you
might
expect
it
to
so
it
it's
it's
kind
of
inadvertently
doing
a
better
thing,
but
it's
not
doing
what
the
user
requests.
So
it
ends
up
being
really
really
confusing
why
things
are
set
internally.
A
Now
where
you
can
enable
Auto
tuning
and
then
kind
of
through
a
a
more
complicated
scheme
it
will,
it
will
try
to
do
make
better
decisions
about
where
memory
should
be
assigned
and
then
eventually
revert
back
to
those
user-defined
ratios.
If
it
can't,
if
it
doesn't
think
it
can
do
better.
So
that's
that's
it
in
a
nutshell.
This async messenger PR from Haomai that improves the locking behavior merged, I believe, and that's potentially really good for latency-sensitive scenarios. I don't know that we've done any kind of extensive testing on it, but at least the testing that was done there makes it look like it's a good PR. Then there was this other one, continued recovery optimization for overwrite ops, that did not merge; it was just closed by the author, and I'm not entirely sure why.

But I guess for whatever reason that was closed. A couple of PRs were updated: Igor's new bitmap allocator, which he presented last week. Sage reviewed it, and it looks like he really likes it and wants to replace the old bitmap allocator code with Igor's new implementation. So that's good; expect at some point soon that we will have a new bitmap allocator that works far better than the old one. That's exciting.
There is this librbd throttle PR. I don't really know too much about it, but I guess it must have gotten updated; I should probably figure out what's going on with that. Anyway, I don't know too much about that one. Then there's Radoslaw's work on the crypto SSL PR, which is going through testing and getting fixes, and it looks like it's making progress, so that's good; I think a user is involved in testing it. There's another one from Piotr about reducing bufferlist rebuilds during write-ahead log writes, Radek's huge pages PR, and more stuff from Adam (sorry, Kefu, I guess, in this case) regarding AES and crypto. That's about it; I don't see anything else really recent here. So that's it for PRs. Does anyone have anything that they would like to discuss this week?
Yeah, okay, cool. So I kind of gave an overview of the problem earlier on: right now it's really complicated for users to adjust all of our cache settings. It became clear when we went through some of this with one of our customers that, beyond just the fact that it's kind of confusing, it's really difficult, even if you know what all of the different ratios do, to figure out what they should be set to, and it changes depending on the workload. Everything that I have seen indicates that for, say, an RBD workload, if you can keep everything, all of the onodes for BlueStore, in the metadata cache, that is the number one priority. It's less clear what happens once you can't do that anymore.

If not all the onodes fit into cache, maybe you're actually better off doing a full swap over to RocksDB's block cache, because theoretically you can keep all of the onodes in the block cache in an encoded form, rather than reading them directly out of memory with BlueStore's onode cache. The trade-off is that when you're reading from RocksDB's cache you have a lot more work to do: you're copying memory around, you're doing an encode step to put them there, and you're doing a decode step to get them out.
So at least if you can keep everything in cache, that's what you want to do. There's kind of a bad in-between state where it looks like, if you don't have enough metadata cache and you are doing reads from disk, we end up double caching a lot of data: you end up putting the exact same data that is in BlueStore's metadata cache into RocksDB's block cache. It's not clear whether or not you get some added benefit by doing that; sometimes maybe you're hitting stuff in RocksDB's cache that you're not hitting in BlueStore's metadata cache. I haven't really gone through and done an exhaustive look at the hit rates to determine that, or even an onode-level trace of where it's hitting in which cache, but just from the performance results and from the behavior it's showing, it looks to me like, generally speaking, it doesn't work that way.

It looks like either you're hitting BlueStore's cache or you're fetching from disk, or at the very least the overhead of getting things out of RocksDB's cache is so high that fetching from RocksDB's cache is not really much better than just fetching from disk in the first place. So that is the conundrum right now: what do we do in all these scenarios?
The good news is that we can at least avoid some of it by saying: well, if we aren't using a lot of KV cache yet and we have memory available (actually, if we're not using any of these caches up to the ratio that we specify), then potentially we don't need to allocate that much memory to any of them, and we can see what memory we have left before we move on to the next set of priorities. So in that way we can say, for high-priority items: if we want more metadata cache beyond, say, the one third or the forty percent of the cache that we specified in the ratios, let's give it to the metadata cache before we give lower-priority items a shot at that cache. That's basically what this does; right now in this PR we don't really make use of those priorities very well.
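As a rough sketch of the balancing pass being described, the loop below gives every cache what it wants at the highest priority first (for example, its observed usage plus a little headroom) and then splits whatever is left according to the user-defined ratios, so the tuner falls back toward the requested split when it has nothing better to go on. The struct, the two-pass shape, and the numbers in main() are assumptions for illustration, not the code from the PR.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical model of one cache being balanced (KV, meta, data, ...).
struct CacheState {
  const char* name;
  double      user_ratio;    // ratio the user asked for
  uint64_t    pri0_request;  // bytes wanted at the highest priority,
                             // e.g. observed usage plus some headroom
  uint64_t    assigned = 0;  // what this balancing round gives it
};

// One balancing round over all caches sharing a single memory budget.
void balance(std::vector<CacheState>& caches, uint64_t total) {
  uint64_t left = total;

  // Pass 1: satisfy high-priority requests, capped by what is available.
  for (auto& c : caches) {
    uint64_t give = std::min(c.pri0_request, left);
    c.assigned = give;
    left -= give;
  }

  // Pass 2: split the remainder by the user-defined ratios. Anything
  // not handed out stays free as headroom for the next round.
  const uint64_t remainder = left;
  for (auto& c : caches) {
    uint64_t give = std::min<uint64_t>(remainder * c.user_ratio, left);
    c.assigned += give;
    left -= give;
  }
}

int main() {
  // 3 GiB total, 40/40/20 user ratios, with only the meta cache asking
  // for a large high-priority chunk at the moment.
  std::vector<CacheState> caches = {
      {"kv",   0.40,  256ull << 20},
      {"meta", 0.40, 1536ull << 20},
      {"data", 0.20,  128ull << 20},
  };
  balance(caches, 3ull << 30);
  for (const auto& c : caches)
    std::cout << c.name << ": " << (c.assigned >> 20) << " MiB\n";
}
```

In the PR, something like this would presumably run periodically rather than once, with the high-priority requests refreshed from live usage statistics each round.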
So let's take a look here. This is essentially the old behavior (well, the current behavior of master) under an RGW workload. You can see that we have a couple of different ratios set here: 40% KV, 40% meta, 20% data. So RocksDB's block cache is getting forty percent of the three-gigabyte total cache, the BlueStore metadata cache for onodes and onode-related stuff is also getting 40%, and the data cache for objects in BlueStore is getting twenty percent.

In reality, that's what we've allocated, but that's not actually what gets assigned. Right now in master, BlueStore tries to do some kind of auto-tuning itself, where you can see the yellow line for data actually spikes way up at the beginning and then drops back down, and at the beginning we start out without much meta cache, but it spikes up. If you notice, the amount of data cache that is allocated over the course of this run is actually larger than the amount of data cache used; that's the bug I was talking about, where we tend to not actually assign as much memory for the data cache as the user specified, and we over-allocate memory for the metadata cache versus what was requested.
Okay, in master... or sorry, with this PR, if we have the auto-tuner disabled... sorry, we're switching to RBD now. Maybe I'll actually go down and show the RGW behavior, because that was what we had just talked about, and then we'll go back to RBD. Okay, so for the RGW behavior with the auto-tuner disabled, it's really similar to master, except that we fix this issue where we're under-allocating data and over-allocating metadata. So again, you can see really similar behavior, and it's hitting the user-specified ratios. We are not doing any kind of auto-tuning at the beginning where we take some of the metadata cache, assign it to the data cache, and then have it shrink back down; that has now moved into this auto-tuner design, so that if you specify that you want auto-tuning it does that, but otherwise it just does exactly what the user requests and doesn't fool around with it.
One thing I will note, which isn't real apparent here yet but you'll see in the RBD results, is that we're actually double caching a lot of the same metadata and onode data in the meta cache and in RocksDB's block cache; a lot of the same data is basically populated in each. Okay, so now with the auto-tuner enabled you see really different behavior. Again, you see that the data cache spikes way up, but as opposed to master we're not just doing this for BlueStore's caches.

With a couple of changes to RocksDB's cache, we are able to expose some of the information that it has internally about the high-priority pool and the low-priority pool and how much memory is being used there. So now we can actually try to allocate just a little bit more memory than is used at any given point for these, and that's kind of what we do.
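For reference, the high-priority pool being referred to is a feature of RocksDB's LRU block cache: a fraction of the cache can be reserved so that index and filter blocks are evicted after ordinary data blocks. The snippet below shows the stock RocksDB options involved; how BlueStore actually wires this up, and the extra patches that expose the pool usage to the tuner, live in the Ceph tree and are not reproduced here, so treat the specific numbers as placeholders.

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Configure RocksDB so index/filter blocks live in the block cache's
// high-priority pool instead of separate table-reader memory.
rocksdb::Options make_options() {
  rocksdb::BlockBasedTableOptions table_opts;

  // Count index and filter blocks against the block cache budget...
  table_opts.cache_index_and_filter_blocks = true;
  // ...and insert them with high priority so data blocks are evicted first.
  table_opts.cache_index_and_filter_blocks_with_high_priority = true;
  table_opts.pin_l0_filter_and_index_blocks_in_cache = true;

  // 1 GiB LRU cache, default sharding, no strict capacity limit,
  // 20% of the capacity reserved as the high-priority pool.
  table_opts.block_cache =
      rocksdb::NewLRUCache(1ull << 30, /*num_shard_bits=*/-1,
                           /*strict_capacity_limit=*/false,
                           /*high_pri_pool_ratio=*/0.2);

  rocksdb::Options opts;
  opts.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  return opts;
}
```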
You can see that for the key-value line here, the blue line, we're essentially always allocating just a little bit more than is used over time, until we get to the saturation point, at which point everything hits the allocations that we specified. We're also fully utilizing the cache much sooner than we did previously.

One of the things you'll notice too is the crossover point, around one hour in. Previous to that, we were assigning more metadata because we had it available: both data and metadata are able to use memory that was assigned to the key-value store, to RocksDB's cache, but because there wasn't that much memory being utilized there yet, we could give it to the metadata cache and the data cache.

So it's a little bit different, and it's not entirely clear that we want to favor the block cache in the key-value store beyond making sure that those indexes and filters are always cached. It might be that that's a really high-priority thing, but beyond making sure those are cached, we're still better off giving the rest of the memory to the BlueStore onode cache.
And you can see that too when we specify different ratios here: we're sort of keeping to the user-defined ratios once we hit cache saturation, except that, again, the RocksDB indexes and filters are getting the first shot at memory and growing over time, such that we're sort of sticking to our ratios but we're also letting the auto-tuner make sure that those things are cached with high priority.

In terms of performance, unfortunately none of this really affected RGW performance at all. There was a little bit of change, but it's such a small amount that it's possible it was just random variation. I'm curious as to whether or not we have enough other latency in the stack that the effect of reading from RocksDB's block cache, or from BlueStore's onode cache, is really not significantly different from just reading off of the NVMe drive (or sorry, in this case the SSD drive) directly.
B: [inaudible]

A: ...that's part of why we're reverting back to the user-specified defaults. The next step in this, I think, is going to be really interesting, where we now look at recency-based binning of the cache items. So, say, in the last five seconds we treat the KV cache and the metadata cache with high priority and try to assign everything that's wanted there, and then we do these steps where we go through and look at older and older stuff and try to divvy it up. I think that will be really important here, because then it will start telling us how much of the recent activity is KV (you know, block cache items) versus BlueStore metadata items, and maybe we'll start being able to divvy things up a little bit better. I think another really interesting thing to do would be to differentiate within the different block caches.
B: [inaudible]

A: You know, right now the auto-tuner is working at five-second resolution; every five seconds it goes through and rebalances the caches. The next step in this potentially could add some overhead to trimming the cache, which is a little more scary, but the auto-tuner itself, as long as you're not having it do work too often, is hopefully not going to impact things too badly. Especially with hard disks, since the amount of work being done is so much lower, you could potentially even do this more often and have it not really hurt things too badly.
Okay, let's look at RBD. This is kind of the more fun one. So, okay, RBD behavior with auto-tuning disabled, with this PR. I did not grab graphs of master, unfortunately; I thought I had, but I didn't grab the data about what it was doing.

So I don't have master, but essentially it's going to look a lot like this, except with that bug that we've mentioned, where we're not quite assigning as much data cache as we should and we're over-assigning metadata cache; otherwise it should look really similar to this. What you're seeing here at the beginning is a prefill stage, where we are filling up the RBD volume with four-megabyte objects just to make sure that there's data already allocated there.

That's that initial slope in the metadata used, and in the overall total used, at the beginning here. At some point, maybe seven hundred seconds in roughly, we actually hit the limit of how much cache we can assign to the metadata cache (how much memory you can assign to the metadata cache), and so now things are getting swapped out. That's where the meta-used line goes horizontal.
Once we've totally filled up the volume, then we start random writes, and all of a sudden you see that KV used spikes way up. That is because we are now doing reads from disk of some of the onodes, and it's populating both RocksDB's cache and BlueStore's onode cache, and so you see the total used spikes way up and now we're utilizing all of our memory. But in reality we're actually double caching a lot of stuff: there's a lot of data in RocksDB's KV cache which is basically just the exact same data that's in BlueStore's meta cache, and it's just not being accessed, because any reads are coming from the meta cache. Theoretically, maybe some of the reads can hit the RocksDB cache rather than going to disk. The data that's in the block cache should be smaller, because it's encoded using the varint encoding versus BlueStore's in-memory onode, but it doesn't seem to really have a positive effect.
The line there, around probably the 4,500 or 5,000 second mark, is when we're switching over to doing random reads. The amount of memory in the caches doesn't change dramatically at all; there's a little bit of a dip down, but nothing really there. All you can really infer from that is that that's where the random reads are starting. Okay.

So what happens, then, if we don't use the auto-tuner but instead specify that we want a lot of the cache devoted to metadata cache? That works really well. That's actually what we've kind of ended up doing in the last year or two, devoting most of the memory to metadata cache, because performance results were good when we did that. The reason for that appears to be that, at least in a lot of these tests that we've done so far, where we've got maybe 256 gigabytes or even 512 gigabytes of RBD data, we never get to the point where we start trimming data from the meta cache. We make it all the way up through the prefill stage and into the random write and random read stages, and the meta used remains constant. And the sign that will tell you that we're double caching above is that the KV cache in that scenario, the block cache in RocksDB, never grows; it stays really, really small, hovering right around 1% of the total cache.
We never actually use that space, though, that could have been assigned; we've over-allocated the meta cache. We only ended up needing around 51% of the full cache to cache onodes, but we overdid it in this particular scenario. If we had had 512 gigabytes of RBD blocks to cache, then we would have actually under-allocated a little bit, but at least in this exact scenario we've allocated more than enough and we're not utilizing any of that extra memory that we allocated.

Nevertheless, in this particular test (the graph that represents this test), we saw higher performance than the previous one, where we split the ratios more evenly, and we're using less memory to do it; but we're not really effectively using the memory that the user specified. All right, so now what happens with auto-tuning enabled? Well, with auto-tuning enabled we're making much, much better use of the cache right away.
The amount of metadata that's needed for the meta cache to cache onodes grows as the number of onodes grows. So as we write out blocks to RBD, that grows and then stays constant once we've saturated it. The amount of data cache spikes way up at the beginning, as we have plenty of available memory, and then drops back down, but it drops down to a ratio much higher than we specified: the user specified 20%, but the auto-tuner knows it can give it more.

So it actually ends up giving it around 42%, and KV doesn't grow because we never enter a state where we need to double cache. If we were writing out enough RBD data we would, and then we'd enter a scenario where we'd end up having to revert back to the user-specified defaults again, with the exception that we'd be caching indexes and filters at high priority. Right now that's the behavior, but we can do much better than this PR in the future.
Using the priority scheme, I believe (and you'll notice this in the next graph, where the user has specified very, very different ratios, again: 10% KV, 85% meta, and 5% data), the auto-tuner converges on the exact same solution. At the very beginning, if you look right around time 0 up to maybe time 100, the ratios are very, very different; that's kind of an initial idle state where we're just recording data about what the cluster is doing before anything is written out. Those ratios are the user-defined ratios; it's reverting back to those.

It's not really doing much differently; it's slightly off, but it's pretty close. In the previous graph those ratios again are roughly what the user defined, just slightly different but really close. But then, as soon as we actually start writing any data out, it quickly converges back to the same solution. So I'm excited about that; it's telling me that this is trying to be smart about making sure that the cache is well utilized.

So personally I like this; I really like seeing this behavior. And the RBD performance looks pretty good too. The worst-performing solution by far is when we have tried to divvy up the cache equally (well, not equally, but equally between the block cache and BlueStore meta, and then giving the data cache some, but not quite an equal, share versus the other two). It does not do particularly well.
Otherwise I don't think there's actually much difference here; it's probably just random variation. I suspect that all three probably perform roughly equally when onodes are always in the meta cache. The 99th percentile latency improves too; by far the biggest difference was when we were doing onode reads from disk, or from RocksDB's cache, versus when we weren't, and the auto-tuner does a good job of avoiding this. We avoid it, at least in this particular scenario. So that's kind of RBD. It looks like all of this is helping RBD more than it's helping RGW right now, but I think, especially as we improve RGW behavior with Beast and maybe look for other areas where we have bottlenecks, my hope is that this will start showing more of a performance improvement with RGW too, like it is with RBD.
B: For RGW, this kind of reminds me of when we were talking about caching in BlueStore in general before, and how you were able to get some results with local block device caching. We'd seen some good results from the Intel data caching layer, whatever it's called, but that was even with RGW. I'm wondering if that's because it was using a faster drive for the RGW objects, so there was enough, or small enough, metadata.
A: That's just a guess, but I have a feeling that probably plays at least a moderate part in why they saw those kinds of results. I'd be really curious how much their software helps BlueStore, especially if you're already putting onodes, or sorry, omap data, on the SSD, you know, with the DB on the SSD.
B: Yeah, that's maybe pretty interesting. I think when we discussed this before, we were talking about how perhaps the local block device caching wouldn't show the same effects once BlueStore's caching was more intelligent, because the major effect was from caching that metadata; if you do that explicitly, maybe a block device cache isn't effective anymore.
A: If you had kind of enough data cache, which is pretty common, right, to cache all the data that's being written or read for really hot things, maybe that would be a benefit. But I guess when you think about, say, RGW bucket indexes, that's already omap, right? So it's already going to end up in the DB, already on the SSD. So maybe.
B: I mean, the kernel interface typically will have a file system put on top of it, so it still ends up using the local page cache on the client there. Yeah, I think maybe there are some kinds of cases with RGW where you don't have any... I mean, you could put something in front of it, like in a CDN sort of situation, perhaps.
B: [inaudible]

A: So right now it over-allocates a little bit to give itself some leeway to grow things. That's why, in some of these graphs, it's not quite getting fully up to the hundred-percent ratio: it's always leaving itself a little bit of room to grow in case, say, the KV used starts going up; it always wants to stay ahead of that if it can. So that's why it's left a little bit of the cache unutilized for future growth. That's one caveat.

There's some really weird behavior in RocksDB where, by default, RocksDB will grow the block cache beyond the user-specified limits, especially during compaction. The alternative is that you can disable that, but then it can make operations fail, which I think we probably can't do currently. So the gist of it right now is that RocksDB may temporarily use more memory than it's been assigned, especially during compaction. What's a little odd is that when that happens, it appears to completely flush the high-priority pool, so all the indexes and filters get flushed.
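The trade-off being described corresponds to the strict_capacity_limit flag on RocksDB's LRU cache: left off, the cache can temporarily exceed its capacity (for example while compaction pins blocks); turned on, inserts that cannot make room fail instead, and every caller has to handle that. A minimal illustration with the stock API, assuming the same 1 GiB / 20% high-priority setup as above; whether Ceph could actually tolerate failing inserts is exactly the open question from the call.

```cpp
#include <rocksdb/cache.h>
#include <memory>

// Default behavior: the block cache may temporarily grow past its
// capacity, e.g. while compaction pins blocks, so memory use can
// exceed the configured budget.
std::shared_ptr<rocksdb::Cache> lenient_cache =
    rocksdb::NewLRUCache(1ull << 30, /*num_shard_bits=*/-1,
                         /*strict_capacity_limit=*/false,
                         /*high_pri_pool_ratio=*/0.2);

// Strict limit: the cache never exceeds its capacity, but inserts that
// cannot evict enough to make room return an error instead, which the
// caller (including RocksDB's own internals) must be prepared for.
std::shared_ptr<rocksdb::Cache> strict_cache =
    rocksdb::NewLRUCache(1ull << 30, /*num_shard_bits=*/-1,
                         /*strict_capacity_limit=*/true,
                         /*high_pri_pool_ratio=*/0.2);
```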
So, at any rate, that's what it appears to do. In this PR there's some code to try to work around that: if all of a sudden the RocksDB cache shrinks dramatically, especially if the usage in the high-priority pool shrinks dramatically, it doesn't just shrink it to the new value. Instead, the RocksDB priority cache layer in this will say: well, I recently saw usage this high, and there's some memory here, so let it regrow quickly if it needs to, and then we're back close to where we were before, once all the indexes and filters are populated in cache again. So we're working around it right now, but I think ultimately we need to understand why RocksDB does this, and whether it needs to do this or whether it's just kind of an esoteric behavior that doesn't really make any sense.
For future work, I've talked a little bit about this binned LRU approach. I've got kind of a prototype of it in the works that right now reports, or bins, the different usages. It was pretty easy to implement, so I'm playing around with that. It does not appear to have significantly high overhead; it does add some overhead to trimming the cache, but it's basically just incrementing and decrementing a couple of uint64 values with smart pointers to them.
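A very rough guess at the shape of that prototype: each cached item keeps a shared counter for the age bin it was last touched in, touch and trim just bump that counter, and a periodic pass (say, the tuner's five-second tick) ages everything by one bin. None of the names here are from the actual code; it is just meant to show why the trim-path cost is only an increment or decrement of a uint64 reached through a smart pointer.

```cpp
#include <atomic>
#include <cstdint>
#include <deque>
#include <memory>

// One age bin's worth of byte accounting, shared by the items in it.
struct BinCounter {
  std::atomic<uint64_t> bytes{0};
};

// Sketch of age-binned usage tracking for a single cache.
class BinnedUsage {
public:
  explicit BinnedUsage(size_t num_bins = 4) {
    for (size_t i = 0; i < num_bins; ++i)
      bins_.push_back(std::make_shared<BinCounter>());
  }

  // On insert/touch: account the item in the newest bin. The item keeps
  // the returned pointer, so trimming later decrements the right counter
  // even after the bins have rotated.
  std::shared_ptr<BinCounter> touch(uint64_t size) {
    bins_.front()->bytes += size;
    return bins_.front();
  }

  // On trim: give the bytes back to whichever bin the item last lived in.
  static void trim(const std::shared_ptr<BinCounter>& bin, uint64_t size) {
    bin->bytes -= size;
  }

  // Periodic pass: age everything by one bin. Whatever falls off the end
  // simply counts as "old"; the counter object stays alive for as long as
  // items still reference it.
  void rotate() {
    bins_.push_front(std::make_shared<BinCounter>());
    bins_.pop_back();
  }

  // Bytes touched within the newest bin, i.e. what the tuner would treat
  // as this cache's most recent (highest-priority) demand.
  uint64_t recent_bytes() const { return bins_.front()->bytes.load(); }

private:
  std::deque<std::shared_ptr<BinCounter>> bins_;
};
```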
We'll see. My hope is that once we implement this, it will allow the caches to rebalance based on the workload. Right now, if you imagine a scenario where you've got lots of heavy RGW writes and then later on do a bunch of RBD small random reads and writes on the same cluster, you still might end up in this double caching scenario, because RGW sort of forced you into it, and, without the ability to adapt, RBD will end up in that same scenario, with a bunch of old RGW data cached in the KV cache, and you can't get out of it. With this priority-based binning, I believe we'll be able to say: okay, there's lots of old RGW data, but we don't care about that anymore; we have much higher-priority onode data coming in right now, so let's cache that instead and fit all of the onodes into cache. All of the onode data that was in the KV cache starts going into lower-priority bins, that shrinks, and now we're back in our optimal scenario, where we're caching lots of onodes and not caching KV data that's irrelevant. That's the goal with it, and I think we can get there.
Other future work might be to do this at the OSD level. Instead of just looking at a user-defined BlueStore cache ratio, I think we want to instead say: here's how much memory we want to give this particular OSD in general, and have that account not only for these caches, but for things like the write-ahead log buffers in RocksDB, the PG log buffers that are in memory, and other stuff that's potentially in memory right now. The OSD can potentially use significantly more memory than what's defined here: you give it three gigabytes of BlueStore cache, and in reality, in RGW workloads, we might be using close to eight gigabytes of RSS memory on the OSD. It's very, very different, and we can't account for everything (there's tcmalloc fragmentation, there's lots of other stuff that's going to be really hard to account for), but at least we can get closer.
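Budgeting at the OSD level could look something like the sketch below: start from one overall memory target for the daemon, subtract estimates for the non-cache consumers mentioned, and hand only the remainder to the cache balancing. The knob name and every number here are hypothetical; this is just the shape of the idea, not an interface that exists in the PR.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical overall budget for the whole OSD process, rather than a
  // BlueStore-cache-only setting.
  const uint64_t osd_memory_target = 4ull << 30;  // 4 GiB

  // Rough, made-up estimates for the non-cache consumers mentioned in the
  // call. Real numbers vary by workload, and some costs (e.g. tcmalloc
  // fragmentation) can't really be accounted for at all.
  const uint64_t rocksdb_write_buffers = 256ull << 20;  // WAL / memtables
  const uint64_t pg_log_buffers        = 384ull << 20;  // in-memory PG logs
  const uint64_t misc_overhead         = 512ull << 20;  // messenger, osdmaps, ...

  const uint64_t reserved =
      rocksdb_write_buffers + pg_log_buffers + misc_overhead;

  // Whatever is left is what the cache auto-tuner would be allowed to
  // balance across the KV / meta / data caches.
  const uint64_t cache_budget =
      osd_memory_target > reserved ? osd_memory_target - reserved : 0;

  std::cout << "cache budget: " << (cache_budget >> 20) << " MiB\n";
}
```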
Let's see... oh, this is the age-based one. Right now, even if we did this kind of priority-based binning, I don't think we're going to totally get rid of double caching in RocksDB, and I don't know if we'd entirely want to, but definitely in cases like RGW I think we want the omap data to be prioritized relative to the double-cached BlueStore onode data. We really want, I think, the onode data primarily in BlueStore's cache, and we really want the omap data cached in RocksDB's block cache, or at least we want to prioritize it versus onode data in RocksDB's block cache. So potentially, in the future, we can either create multiple caches for each column family, or at least implement some scheme to give different priorities to different prefixes or something, so that might be another potential future area of work that I think could yield improvements.

So that's it; we're more or less out of time, but are there any questions on any of this?