From YouTube: 2018-FEB-22 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly

A: Right, well, the first one just came in this morning, and it's pretty exciting, actually. Peter went through and figured out why the op tracker was slow and fixed most of it. The performance impact isn't totally gone, but it's much, much smaller; compare the blue blob to the green blob on the other side of the chart. That's pretty awesome. I had some style nits, but otherwise it all sounds fine to me.
A: Oh yeah, that one was close. Let's see... that inlined one, that's merged; some config normalization checks from Peter, those are merged; the observer for the config options that we talked about a couple weeks ago, that's merged. That's great! That should make it really easy to move away from the legacy stuff.
B: So this was motivated somewhat by a bug that came up recently. I don't know which one off the top of my head... maybe it was that one... no, it wasn't, just a trim happening more frequently. There was another one too, where there was something related to trim, and I started looking at the code for the BlueStore cache, and I guess the thing that really struck me is how leaky the encapsulation is.
B: It feels to me, at least, like there's just a lot... it's complicated, I guess. So, you know, I don't know how to fix the bug other than that. Between that and some issues with a partner that recently came up, it was apparent how confusing it can be: how we deal with BlueStore onode memory, memory for RocksDB's block cache, and then also having memory available just for caching data in BlueStore, and how we try to divide up the ratios between those and limit RocksDB's cache at some point. It's not only hard for users, I think, to understand all of that; even for us, we've got this scheme (which I was responsible for implementing) to try to make a sane default for how we divide all this up, and it's clear that it's not handling at least this particular partner's use case very well. So, you know, we could try to make that better. We could try to make it smarter. We could try to divvy things up more dynamically, so that we're balancing hit rates in our cache and RocksDB's cache, or maybe just try to make something that's a little bit better, a little bit of a better default that's still pretty simple, but even that feels crazy to me.
B: So the quest that I'm on now is to see whether, if we just rip out BlueStore's cache entirely, we can identify what the pain points it's solving are, and then fix those instead of having our own cache, and just rely on RocksDB's cache for metadata. I don't know if that's sane or not; it might not be. We might still need our own, but I want to at least get a better idea, when we don't have our own cache, of what we're fixing. Yeah, yeah.
A: So I think, there too, it's important to separate out metadata and data. The problem we're having here is the case where we're not caching any data, just metadata, and if you gave the OSDs lots of memory, we weren't giving RocksDB enough of it. We basically capped RocksDB at, I think, 512 megs or something really small, and gave everything else to BlueStore, and that was just a bad decision, I think, in retrospect.
A: We have to fix that regardless. But setting all that aside, the basic trade-off of eliminating the metadata cache in BlueStore is this: if you eliminate it, you're trading the overhead of maintaining that cache in memory against the cost of decoding the data that comes out of RocksDB every time. We did quite a bit in BlueStore to make that decoding much faster, but it's still slow; objectively speaking, I think it's still too slow, and I don't know that that's something we can fix in there short term. I think it needs a pretty fundamental rethink and redesign of how the data structures are laid out in memory, so that it matches the on-disk representation in RocksDB and there isn't a decoding stage: we can just map it into memory and then immediately use it.
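To make the "no decoding stage" idea concrete, here is a minimal sketch (not Ceph code; the struct and its fields are hypothetical) of the kind of fixed, trivially copyable layout that would let a value fetched from the KV store be used in place instead of being parsed into heap structures:

```cpp
// Hedged sketch: a fixed-layout header that needs no per-field decode.
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>

struct FlatOnodeHeader {   // hypothetical fixed-layout onode header
  uint64_t object_size;    // object size in bytes
  uint32_t extent_count;   // number of extent records that follow
  uint32_t flags;
};
static_assert(std::is_trivially_copyable<FlatOnodeHeader>::value,
              "must be usable without a decode step");

// "Decoding" becomes a bounds check plus a memcpy: no per-field
// parsing, no allocations.
bool read_header(const std::string& value, FlatOnodeHeader* out) {
  if (value.size() < sizeof(FlatOnodeHeader))
    return false;
  std::memcpy(out, value.data(), sizeof(FlatOnodeHeader));
  return true;
}
```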
B: ...whether it's cheaper to eat the encoding and decoding overhead than to use BlueStore's cache? Oh no, we don't really know. When we did this testing last, it looked like giving BlueStore's onode cache all of the memory we could more or less made sense, as long as RocksDB had some minimal amount, probably for indexes at the time; now it's indexes and filters. Yep.
A: So that's... if I remember correctly... yeah, so we have bluestore_cache_kv_ratio, which is 0.99 or whatever. So right now we're giving almost everything directly to RocksDB until we hit the cap, which is 512 megs, and then everything above that goes back to BlueStore. So it's really that cap that seems like a problem. It feels to me like the focus should be: what is the actual cost of maintaining that cache? Because that's the part that I don't really...
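For reference, the scheme being described (a kv ratio plus a kv cap) amounts to something like this sketch; it's illustrative, not the actual Ceph code, with defaults matching the numbers mentioned on the call:

```cpp
// Hedged sketch of the current split: bluestore_cache_kv_ratio = 0.99,
// bluestore_cache_kv_max = 512 MB, per the discussion above.
#include <algorithm>
#include <cstdint>

struct CacheSplit {
  uint64_t kv_bytes;    // RocksDB block cache
  uint64_t meta_bytes;  // BlueStore onode/data cache
};

CacheSplit split_cache(uint64_t cache_size,
                       double kv_ratio = 0.99,
                       uint64_t kv_max = 512ull << 20) {
  // Give kv_ratio of the budget to RocksDB, but never more than kv_max;
  // everything above the cap falls back to BlueStore's own cache.
  uint64_t kv = std::min(static_cast<uint64_t>(cache_size * kv_ratio),
                         kv_max);
  return {kv, cache_size - kv};
}
```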
A: Do we have a good understanding of how much that actually costs? Because there are these confounding issues. One is that we have to have some tracking of in-flight requests, and that's handled in the cache. So it might be that the workaround is just to steer all the memory towards RocksDB, and then BlueStore caches only what it has to. That requires basically no code change, or very, very little, and it might be enough, as long as it keeps all the cache structures small: all the lookup tables are small, it's only in-flight stuff, and it should hopefully be really fast. But again, we need to figure out for sure what the cost is there. The second thing is that the only way to cache data is to also have the metadata that it dangles off of, so if there's any kind of data caching enabled, then that doesn't help, right?
A: We'd have to have RocksDB tracking that in memory. I think that's important for CephFS workloads and for RGW workloads; not so much for RBD, because there's the client-side caching there, and the VM and everything else is caching at the filesystem layer. But for RGW, hot objects should get cached on the OSD, and for CephFS...
B: RGW is one of the cases where I'm worried that not being able to dynamically change the RocksDB cache hurts us, right? Because you've got so many more key-value pairs, potentially, with omap, that we don't know how much we need to be able to keep all the indexes and filters in memory. Yeah.
B: And it even gets more weird, right? Because you... well, I was going to say you load the bloom filters in with the SST files, but that's not true, because we're keeping them in the block cache. So you have a bloom filter for every SST, right? So, depending on the size of your SST file, you may be paging stuff in at different sizes: if you have a big SST file, you've got a big page-in for the bloom filter.
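The RocksDB knobs in play here are real API; with cache_index_and_filter_blocks set, the per-SST bloom filters live in the shared block cache. A hedged sketch (the cache size and bits-per-key are illustrative, not Ceph's production settings):

```cpp
// Hedged sketch of the RocksDB options under discussion.
#include <rocksdb/cache.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::Options make_options(size_t block_cache_bytes) {
  rocksdb::BlockBasedTableOptions t;
  // Charge index and filter (bloom) blocks to the block cache, so they
  // compete with data blocks instead of sitting in unbounded heap.
  t.cache_index_and_filter_blocks = true;
  t.block_cache = rocksdb::NewLRUCache(block_cache_bytes);
  // One bloom filter per SST file; its size scales with the file's keys.
  t.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10 /* bits/key */));
  rocksdb::Options o;
  o.table_factory.reset(rocksdb::NewBlockBasedTableFactory(t));
  return o;
}
```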
A: So my gut says that the simplest thing might be good enough, and that's basically just to fix that stupid cap, so that instead, above a certain threshold, you just steer, you know, 2/3, or actually even 1/3, to BlueStore, or vice versa, something like that.
A: My guess is that that's going to be good enough, but if it's not, then we should be able to ask RocksDB how many SST files there are and how big they are, to get an estimate of how big those bloom filters are, and then we can adjust the cache based on that. I'm just not sure it's going to be worth the complexity to do that. My guess is that we can do the simple thing and it'll be good enough.
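Asking RocksDB about its live SST files is real API; a hedged sketch of the estimate being proposed (the bits-per-key figure assumes the bloom policy sketched above, and per-file key counts depend on the RocksDB version):

```cpp
// Hedged sketch: estimate aggregate bloom-filter memory from live SST
// metadata. GetLiveFilesMetaData() is real RocksDB API; num_entries
// (per-file key count) is available in newer RocksDB releases.
#include <rocksdb/db.h>
#include <cstdint>
#include <vector>

uint64_t estimate_filter_bytes(rocksdb::DB* db, int bits_per_key = 10) {
  std::vector<rocksdb::LiveFileMetaData> files;
  db->GetLiveFilesMetaData(&files);
  uint64_t keys = 0;
  for (const auto& f : files)
    keys += f.num_entries;         // keys in this SST file
  return keys * bits_per_key / 8;  // one filter per SST
}
```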
B: I wonder... I wonder, because, I mean, first: if we keep the BlueStore onode cache, we've got that there, where we have to decide how much memory that thing has; we've got the RocksDB cache, where we have to decide that; and, if I recall, we've got some other buffers in various places... let's see, in the OSD code, the...
A: I think, to be clear, users should never have to think about any of this. I want users to have one knob that says: this is how much memory. Yes. So I think it really comes down to how sophisticated a model we think we need, and it's basically complexity versus diminishing returns. We can do something sort of trivial, a trivial set of heuristics like we currently have, and make them just good enough that they're good enough, or we can invest...
B: Could it be even something as simple as this: you have the different consumers, say RocksDB, with some little piece of logic in there that says "here's how badly I think I need memory" on some scale, you know, from zero to 1.0 or something, and then we give it memory based on how badly it thinks it needs memory versus how much the other things think they need memory?
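A minimal sketch of that idea, where each consumer self-reports a 0.0-1.0 "need" score and the budget is divided proportionally (illustrative only; the names are hypothetical, not Ceph code):

```cpp
// Hedged sketch: proportional memory assignment from self-reported need.
#include <cstdint>
#include <vector>

struct CacheUser {
  const char* name;
  double need;       // 0.0 .. 1.0, self-reported by the subsystem
  uint64_t granted;  // filled in by the allocator
};

void divide_memory(std::vector<CacheUser>& users, uint64_t total) {
  double sum = 0;
  for (const auto& u : users) sum += u.need;
  if (sum <= 0) return;
  // Each consumer gets a share proportional to its reported need.
  for (auto& u : users)
    u.granted = static_cast<uint64_t>(total * (u.need / sum));
}
```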
A: So, for example: you know it's all RBD, no omap, it's all block data, whatever, and we fill a device, and we figure out what the right allocation of memory between BlueStore and RocksDB is. Then we take the other extreme, where it's all RGW indexes filling an entire device, and we figure out where that balances, and then we just pick something that sits somewhere between those two extremes, without necessarily... Yeah, that makes sense.
A: Well, I think even if it does... I'm just guessing here, but my guess is that the surface of the thing you're optimizing is such that whether you're steering, like, 80 percent to RocksDB and 20 percent to BlueStore, or 80 percent to BlueStore and 20 percent to RocksDB, doesn't make a huge difference.
A: We can pick a value that's sort of middle of the road, that doesn't run off the rails for either extreme case: it still does well enough for the omap case, and well enough for the no-omap case. Then, you know, we might get another 10 percent if you're twiddling, trying to optimize it perfectly, but I'm not sure it's worth it, I guess.
A: Or something like that, especially since we just need to make sure BlueStore doesn't break, right? BlueStore shipped in Luminous, so we need something that we can backport: we just tweak the policy so that it's going to be okay for everything, and we don't want to introduce this whole other level of self-tuning and monitoring or something.
B: So I agree with you there that something very simple for Luminous, absolutely; we don't want to backport a big, big change. I do wonder, though, if we're still going to have a problem where you have very different requirements for how much cache RocksDB gets depending on what you're doing, like in the RBD case. That's...
A: I mean, we can, yeah, but it's going to be work. I'm just thinking about what the extremes are. The extreme in one direction is going to be, I guess, RGW objects: they're all 4 megs, they're written sequentially, and they're not fragmented at all. The other extreme would be 100% omap, tons of RocksDB keys, yeah.
B: And the bloom filters that we really, really need, I think, are probably for when we're setting or checking attrs; when we're doing writes, we're looking to see if stuff exists already, and we do, like, three of those for every single write or something. So yeah, I mean, that's really what we're talking about; the primary thing is we want, like, a super high... yeah.
A: ...fraction of them in cache, right, because it's going to be... So I think the simplest case would be: we size it so that we can keep all the bloom filters even for a complete device. If you fill an entire device with key-value data, we'll still be able to keep them all in memory. Yeah.
B: Even... I don't think it is, actually. I recall, with a different customer, a different partner customer, a workload where they had a big environment, you know, much bigger than our test environment here, and I recall seeing a lot of work being done creating and working on bloom filters. I suspect it was from PG log work, with all the key inserts that come in with PG log updates; I mean, PG log updates are a huge amount of our total key-insert workload. I suspect a lot of that work...
A: It feels to me like the two action items (excuse me), the action items, are: disable bloom filters for the PG meta items, and then deploy and populate an OSD, or a small cluster or whatever, that's just completely full of, probably, RGW index data or something, or some sort of simulated omap data; just completely fill it up, and then we can look carefully at what RocksDB...
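On the first action item: in RocksDB, bloom filters come from the filter policy on the table factory, so one way to drop them for a subset of keys is a dedicated column family with no filter policy. A hedged sketch (real RocksDB API, but whether Ceph's KeyValueDB layer exposes per-prefix table options like this is a separate question):

```cpp
// Hedged sketch: a column family without bloom filters for PG-meta
// keys, while other data keeps them.
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::ColumnFamilyOptions no_bloom_cf() {
  rocksdb::BlockBasedTableOptions t;
  t.filter_policy = nullptr;  // no bloom filter blocks for this CF
  rocksdb::ColumnFamilyOptions cf;
  cf.table_factory.reset(rocksdb::NewBlockBasedTableFactory(t));
  return cf;
}
```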
B: Oh, go ahead then; sorry, I don't want to interrupt, go ahead. I was just going to say, I know RocksDB will give you estimates on how many keys are present. There are some weirdnesses, I think, when you have duplicate keys in multiple levels, where it's not exactly what you have and what you don't. I don't know if there are offline tools where we could just shut everything down and then have it go through and count everything. I'm...
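The online estimate mentioned here is a real RocksDB property, and it is indeed approximate, since keys duplicated across levels skew it. A minimal sketch:

```cpp
// Hedged sketch: RocksDB's approximate live-key count (real API). For
// an offline pass, the sst_dump tool can print per-file properties.
#include <rocksdb/db.h>
#include <cstdint>

uint64_t approx_key_count(rocksdb::DB* db) {
  uint64_t n = 0;
  db->GetIntProperty(rocksdb::DB::Properties::kEstimateNumKeys, &n);
  return n;  // approximate: duplicate keys across levels skew it
}
```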
A: ...not sure that how many keys is actually what we care about, because users don't know how many keys they have. What we know is that we have a 10-terabyte SSD, or whatever, a 30-terabyte SSD from Samsung, or whatever it was; that's what we know. So if we fill that, then the question is: as an operator, I have this SSD, I know I'm going to be using it for RGW bucket indexes; how much memory do I need for my OSD for this size of device?
A: ...whatever that happens to be. Or we can just have a workload that generates the keys we care about, without even knowing how big they are, and just fill the device up. I'm kind of leaning towards filling the device up, because that will capture any other effects that we didn't think about. All right, okay, and...
A: We'd have to look at the logs to figure that out... I'm not sure... sure. So we can look at BlueFS to see how many SST files there are and how big they are, so we can actually see that. And I assume we can look at an SST and parse it out and figure out how big the bloom filter portion of it is; that's probably in the header of the file, where it tells you it starts with the bloom filter or something. Not sure.
B: The one little thing I had here, which I don't have up yet, I guess, is the wall-clock profile. When I was doing some of the earlier work, I did a wall-clock profile on FIO with the librbd plugin, because that was actually the bottleneck at some point: one client wasn't enough, so we had to scale to multiple clients, and looking at it, it looked like a lot of time... well, poll might have been a bottleneck. We were actually spending a lot of time just polling.
A: That's something that could be because the block cache in RocksDB wasn't big enough, but it could also be that RocksDB is still stupid about compaction: when it does a compaction and writes new SSTs, they don't go into the cache. They only go to the device, and it has to go read them back again as you start faulting against them.
A: Yeah, so two things. Yes, there are two reasons why that can happen. One is that the RocksDB cache isn't caching what it should because it's too small, which I think is what's going on here. The other cause, which we can't fix without fixing RocksDB, is that when RocksDB does its compaction, it takes all these SST files that are warm and in the cache and everything, and it writes new ones that are compacted, and it doesn't put them in the cache. It just writes them to the device, so immediately following a compaction you have a whole bunch of cache misses that have to fault everything hot back in again. That's a common problem with RocksDB and LevelDB and a bunch of these things; I've reviewed papers with weird, complicated schemes trying to mitigate that effect in LSM trees. When we talked to the RocksDB folks about it, they said they don't really want to just dump the newly compacted stuff into the cache, because it'll push out all the other stuff that was legitimately warm, and the newly compacted stuff may or may not be warm; they don't really know. Otherwise, every time you do a compaction you'd basically throw out everything in your cache.
A: The reason why I was pushing to get that turned off, and why we finally turned it off, was that I wanted to eliminate it entirely, partly to simplify the code, but mostly so that we wouldn't have a disparity between running on SPDK and running on a kernel block device. I think that disparity doesn't matter that much, and the code's already there; it works fine. So if it helps, we should just make use of it.
A: Yes, yes, but the Linux buffer cache is sort of free memory, because we're not using it anyway; all of our other caches are in the process, and so anything that sits in the buffer cache is memory we wouldn't be using anyway. It's free as far as we're concerned. I guess you might be crowding out, you know, something else in the page cache for a filesystem or something, but I don't think we care about that.
A: I think we should either just get rid of it, so we obey those ratios all the time, or we should change it to a kv_min, where, if we're below kv_min... Look at our settings: they basically say give everything to RocksDB. So if we change it to a kv_min, where, below that, all of the memory goes to KV, and above it, we obey the ratios... because right now...
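A sketch of that kv_min proposal (illustrative; the floor and ratio values are examples of the shape being proposed, not settings that shipped):

```cpp
// Hedged sketch of the proposed kv_min behavior.
#include <algorithm>
#include <cstdint>

struct CacheSplitKvMin {
  uint64_t kv_bytes;
  uint64_t meta_bytes;
};

CacheSplitKvMin split_cache_kv_min(uint64_t cache_size,
                                   uint64_t kv_min = 512ull << 20,
                                   double kv_ratio = 0.5) {
  // Below kv_min, RocksDB gets everything; above it, obey the ratio,
  // but never let RocksDB drop below the floor.
  uint64_t kv = (cache_size <= kv_min)
      ? cache_size
      : std::max(kv_min, static_cast<uint64_t>(cache_size * kv_ratio));
  return {kv, cache_size - kv};
}
```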
B: We don't necessarily want to give half of it to RocksDB and half to BlueStore, though, because that was the behavior we had previously, where we were having a lot of onode misses on fast devices. We had something more along those lines, and through a variety of steps we kept on giving more and more and more to BlueStore, to the onodes there, right, but...
A: Right, but again, this is going back to our original discussion, where we can have a complicated model that tries to give exactly the right amount to RocksDB and then everything else to BlueStore, or we can have something sort of middle-of-the-road and simple, and I'm kind of guessing, just guessing, that half-and-half sort of splits our risk there, right?
B: I don't know if I believe that statement, but I'm trying to see if I can find some of the old testing. Admittedly, that was without the bloom filters in the cache, so it's not strictly relevant, but what we have now was kind of the optimal at that point.
A: Okay, but setting that aside: the current behavior is that up until 512 megs we're giving 99%, so we're effectively giving all of it to RocksDB. I don't think that's any different from changing that kv_max to a kv_min, so that below it we dedicate all of it to RocksDB.
A: We'd basically flip those two, and it would be the same behavior (mm-hmm), but the new way, having kv_min instead of kv_max, would give us the flexibility to decide what happens above that 512 megs. Right now we're just stuck; we don't have the choice. Yeah, yeah. And I don't think...
B: That was what the testing bore out, too: you needed, at that point, at least that much memory to get reasonable performance, and then, yeah, it was exactly what Ben just said: it was a cliff, right? It wasn't exactly 512; it was probably more like 400 or something, but past that point giving more to RocksDB didn't help; it was better just to give it to BlueStore. I suspect that's going to continue to be the case: there is some amount of memory RocksDB needs, you give it that, and after that it's better to give it to BlueStore. That's what the testing seemed to show when we originally did this, and I think it's probably still the case; it's just that RocksDB needs more now that the bloom filters are there.
B: So one question, though: if we are doing buffered reads in BlueStore, and we can solve, or at least identify, the pain points (you know, whether it's encoding, or locking, or something else), is there any good reason to split the memory anymore and have the BlueStore cache at all? Because it still strikes me that it's a lot of complexity and a lot of code. Yeah, I don't know.
A: In order to cache any data at all, we rely on the metadata cache, because the data is associated with the metadata. And we need most of that complexity just to track in-flight writes. We could throw it all away and rewrite something that's probably simpler, but it wouldn't be that much simpler: we'd get rid of the LRU, but all the other data structures stay the same; we'd have the same lookup tables, basically.
A: ...a hot RGW object, right? I think we had something like this several years ago. It was something stupid, like a Minecraft tarball or whatever; just some gaming thing, some big file, and it got linked somewhere, and then a bazillion things were hitting it. Every single request went through RGW to the OSD, and the OSD wasn't caching it (for some reason; I can't remember what it was at the time), and so there was a disk I/O for every single GET; there was just no caching anywhere in that stack. RGW isn't caching. On the client side you could put a caching web layer in front of it, but I don't think that's sufficient, right? With FileStore you use the page cache, but with BlueStore it's direct I/O; you need to cache in there somewhere.
A: There are also all the data copies into the page cache that you avoid with direct I/O, and on reads there's no such thing as an asynchronous read that's buffered, so reads become blocking. All the AIO stuff that we did on the read path, we throw out the door if we use buffered reads on the block device, because... yeah, right, and all those... well.
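For context on the "no asynchronous buffered read" point: Linux-native AIO only behaves asynchronously on O_DIRECT descriptors; on a buffered descriptor, io_submit() effectively performs the read synchronously. A minimal sketch of the direct-I/O read path being described (illustrative; the device path is just an example):

```cpp
// Hedged sketch (not Ceph code). Build with: g++ ... -laio
#include <fcntl.h>
#include <libaio.h>
#include <unistd.h>
#include <cstdlib>

int main() {
  int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);  // direct-I/O fd
  void* buf = nullptr;
  posix_memalign(&buf, 4096, 4096);  // O_DIRECT needs aligned buffers

  io_context_t ctx = 0;
  io_setup(128, &ctx);  // allow up to 128 in-flight I/Os

  struct iocb cb;
  struct iocb* cbs[1] = {&cb};
  io_prep_pread(&cb, fd, buf, 4096, 0);
  io_submit(ctx, 1, cbs);  // returns immediately for O_DIRECT reads

  struct io_event ev;
  io_getevents(ctx, 1, 1, &ev, nullptr);  // reap the completion

  io_destroy(ctx);
  free(buf);
  close(fd);
  return 0;
}
```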
A: The other thing is all of the threading; it just totally changes the structure of BlueStore, because right now we queue an AIO and then we continue, and then there's a different thread that picks up the I/O completions. That's completely different, because the AIO thread doesn't do anything anymore; we're not doing AIO. Yeah.
D: Just a minor note: I'm currently starting some of the investigations we talked about the other day, trying to look for a few points where we can get a short-term win to keep the customer that shall not be named happy, regarding some of the PG log and omap stuff. Cool.
D: Specifically, the deferred writes: there's a whole lot of transaction overhead, and if we could temporarily move some of that through a fast medium and then move it to the slow one when it settles, as well as making the PG log not compete as much with everything else, that should make them not hate us. Are they on FileStore?