From YouTube: Ceph Performance Meeting 2023-01-26
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contrib...
What is Ceph: https://ceph.io/en/discover/
A: Hey folks, I think core standup is just starting to wrap up, so hopefully we'll get those folks in a minute or two here.
A: So, as we're waiting for them: every year I take the Etherpad and try to archive it into the previous year's pad and start over with a new one, and it turns out that we have, not exactly a bug, but also a configuration setting where, when I tried to copy and paste into the previous year's Etherpad, it basically silently fails. It looks like it worked, but behind the scenes the buffer was too big for what's allowed on the server, and it silently fails.
A: And so, unfortunately, if you look at the Etherpad, there are these comments, "empty" and "empty": we lost the 2021 record. Unfortunately, I think that had happened previously and I didn't know about this, so it looked like it was fine, but it never correctly put anything into it. I did happen to notice this for the 2022 one though, and I have a locally saved copy of that Etherpad.

A: I think I just saw that Adam Kraitman adjusted it, so I'll try to get that in there so that we'll have the record from 2022; but I think, unfortunately, 2021 is gone. The good news is that now we should have larger buffers, so hopefully this won't happen again.
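A cheap way to catch that kind of silent failure is to round-trip the archive through Etherpad's HTTP API (getText/setText) and compare the text afterwards. A minimal sketch, assuming a standard Etherpad API endpoint; the server URL, API key, and pad IDs below are placeholders:

```python
import requests

BASE = "https://pad.example.com/api/1"   # hypothetical Etherpad instance
APIKEY = "changeme"                      # from APIKEY.txt on the server

def api(method, **params):
    params["apikey"] = APIKEY
    resp = requests.post(f"{BASE}/{method}", data=params)
    resp.raise_for_status()
    return resp.json()["data"]

def archive_pad(src_pad, dst_pad):
    text = api("getText", padID=src_pad)["text"]
    api("setText", padID=dst_pad, text=text)
    # Read it back: a silent server-side rejection (e.g. a payload larger
    # than the configured socket.io buffer) shows up as a mismatch here.
    echoed = api("getText", padID=dst_pad)["text"]
    if echoed != text:
        raise RuntimeError(f"archive failed: wrote {len(text)} chars, "
                           f"pad now holds {len(echoed)}")
```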
A: All right, so I guess they're still in core standup, but I can maybe get started here while we wait. There were two new pull requests this week that I saw, both from Igor. The first one is "Do not reset prefetched buffer when doing multi-chunk..."; I assume "reads" is the rest of that statement.
A: We talked about this a little bit this morning, and Igor and I met... oh, Igor, you're here, never mind; you can talk about this better than I can.
B: It skips through the file and then proceeds with reads of the same chunk using a much smaller block size, and it appears that we have some issues in our internal prefetching, which resets these prefetched blocks too early; the PR addresses that. So I could see some improvements in statistics, in the performance counters that we shared at first, but not much in the real performance numbers, to be honest.
B: The issue that we still have is that the buffer is in memory and, instead of going to disk, I believe the kernel already has its own copy in cache, and that's why the improvement is not that big.
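For intuition only, here is a toy model of the behavior the PR targets: keeping a prefetched buffer valid for subsequent smaller reads that land inside it, instead of resetting it on every read. Hypothetical names; this is not the BlueFS code:

```python
class PrefetchReader:
    """Toy read-ahead buffer; the real BlueFS logic differs."""

    def __init__(self, file, prefetch_bytes=1 << 20):
        self.file = file
        self.prefetch_bytes = prefetch_bytes
        self.buf = b""
        self.buf_off = 0  # file offset of buf[0]

    def read(self, off, length):
        # Serve from the prefetched buffer when the request is inside it.
        if self.buf_off <= off and off + length <= self.buf_off + len(self.buf):
            start = off - self.buf_off
            return self.buf[start:start + length]
        # Miss: refill one large chunk. Crucially, do NOT throw the buffer
        # away on every call, or multi-chunk reads re-fetch the same data.
        self.file.seek(off)
        self.buf = self.file.read(max(length, self.prefetch_bytes))
        self.buf_off = off
        return self.buf[:length]
```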
A: Okay, well, do you also want to talk about your other PR here?
A: Yeah, I was going to ask: if you're doing a lot of experimentation right now, it'd be really interesting to know whether or not the PR made to enable the ability to do compaction on iteration helps here or not.
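For context, RocksDB exposes this capability through its compact-on-deletion table property collector (NewCompactOnDeletionCollectorFactory), which flags an SST file for compaction when a sliding window of entries contains too many tombstones. A sketch of just the windowing idea, with made-up parameter values:

```python
from collections import deque

def needs_compaction(entries, window_size=128, deletion_trigger=50):
    """Return True if any window of `window_size` consecutive entries
    contains >= `deletion_trigger` tombstones. Mirrors the idea behind
    RocksDB's CompactOnDeletionCollector; parameters are invented."""
    window = deque()
    deletes = 0
    for is_delete in entries:          # entries: iterable of bools
        window.append(is_delete)
        deletes += is_delete
        if len(window) > window_size:
            deletes -= window.popleft()
        if deletes >= deletion_trigger:
            return True
    return False
```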
B: Yeah, I haven't looked at it, at least from this point of view, but I'll do my best to.
A: Sure, sure; just if you happen to be doing other experiments, I'd be very interested whether that shows any improvement or not.
A: All right, I think yours are the only two new PRs I saw this week, although I apologize to anyone if there's something I missed. Let's see: I saw one closed PR, and this is also from Igor; you're the person doing all the work this week, I think. This is the one enabling 4K allocation units for BlueFS. It looks like you merged that, finally.
B: That's primarily intended to fight unexpected allocation failures on highly fragmented disks. So right now we don't care about the amount of contiguous chunks on the drive, because we are able to fall back to a 4K allocation unit. Hopefully this fallback wouldn't happen very often.
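A rough sketch of the fallback idea as described (a hypothetical allocator interface, not Ceph's actual API): try the preferred coarse allocation unit first, and retry at 4K granularity when fragmentation defeats it.

```python
def allocate(allocator, want_bytes, prefer_au=64 * 1024, fallback_au=4 * 1024):
    """allocator.allocate(size, alignment) -> extents or None (hypothetical).

    Try the coarse allocation unit first; on failure (no contiguous
    aligned chunks left on a fragmented disk), retry at 4K instead of
    failing the whole operation."""
    extents = allocator.allocate(want_bytes, alignment=prefer_au)
    if extents is None:
        extents = allocator.allocate(want_bytes, alignment=fallback_au)
    if extents is None:
        raise RuntimeError("ENOSPC: even 4K-aligned allocation failed")
    return extents
```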
A: Okay, let's see, moving on to updated PRs: another one from Igor, the one using bounded iterators on omap range keys. It looks like Cory reviewed that and approved; we just need QA on it now.
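As I understand it (treat this as an assumption about the PR), the underlying RocksDB mechanism is the iterate_lower_bound/iterate_upper_bound pair in ReadOptions, which stops the iterator at the bound instead of letting it walk past one object's omap keys into a pile of tombstones. A toy equivalent:

```python
import bisect

def bounded_scan(sorted_keys, lower, upper):
    """Yield keys in [lower, upper) without ever stepping past `upper`,
    the way a bounded RocksDB iterator avoids walking tombstones that
    sit beyond the range being listed."""
    i = bisect.bisect_left(sorted_keys, lower)
    while i < len(sorted_keys) and sorted_keys[i] < upper:
        yield sorted_keys[i]
        i += 1

keys = ["obj1.a", "obj1.b", "obj1.c", "obj2.a"]
print(list(bounded_scan(keys, "obj1.", "obj1.\xff")))  # only obj1's keys
```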
B: Yeah, and that's actually an issue which one can face, and it's in the same scope as many of the issues we have got with RocksDB being impacted by tons of tombstones.
B: That's a sort of incremental improvement related to RocksDB.
A: Right. Yeah, I'm hoping that we're kind of nuking this issue with tombstones from orbit. We've got multiple PRs coming in from multiple people all trying to solve this, and hopefully, through a combination of lots of different attempts, we'll fix it.
B: At least in specific cases; but generally we might still hit the issue again.
A: There was a really easy way to generate the effect of tombstones causing slow iteration a couple of years ago, by using RGW with the delete-range configurable that we have set to allow delete-range even with a small number of deletes; that could pretty quickly cause issues.
B: Right; I tried RGW deletes in my lab for recent experiments, and I switched to range deletes, but I don't know, perhaps it somehow depends on the workload sources. In my case this issue is not easily reproducible. I managed to do that a couple of times, but I can't say that it's pretty easy to hit.
A: Okay, if you're still struggling, let me know and I can try to reproduce the experiment that I did a couple of years ago, and see if I can still hit it when we change the settings for the threshold for delete range, and just see if it happens again.
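If memory serves (treat the option name as an assumption), the knob involved is rocksdb_delete_range_threshold, which decides how many keys a batched deletion needs before BlueStore issues a RocksDB DeleteRange rather than individual deletes; lowering it makes range-delete tombstones easy to generate. A sketch of that setup:

```python
import subprocess

def set_delete_range_threshold(value=0):
    # Lowering the threshold (assumed option name, verify on your release)
    # forces BlueStore to use RocksDB DeleteRange even for small batches
    # of deletions, which made tombstone-driven slow iteration easy to
    # reproduce in the earlier experiment.
    subprocess.run(
        ["ceph", "config", "set", "osd",
         "rocksdb_delete_range_threshold", str(value)],
        check=True,
    )
```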
B: But no; recently I learned that Speedb released an open-source version of their drop-in replacement for RocksDB. You might want to try to use it as an alternative engine at some point.
A: Let's see... it's been a long time since I've looked at it. I know Adam has also looked at it, but from what I recall, I didn't think that the improvements they made would help us a lot. Maybe that's incorrect, and they may have also improved it dramatically in the last year or two; but the last time I looked at it, I wasn't convinced it was a huge improvement for us.
A: Igor, you may know more about what they're doing than I do, but I thought that they were doing a lot of work very similar to, like, the WiscKey improvements for write amplification with certain workloads.

B: Well, actually, I'm not aware of this stuff in detail.
A: I think so; I mean, I believe it's just basically very similar to RocksDB with modifications, but I'm not entirely sure. Adam, you probably know the most about it of any of us; do you remember what it looked like, or what the behavior was?
D: Not really. I just remember I got Speedb integrated with Ceph and ran the tests that I actually did have at the time, so I could get comparable results, and the results weren't amazing. Let me try to find a link. There were some improvements, some degradation overall; I don't even claim that my tests were the best. It was some 4K random write, some mixed reads/writes. I remember there were no omap operations, so maybe on those it would be better.
A: Sure, sure; with it being open source, we can certainly do more to try to understand what it's doing differently.
A: I still think, though, that the biggest improvements that we're likely to see immediately are in the write-ahead log behavior, which, Igor, I think maybe makes your work more important, frankly; and then also the behavior of tombstones with iteration, but we're already kind of trying to fix that anyway.
B: Yeah, and I'm definitely planning to come back to this alternative write-ahead log implementation, and I've got one more idea.
B: It might be helpful for absorbing the write stalls which RocksDB triggers during compaction from time to time. So if we have a large enough external write-ahead log, we might want to use it to ride out the stalls that compaction raises.
B: But anyway, I've seen periodic latency peaks in the field multiple times, and again I have the feeling that they're caused by RocksDB. If we can absorb that with an external write-ahead log, it would be great, I believe.
A: Yeah, I was very impressed by the numbers that we were seeing from your experimental work earlier, I think in the spring, when you were looking at that. It's definitely worth continuing effort, I think. With the improved settings, I think we were seeing like 122,000 write ops for a given OSD, which is really, really good.
A: Yeah. Okay, let's see, moving on: the next PR is upgrading to the latest RocksDB from Facebook. I'm reviewing that one. The gist of it is that we can't switch over to the RocksDB version directly; we have to use our existing branch that has a fix that was implemented a couple of years ago for upgrade scenarios.
A: The author was trying to push a branch directly to our RocksDB repo and didn't have permission to do so. So just this morning I asked if they could fork that repo, branch directly from our master version, update it to RocksDB's latest, and then create the pull request from their own branch on their own fork. We'll see if they have success doing that.
A: Otherwise we'd really need to look at access permissions, but I don't think that will be necessary. So anyway, that one's still in the works. And I think the last PR I have here is that Laura made an additional review, I think, of this balancing-score PR from Josh Salomon, the one he kind of described for us last week.
A: So just a little movement on that one; we'll see if it finishes full review soon or not. In the no-movement category: I made it about, not even quite, halfway through, but I think I looked at the most recent stuff, so there may have been a couple here that were closed by the bot; otherwise I don't think there's a whole lot to talk about there. All right, so that's it for the pull requests.
A: Let's see, Josh: we didn't finish your discussion from last week. Would you like to continue talking about the latency spike issues that you were seeing?
C: Yeah; actually, since the last time we met, we've pretty much tracked this down. I filed the tracker there; I'm not sure if folks have actually looked at it or not, but I'll very briefly say what we were seeing, and then I'll talk a little bit about what we found to be the cause.
C: So we have this monitoring software that we use; we wrote it ourselves. It gives us things like a histogram of latencies, including max latencies.
C: Over time periods it's, of course, a histogram, so it's bucketed: if we see a five-second spike, that could be anywhere between two and five seconds, just due to how our monitoring works, that sort of thing. But we also set a ten-second timeout in the software, and what we saw after the Pacific upgrade was that, on multiple clusters, we started to see this timeout fire where it had almost never fired before; and we'd see these irregular five-second, ten-second bucketed latency spikes, versus before, when some of these clusters were better behaved.
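For reference, a toy version of that kind of bucketed max-latency reporting (our own sketch, not their monitoring code); the bucket edges here are invented:

```python
import bisect

# Upper edges of the latency buckets, in seconds (made-up layout).
BUCKETS = [0.01, 0.1, 0.5, 1, 2, 5, 10]

def bucket_of(latency_s):
    """Return the bucket upper edge a sample falls into: a 3.2 s sample
    reports as the 5 s bucket, i.e. 'somewhere between 2 and 5 seconds'."""
    i = bisect.bisect_left(BUCKETS, latency_s)
    return BUCKETS[i] if i < len(BUCKETS) else float("inf")

counts = {}
for sample in [0.004, 0.2, 3.2, 11.0]:
    b = bucket_of(sample)
    counts[b] = counts.get(b, 0) + 1
print(counts)   # {0.01: 1, 0.5: 1, 5: 1, inf: 1}; inf = over the 10 s timeout
```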
C: They would just not report it. Once we put all those clues together, Alex and I thought: aha, I wonder if this is a messenger-level throttle. And the issue is that, due to a decision made years ago by our predecessors, all of the throttling perf stuff had been disabled, so we couldn't actually inspect the throttles. So we turned them on in one of our clusters; we've now since turned them on everywhere, because there's just no reason to have them off.
C: There was a performance concern years ago, but I don't think it was ever substantiated; it was like, oh, some vendor told us you should just turn this thing off, and so we did. There are lots of those sorts of things floating out there in the ether that just are not good ideas. So anyway, with the counters on, it was very clear that the client message throttle was triggering every single time.
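Once the throttle perf counters are enabled, they show up in the admin-socket perf dump under throttle-* sections; a climbing wait count alongside a latency spike points at the throttle. A sketch (the exact counter names vary by release, so check your own output):

```python
import json, subprocess

def throttle_counters(osd_id=0):
    """Dump throttle-* perf counters for one OSD via the admin socket.
    Section names like 'throttle-osd_client_messages' are assumptions;
    verify against your own `ceph daemon osd.N perf dump`."""
    out = subprocess.run(
        ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"],
        check=True, capture_output=True, text=True,
    ).stdout
    perf = json.loads(out)
    return {k: v for k, v in perf.items() if k.startswith("throttle-")}

for name, stats in throttle_counters(0).items():
    # The 'wait' stats climbing alongside latency spikes implicate the throttle.
    print(name, stats.get("wait", {}))
```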
C: We saw one of these huge latency spikes, and so of course we went and did some digging, and found out that it had been re-enabled in Octopus and Pacific. It would have been re-enabled in Nautilus 14.2.23, had that ever been released, but it was zero in Nautilus. And the funny thing is, we actually had this throttle explicitly disabled previously; when we upgraded to Nautilus, we went and looked, and we were like:
C
Oh,
it's
set
to
zero
now,
so
we
don't
have
to
have
an
explicit
disable
anymore,
so
we
got
rid
of
it,
and
so,
when
we
upgrade
to
Pacific
now
the
throttle
got
enabled
for
the
first
time
essentially
for
our
Blockbusters,
where
it
had
never
been
enabled
before.
C: We actually had a case where our monitoring suite was starved for three minutes straight; it could not get any IO through to an OSD. And that's not surprising, because all it takes is one customer who does keep getting through to the OSD over and over and over again, and all these connections that get delayed just keep getting delayed, delay, delay, delay; they can never get their IO through.
C: We tried setting this throttle higher; the default in Pacific is 256. We tried setting it to 4096, and it wasn't high enough. We set it to 16384 and it mostly helped, but... I mean, I understand the reasoning that I eventually dug out of some Red Hat documentation somewhere: basically, you're trying to prevent an OSD flapping when there's too much client traffic, right? Is that the base reason for this?
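For the record, the throttle being described sounds like osd_client_message_cap (treat the name as an assumption, since the recording never spells it out), which caps in-flight client messages per OSD and defaults to 256 in Pacific. Adjusting it is a one-liner:

```python
import subprocess

# 0 disables the client message throttle entirely; 4096 and 16384 were the
# intermediate values tried in the discussion above.
subprocess.run(
    ["ceph", "config", "set", "osd", "osd_client_message_cap", "0"],
    check=True,
)
```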
C: So the setting of 256 probably makes sense for a spinner; I don't know that it really makes sense on flash, where you could get, you know, 30,000 IOs done a second in some cases, right? Yeah, so when I did some digging around the internet, of course, everyone has their own setting for this. I did see someone set it as high as 64k. But I'm sure most people are just setting it to random stuff, and then, once their problems go away...
C: ...they never really think about the implications of how they're setting it. So anyway, like I said, we've turned it off, because that's the only thing that we found was safe, at least for our block workloads; our block workloads are much happier with it off than on at pretty much any level.
C: And it's hard-coded. I actually got really scared, because when I was paging through this, the units for that timer value are actually wrong in the header.
C: This goes back to, I don't know, 2016 or something like that: the units were actually changed in this timer from milliseconds to microseconds, and so it actually looks like it's a one-second delay until you dig into the implementation and find out it's actually microseconds underneath the covers. They updated the units in the .cc file but not in the header, so I do have a PR up for that somewhere, to fix the header file...
C: ...so the unit is documented correctly. But no, okay, it's not configurable. At the end of the day, the only way to really avoid starvation here is, you almost want to have a fair throttle mode where, as soon as you're throttling, every single connection has to go in a queue and you're always choosing from the head of the queue. Then you get a FIFO across connections, and you avoid the starvation problem.
C: Even better would be to actually have the connections, sorry, the messages in a queue, because then you get first-come-first-serve on the messages; but that destroys the whole point of the throttle, which is: don't pull the messages off the connections until you're ready to, right. So the next best thing is: all the connections go on a queue; you FIFO the connections, and every time...
C: ...you serve the head, put it at the back, and keep doing it. Yep, yep. So this would have to be done for every throttle implementation, to make them all fair, right; it's not just, oh, it's just this one throttle that has this problem. Basically anything in the messenger has this problem. I haven't looked at how the other throttles are implemented, but at least the ones on the connection path, for inbound messages, are all susceptible to this problem.
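A sketch of the "FIFO across connections" scheme described above (a toy model, not the messenger code): throttled connections park in a queue, and capacity is always granted strictly at the head, so one busy client cannot starve the rest.

```python
from collections import deque

class FairThrottle:
    """Grant message slots first-come-first-served across connections."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.waiters = deque()          # connections blocked on the throttle

    def acquire(self, conn):
        # Anyone arriving while others wait must queue behind them;
        # otherwise a hot connection keeps winning the race forever.
        if self.in_flight < self.max_in_flight and not self.waiters:
            self.in_flight += 1
            return True
        self.waiters.append(conn)
        return False                    # caller parks until woken

    def release(self):
        self.in_flight -= 1
        if self.waiters:
            conn = self.waiters.popleft()   # strictly head-of-queue
            self.in_flight += 1
            conn.wake()                     # hypothetical resume hook
```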
B: But how does this explain the...

C: Write amplification? It does not; those are two separate problems. We still don't know what's causing the write amplification, so if we've got time to talk about that one, we can talk about the digging we did over the last week on that too, absolutely. Yeah, with the p100 thing, I'm glad we figured it out, glad we were able to tweak it with a setting to make the problem go away. I did file that tracker; it's one of those things where I would love to just go and work on it.
A: Yeah, I've been kind of feeling like we've been ignoring the messenger too much. There's a lot of stuff there that I'm kind of scared of, and this maybe reiterates that we need to go through and look at it a little bit more closely again.
C: Yes; continue on, continue. Okay, so: write amplification. We were all excited last week, because I think you and others were thinking, what if it's the deferred writes thing? So we spent a lot of time trying to chase that angle internally, and I think we've pretty much concluded from the available stats from perf dump that there is no increase in the number of deferred writes between Nautilus and Pacific on our systems; it looks pretty much equivalent.
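A hedged sketch of that kind of before/after comparison; the deferred_write_ops / deferred_write_bytes counter names under the bluestore section match my memory of perf dump output, but verify them on your release:

```python
import json, subprocess

def deferred_write_stats(osd_id):
    """Pull BlueStore deferred-write counters from one OSD's perf dump."""
    perf = json.loads(subprocess.run(
        ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"],
        check=True, capture_output=True, text=True).stdout)
    bs = perf["bluestore"]
    return bs["deferred_write_ops"], bs["deferred_write_bytes"]

# Sample before and after an upgrade (or across an interval) and compare
# the deltas: equal deferred-write rates rule out "more deferred writes"
# as the source of the extra device traffic.
```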
C: I did want to correct one thing I said last week: I thought all our OSDs were 4K min_alloc_size. That's not true; anything that was deployed under Luminous and before is actually 16K min_alloc_size. And so, as you would expect, those OSDs show higher deferred write rates than the 4K ones in general, and we can see that. But there's no change to either the 16K or 4K ones in terms of how many deferred writes are happening per second once we do the upgrade.
C: So it's not a change in the count of deferred writes. Could it be the deferred writes are more expensive? I mean, that would probably take some digging, if that's exactly what's happening there; I don't know. The one thing I did notice when I filed the ticket... this is where you start to get into: you can go and slice the statistics to give you all sorts of interesting numbers, but whether they're actually relevant...
A: Okay, interesting. Is it more absolute, or more the ratio of the IO size, or both?

C: Let me bring up the ticket.
C: So it looks like, on the 4K min_alloc_size hosts, the average IO size jumped from, I don't know what you want to call it, like 13 KiB up to like 16.
C
Okay
and
then
you
can
like
this
is
you
can
also
see
there's
that
yellow
line
that
kind
of
transits
between
the
top
group
and
the
bottom
group
yeah
that
host
we
were
actually
reconditioning
all
like
rebuilding
all
the
osds
on
sorry.
A: Right, yeah.
C: I'd have to go and dig. These systems also have a mix of at least two different generations of hardware in them, and then also deployments across, in this case, Luminous, Nautilus, and obviously now Pacific. So they have all these interesting histories, where we almost have to dig on a per-host-group basis to go and figure out, okay, what is the exact history of how we got here.
A: Yeah, I was curious if that would help explain any of this; like, you could understand the difference between those, but I don't know if it does or not. Maybe. Okay, but the big difference, right, is that we see the ones with a 16K min_alloc_size are definitely showing higher write sizes than the small ones, as you would expect, although it is interesting that they're not that far apart.
C: Very small, yeah; I mean, to my eyes there is a small jump. But the other weird thing is, you can see the average IO size was actually falling in the recent history of Nautilus, up until the upgrade, and then it looked like it started increasing after that. And the unfortunate thing is that the cutoff there is as much history as we have. We have monthly cycles, we have weekly cycles; there are all sorts of interesting cycles that happen in our system, so it's entirely possible...
C: What I did attach to the ticket is: we did get a perf dump across three different OSDs on a different system that we upgraded, both before and after the upgrade; the same three.
A: It's also kind of interesting that it almost looks like, at the very beginning of this trace, at least some of these hosts were in the same ballpark as after the Pacific upgrade; like, of the ones that had the 4K min_alloc_size, it looks like half of them maybe started out pretty close to where Pacific ended up.

C: Yeah, I mean, not quite; they were definitely a little lower there. But like you said, there could be some fluctuations over time.
C: Bringing that up now: the bimodal distribution actually still exists.
D: We would, but last week Jeff said that the total amount of data transferred is also increased, and that cannot be explained by fragmentation.
A: I was wondering more about whether there's any possibility that we end up with additional writes due to...
C: [crosstalk]
C: As Alex says, yeah, we actually did this upgrade recently, and we actually collected logs with some debug levels; correct, I don't remember which ones. I'll check which ones we have from that.
A: If you want to, you can try running... there's a script in CBT for just making those statistics look nicer; I've mentioned it before. You can try running this Python script on the before and after ones, and that will give you a whole bunch of information, like summary statistics on the number of input and output records and the write rates for RocksDB.
A
That
might
actually
tell
you
whether
or
not
rocksdb
is
like
writing
out
more
data
before
and
after
maybe
it'll
help
explain
it
if
it
is.
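In the same spirit, a small stand-in for that script (the "Cumulative writes" line format is from memory of RocksDB's periodic stats dump in the LOG file and may differ by version):

```python
import re, sys

# Matches RocksDB LOG lines like:
#   Cumulative writes: 813K writes, 4942K keys, ... ingest: 3.52 GB, 1.08 MB/s
PATTERN = re.compile(
    r"Cumulative writes: (\S+) writes, (\S+) keys"
    r".*ingest: ([\d.]+) GB, ([\d.]+) MB/s")

def last_cumulative_writes(log_path):
    """Return the final (writes, keys, ingest_gb, rate_mb_s) tuple seen in
    a RocksDB LOG, or None; compare before/after upgrade to spot extra
    write volume at the RocksDB level."""
    last = None
    with open(log_path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                last = m.groups()
    return last

print(last_cumulative_writes(sys.argv[1]))
```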
A: This is all super interesting, Josh; thank you especially for digging in on that issue with the messenger. We kind of tested that, and we were like, okay, it's looking good, let's merge it; and then we didn't hear any issues from anyone, and that's unfortunate.
C: We have not attributed any user complaints to it, although we know that users were seeing it; like, I've gone and run fio in a VM over multiple days, and I saw fio see like 10-to-20-second latency sometimes.
C: I know it was hitting people, but nobody complained about it, and we wouldn't have noticed either, other than the fact that we had started paying a lot more attention to our p100s starting about a year ago. So yeah, I wouldn't be surprised if this is widespread; especially if people have been running Luminous, they would have been seeing the same thing, I'm assuming. I didn't look at the V1 messenger; I'm assuming its throttle implementation is the same, but maybe it's not.
C: All right, well, we'll try to dig at those RocksDB logs a little bit and see. But I'll definitely admit, on our side we're kind of out of ideas at this point as to what to investigate, other than, right now, the path of really heavy IO tracing and trying to pick that apart, which we don't have time to do.
B: Yeah, just briefly, from the perf counters for the ticket: specifically, what I can see is a pretty high difference in log bytes, like the BlueFS log.
C: What's the exact stat name that you're looking at?
B: The log_bytes and logged_bytes counters. log_bytes, if I remember correctly, is the log size, and logged_bytes is the amount of bytes written to the log over the period; unfortunately, the same stat isn't always present. It's something like 13 megabytes versus 53 megabytes for Pacific.
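Those counters can be sampled the same way as the perf dumps above; here is a sketch that turns two samples taken an interval apart into a log/WAL write rate (counter names as recalled in the discussion, so verify them on your build):

```python
import json, subprocess, time

def bluefs_counters(osd_id):
    """Return the bluefs section of one OSD's perf dump."""
    perf = json.loads(subprocess.run(
        ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"],
        check=True, capture_output=True, text=True).stdout)
    return perf["bluefs"]

before = bluefs_counters(0)
time.sleep(60)
after = bluefs_counters(0)
for key in ("logged_bytes", "bytes_written_wal"):
    if key in before:   # counter availability differs across releases
        rate = (after[key] - before[key]) / 60
        print(f"{key}: {rate / 1e6:.1f} MB/s")
```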
C: One of the things that we do have in Prometheus is bluefs bytes_written_wal.
B: Some increase in bytes written to the WAL; well, it's like 25 or 30 percent, not...
D: There is also one change now: BlueFS reports much more space available for storage, and it's consistent with the change we made. Previously we had separated storage for BlueFS and for block data, and we had a mechanism that regularly gifted or reclaimed large portions of the main device for BlueFS, so BlueFS could get contiguous space underneath.
A: Well, guys, we are a little bit over; is that a good place to wrap up, do you think?
B: Yeah, I'm curious; I'd suggest checking logged bytes at larger scale, to see how it behaves. Okay.
C: Yeah, we export a bunch of these perf dump things to Prometheus; I don't think we export those ones yet. So we can do that, and then observe across one of our upgrades, for example, and see what happens long term.
A: See you later, okay. And guys, I think it's probably a good point for us to wrap up too.