From YouTube: Ceph Performance Meeting 2023-01-19
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contrib...
What is Ceph: https://ceph.io/en/discover/
A
Hey folks, I just got out of the core standup. I think we'll get some folks here soon; we were just talking about it. It sounds like maybe we've got the build issue fixed this morning, so not performance related exactly, but good news. It's been a long saga: we were running on Lustre for a while, just because that was what was set up years upon years ago, and we're in the process of moving it over.
A
We ended up with a Ceph-related problem that I think we've got fixed now. It sounds like Radek was saying that he submitted something this morning, and it went through and built on CentOS 8, which was the issue that we've been having. So anyway, lots of good stuff.
A
All right. Oh, I see, Josh, you're adding a bunch of stuff here; very good. Before we get into all that, I'll quickly go through pull requests. I almost made it through the list this week; I got really close. Okay, let's see, new this week: Igor, you've got this very interesting-looking PR, bounded iterators for rm_range_keys.
A
Anything... well, I saw you added Corey as a reviewer for it, so, Corey, you know you've been tagged. Anything interesting there, or anything to watch out for with this?
B
Oh well, first of all, I managed to reproduce that locally; I mean, I reproduced the RocksDB performance degradation locally. It was the omap clear function, which called rm_range_keys and iterated over multiple column families and column family shards, and it counted up to one million keys to check whether it needs a ranged delete or not, and this definitely triggered the issue.
B
We effectively stall the current thread, then another thread from the same OSD shard, and potentially, if we have multiple clients which access PGs randomly, we can cause them to stall completely.
B
So
people
I
mean
that
again,
single
misbehave
in
threats
might
effectively
stall
that
the
the
whole
Blue
Store
actually
I,
know
the
way
how
to
avoid
this.
Just
to
share.
A
I had a test case that did really, really badly when we had completely unlimited range delete before. It was an RGW test case. I was just looking for the PR where I had the stuff for it.
A
I'll see if I can find it again, but if it's useful for you, we could try running through those old test cases again and see how it does if we allowed range delete to happen with fewer elements.
B
Well, I have one thought: maybe the root cause analysis wasn't 100% valid at that time. So could it be that something like this unbounded...
A
I can show you, though. I found the PR where I was testing the behavior. I don't know, maybe this could have been more related to some other aspect of this, but it appeared that, for whatever reason, when we did range delete with a small number of keys, it made iteration extremely slow in general.
C
I'll say that I've sort of tested both over time, and I haven't seen really any difference from my side between a delete-range tombstone and the same set of individual tombstones covering the same contiguous range. But in either case, if you have a large swath of a contiguous range of tombstones, whether it's from a delete range or individual ones, both of those cause a big problem.
A
Corey, did you notice any difference if, say, you did small delete ranges versus individual ones? So if you had a delete range only covering a couple of keys versus individual keys, would that show any difference?
C
I don't believe so, because the problem, as I've seen it, is that when you do a lower-bound call, and that lower-bound call is within or at the beginning of a range that is full of delete tombstones or a delete-range tombstone, it first has to do a binary search to find the lowest key in that range, and then it has to go back and check whether it's deleted, either by looking at individual or range tombstones. If it finds that it is, it has to do that binary search again, and so if you have a million contiguous tombstones like that, it's doing a million binary searches, coming back and checking whether it's deleted, until it finds something that isn't. So, regardless of whether they're multiple tombstones or not, it ends up doing the same thing.
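[Editor's note] The seek behavior Corey describes can be illustrated with a toy model. This is plain Python invented for illustration, not RocksDB's actual code; it just mimics the access pattern: every seek is a binary search, and every tombstoned key it lands on forces another search.

```python
import bisect

def seek_past_tombstones(keys, tombstones, start):
    """Toy model of an LSM iterator seek: binary-search for the next key >=
    target, then check whether that key is covered by a tombstone; if it is,
    advance the target and search again. Returns (first live key or None,
    number of binary searches performed)."""
    searches = 0
    target = start
    while True:
        searches += 1
        i = bisect.bisect_left(keys, target)   # the binary search
        if i == len(keys):
            return None, searches
        key = keys[i]
        if key not in tombstones:              # the tombstone check
            return key, searches
        target = key + 1                       # dead key: search again

keys = list(range(1_000))
tombstones = set(range(900))                   # first 900 keys deleted
live, searches = seek_past_tombstones(keys, tombstones, 0)
print(live, searches)  # 900 901: one binary search per contiguous tombstone
```

The point is that the work is proportional to the length of the contiguous tombstone run, not to the number of live keys returned; a bounded iterator would stop at its upper bound instead of churning through the whole run.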
A
I wish I could understand why we saw the behavior we did back when this other PR that I did merged, where we basically only used delete range when we had large numbers of keys.
A
C
I'm
struggling
to
remember
it
now,
though,
I
think
I
did
see
some
notes
in
rocks
to
be
changelogs
about
there
being
bugs
with
delete
range
in
the
past
and
prior
versions.
That
may
have
caused
some
performance
type
issues.
But
I'd
have
to
dig
that
back
up
to
see
what
that
was.
It
was.
A
A
B
Yeah, well, I mean, at least that stuff makes sense indeed, and I think it's worth backporting as well. Meanwhile, we might want to proceed with using ranged delete unconditionally in main, and if it works properly, we can keep it for a while.
A
Let's see, moving on. It looks like this initial commit adding Arrow Flight functionality has merged. Since we don't have Eric here: is there anything interesting about that, or anything new?
D
Well, there's a lot of interest from our team in Arrow Flight and integrations, but we're still trying to find kind of the right model for that. This PR does it in kind of a weird, hacky way.
D
I'm probably not the best person to summarize this, but it's a fast RPC framework for low-latency data transfers.
A
This isn't related at all, then, to the Fast S3 stuff that we were talking about a while back?
D
Possibly. The idea for RGW is that we kind of use this as a side channel for clients to request, you know, or query specific parts of objects.
D
Yeah, I think there's potential for integrations with OSDs also, so that RGW could describe the layout of data and have it fetch things directly, but it definitely needs more elaboration.
A
All right, well, let's see. Next, I think this just got closed; it was a pretty old PR for rewriting the hardware docs. I suspect that maybe that's been superseded at this point, but it got closed, so I put it in here. Let's see, updated or missing part of this one... I don't know who wrote this: do not collection_list when removing a collection. Let me look at it quickly... okay.
A
Okay, sounds like, if people are concerned about it, we should maybe either request changes or maybe reject it, so it doesn't just languish.
A
Maybe just put a comment in with what you just said. If you think it's reasonable that we just have an option for it, maybe that's the way to proceed.
E
Well, the argument against this PR would also be that, as far as I remember, we check whether there are any objects in the collection anyway. So even if we skip trying to list any objects at that level, we still do that entire access to RocksDB in BlueStore anyway.
B
Well, Adam, I believe this PR works in a different way. It removes the check for empty collections at the lister level, and we still have it at the BlueStore level.
E
Okay, in that case, disregard; I might have misread the code here.
C
If BlueStore detects the fact that there are still objects in there during that iteration, and bubbles that up through the transaction, then we can remove the one in the OSD layer and end up having the same logic for retrying the PG deletion when that is the case, if new objects had been added, instead of doing the assertion failure. That way we still have basically the same invariant protection at the BlueStore level, but we avoid doing that expensive iteration twice.
B
Yeah, this potentially can make sense; I'm just not sure if we are able to report that for this specific function. We need to double check, since it's done inside a transaction.
Well,
definitely
OSD
code
for
PG
removal
doesn't
handle
that
at
the
moment,
but
it's
it
can
be
introduced.
What
is
labeled.
E
But now I'm thinking: maybe there is a way that we could store, in BlueStore internals, some kind of flag for whether the collection is empty. Because the problem here, as I see it, is that we do iterate over the entire collection in submit_transaction, and a delay there is poison for us. If the same delay happens when we list the collection, we would be fine with it.
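[Editor's note] Adam's flag idea could look something like the following sketch. This is hypothetical Python, not Ceph code: the point is just that emptiness is maintained incrementally as objects are created and removed, so the removal path reads a counter instead of iterating the key-value store.

```python
class Collection:
    """Toy collection that tracks its object set incrementally, so
    emptiness checks are O(1) instead of an iteration over the KV store."""
    def __init__(self):
        self._objects = set()

    def put(self, name):
        self._objects.add(name)

    def remove(self, name):
        self._objects.discard(name)

    def is_empty(self):
        # constant-time check; no listing required at removal time
        return not self._objects

c = Collection()
c.put("obj1")
print(c.is_empty())   # False
c.remove("obj1")
print(c.is_empty())   # True
```

In the real system the hard part would be keeping such a flag consistent across transactions and restarts, which is presumably why the discussion above is tentative.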
A
Sounds good. All right, moving on then. Let's see, next: Sam reviewed this mClock PR for adding the ability to handle high-priority operations. I didn't look super closely at it, but it appeared Sam did a pretty in-depth review, so that looks good. Moving along, let's see, there's been, I think, a little bit more discussion on Corey's PR for setting the RocksDB iterator bounds for the BlueStore collection listing. Was there anything new there?
A
All right, good deal. Next: upgrading RocksDB to the latest Facebook release. We discussed this last week. I just put a note in there that we can't actually adopt Facebook's branch; we have to use our own, because there's still that fix from several years ago that we decided we still want to have. So I just sent the author a note saying that we still need to do that, and if he could update the PR, that would be fantastic. And lastly, for updated PRs, there's this primary balancer scores work. Let's see, Laura, are you here today? I don't see Laura. She just added a slight suggestion to part of that code.
F
I'm here. Oh, okay, good. Hey, Josh.
F
There is a slight problem with the whole calculation, which is this: it's very easy to define the score when the primary affinity on all the OSDs is one, but it's a more conceptual problem how to actually define the score when every OSD has a different primary affinity, so that you have a number that is meaningful, that the user could interpret in a normal way. And so I added a unit test to run random tests, so that I could run them.
F
My goal was to hit all kinds of corner cases where this score doesn't make a lot of sense. For example, if we have replica three and all the OSDs have a primary affinity of 0.2 or something, you can't actually meet all the requirements without breaking the rules of the game, because you give out more primaries than what the primary affinity asks you to do. Obviously this happens when the total primary affinity is too low.
F
So I tried to do something such that in all these corner cases the score is at least zero, and in all other cases the score is always larger than one and is meaningful, something where you could look at the numbers and understand something from them. I still struggle with this, actually. I could easily explain the score when the total primary affinity is high, and it tends to be more and more...
F
It becomes an abstract number when the total primary affinity of all the OSDs becomes low, reaching near one divided by the replica count. So now I'm thinking about a bit of a different approach for computing the score when I have low primary affinity. I'm working on this, but I still need to handle the worst-case scenario.
F
The worst case is that in some cases the score would be smaller than one, and there will be documentation explaining that this means something is bad. But I'll probably be able to fix it. Just making the score always larger than one is simple; the complex part is making it meaningful, so that you could look at it and understand how much your system is balanced or not balanced. The goal is that one is perfect.
F
1.1 means 10% degradation in read performance; 1.2 means 20% degradation. That's the goal. I'm working on this because it's tricky when the primary affinity numbers are random. They shouldn't be, because they should have some logic behind them, but I don't want to count on exactly how systems are set up in the field and say that this is an illegal configuration. If it's accepted by Ceph, it should be handled correctly.
F
Currently there are minor tweaks to the numbers and to the unit test, making sure that I'm actually able to run it. Currently I have something like one failure of the unit test in 500 runs; I want to make it never fail. So it's still in progress.
I thought that I had the formula set very well and the mathematics were good, but it was incorrect. I'm still struggling with the edge cases of the math over there.
A
Good luck! All right, let's see, I think that's all I had for pull requests this week. Was there anything I missed from anyone?
A
All right, if not, then moving on. Josh, before I let you go, I see you've got a lot, so I want to squeeze something in quickly. Adam has been working on making snapshots and RBD mirror much faster, and his results are amazing.
A
I've
I've
linked
the
spreadsheet
there
for
anyone
that
wants
to
take
a
look,
but
the
gist
of
it
is
that
his
new,
this
new
code
is
is,
is
really
really
good
and
probably
I
I
would
say,
maybe
obsolete's
the
need
for
my
object,
defragmentation
branch
that
I
have
worked
on
because
it
it's
so
fast
that
I
think
it
it
doesn't
even
need
it
Adam.
Is
there
anything
you
want
to
to
talk
about
with
with
your
work
other
than
the
fact
that
it's
amazing.
E
Well, the code itself is not fast; it's actually quite slow. But the modification is that, during cloning of an object, making a duplicate, there is an aggressive procedure that tries to avoid the creation of any extra blobs. It tries very hard to keep the count of blobs at a minimum.
E
So
when
faced
with
the
problem
with
constant
up
dreaming
and
creation
of
new
snapshots,
we
do
not
increase
amount
of
blobs
in
Object
Store.
That's
the
speed
up.
It's
completely
unrelated
Mark
to
your
work
on
copying
objects
from
place
to
place,
because
data
of
objects
will
still
be
fragmented.
It's.
A
Yeah, Adam, the reason I said it the way I did is because I think, with your new code, it appears that essentially the only benefit from my version now is the fact that you're getting defragmentation. I don't think my code will in any way help improve CPU usage at this point, after your work is applied.
A
Not
to
say
that
we
we
still
might
want
to
defragment
objects
on
clone
just
to
to
you
know
at
some
point
reduce
the
amount
of
fragmentation
that
exists,
but
I
think
I
think
with
your
latest
code,
assuming
there
are
no
problems
and
it
works
in
production.
The
way
it
does
now.
It's
looking
really
really
good.
A
Yep, yep, all right. So that was all I wanted to bring up with that work; exciting stuff. So, Josh, I'm going to turn it over to you. It looks like you've got some really good updates for us here.
G
Yeah, well, I guess we'll see whether the quality actually is that great, but I did want to give an update, since I brought up these issues, I feel like, two months ago now, or something like that, with a break in between.
G
So, just a reminder: there are two things that we saw in Pacific, once we were upgrading between Nautilus and Pacific, that were puzzling, and this is mostly on the RBD side; we don't have a lot of data points yet for our RGW workloads. The first one is the write amplification issue. Not a lot to say here; we haven't been putting a lot of focus on it. I did appreciate your thoughts, Mark. I did quickly look at some blktrace output myself, and nothing was obvious.
What I started to do in Prometheus: we have Ceph reporting how many writes clients are doing, and then, of course, we have the disk write stats from every single one of the nodes. So what I computed was how many disk writes are happening, on average, for every single client write that comes into Ceph, and it's pretty clear across the upgrade threshold that it jumps; the amount depends on the cluster.
G
It jumps from two-to-three OSD writes per client write up to three-to-four per OSD. We're 3x replicated, so this is jumping from six-to-nine up to nine-to-twelve writes across the cluster for every client write. So it was pretty clear that, basically, it looks like, for some number of volumes, there is one extra write happening for every client write that comes into the cluster.
G
That's really as far as we've got. I don't know; I think we had a lot of questions in our minds, like: did the object map implementation change in some way? Could this be somehow snap-mapper related? We're just not sure. And especially since we're not convinced we're actually seeing this in our staging clusters; our staging clusters do not have the same workloads as production, so that makes it...
G
A problem, yeah, yeah, exactly. So I'll put it as: we're not convinced we see it in staging. The reason why we're not convinced is that, unfortunately, we don't have the Prometheus history to run the same level of analysis, because we only realized this was an issue once we started seeing it in prod.
G
Okay, we have one staging cluster where we do have the history, and it's actually really weird, because it looks like it actually has the amplification for, like, three days, and then it drops back down to normal again. But we have not seen that in prod. And where it dropped off in staging is when we did the conversions for column sharding in RocksDB, and so we were like: oh, that's weird; maybe we should just do that.
G
So we've done that for one or two (two, I think) full production clusters, and it did not make a difference there. So, look, we don't know, but that's, like, our single staging data point. I take almost any data from staging clusters with a grain of salt, so I wouldn't put too much weight on what we saw there. Okay, but yeah, exactly what you say.
G
We don't have any sort of isolated case where we know we can reproduce this, other than we just know it's happening for all of our production clusters on RBD. When I do the same analysis of disk writes per client write for our RGW deployments, I don't see any difference between Pacific and Nautilus, but we're not nearly as far along there yet in how many clusters we've upgraded, so I still need to see some across that boundary.
G
Yeah, okay, that's a good point; I haven't thought of overwrites yet. We were trying to brainstorm, like, what are all the RBD-specific things, and so, like, Alex, his first thought was: okay, well, let me just go and disable all of the RBD features for a volume in staging and see if it makes a difference. But again, it's like: are you actually seeing this in staging? We don't know. So, yeah, anyway, that's where that's at. I don't feel like we quite have enough data here to, like, write a ticket.
G
Okay, all right, that's interesting, because my thought immediately went to deferred writes, because I know there were changes around that in, I think, the Pacific time frame, but at first I thought it applied only if you had, like, hybrid setups, and so I just kind of discarded that idea.
B
Well, and so the question is: what are your main devices? Are these spinning drives or SSDs?
G
This is all SSD: a mixture of SATA on our older stuff and NVMe on our newer stuff, but we see the same amplification in both cases.
E
If both drives are SSDs, then the defaults we use for deferred writes basically eliminate any unnecessary deferred-write actions. It means that if we overwrite a partial allocation unit, then we do a deferred write, but that's the only case. If a device is a spinner, then we also use deferred writes for performance, and in that case we will do extra...
A
So what about the case where they're on flash, with only a single partition, and they're doing overwrites to RBD? What would be the cases in which we'd be doing deferred writes?
E
But that brings me to a question, Joshua: did you observe some performance degradation due to the additional writes, or did you just notice that there is more I/O to the device?
G
We definitely noticed the degradation on our older clusters at first. We've since made some tweaks to the hardware configuration to compensate, but we definitely saw a latency increase there. After we compensated for that, honestly, the difference is minor, and actually the only place I generally see the performance difference is in the p50, the median write latency. The median write latency tends to go up with Pacific.
G
So we're a little bit concerned that maybe something will happen there, but, yeah, now that we've mostly mitigated the older-cluster issue, we aren't actually seeing a major end effect from this, and at this point it remains mostly a curiosity, and then a concern, especially if we do see it on the RGW side, because we are running spinners on our RGW side without a separated WAL.
G
I'm fairly convinced that we don't, because the data was kind of mixed. I've gone back and forth on that a few times, because every time I look at the data, sometimes I decide yes, sometimes I decide no; if it is there, it's not obvious.
A
We discussed that PR, or those PRs, I think, quite a bit several years ago in this meeting, and there were definitely performance effects early on, and, Igor, I think you did quite a bit of work to mitigate a lot of that.
G
What I'm seeing here... I'm just looking to see if I can share this now; there's some stuff I don't want to share in this tab. Okay, what I'm seeing is, so, that earlier calculation I was talking about, where I'm calculating how many disk writes are happening per client write: across the upgrade threshold we went from an average of 9.5 disk writes...
A
That's really big. Okay, so, yeah, definitely, we should track this down.
B
Say it again: what was the difference?
B
So I found the ticket which fixes higher-than-expected deferred writes for legacy clusters in Pacific, and this happened in the 16.2.6 Pacific release. Okay.
A
I'm trying to think if there's any way, Igor, we could verify whether or not it was those changes to better support spinners that might have caused this, because they're running NVMe drives, if there's any way we can...
B
Well, first of all, we can enforce HDD settings for the drives, and hence, once applied, it should work similarly to spinning drives in terms of deferred writes and so on.
A
What I'm wondering is whether it's possible that any of the changes that were made at that time could have impacted the deferred write path on flash drives.
G
So the example I just gave was from a production cluster. That production cluster would have been... let me check its history quickly here. I was going to say it was deployed on Nautilus, but I should make sure that it wasn't deployed on Luminous.
B
Well, I need to run through all these changes once again to answer this question.
A
Oh, sorry, did you want to talk about what you found?
C
I was just recalling the one that Igor and Adam had both worked on, where there was a bug in Pacific (it's still there, since we haven't had a release yet) where older clusters that still have the 64k min allocation size, and that have been upgraded to Pacific, aren't deferring. It's based upon the min_alloc size, but I don't know if that makes sense.
A
Good, yeah. I was pretty sure Luminous had a 4K min_alloc size for SSDs, but, you know, at one point earlier on it was 16k.
E
I guess one more thing could have happened. Do you use one device, or two devices with separate block data and RocksDB data?
E
Are you able to distinguish writes that came from block data writes from writes that go to the RocksDB write-ahead log?
G
If I look at the blktrace output, I'm pretty sure I can figure out which is which based off of the write patterns, but I don't think there's any way for me to do that at the stat level.
E
But to make sure: when you say that there is an increase of write access to the device, are you talking about the sum of block data writes and any metadata writes? Okay.
B
Well, it would be great if you could share these numbers, please; maybe a short summary.
G
Okay, so I reset the counters. In the last, like, five seconds there were three thousand deferred writes, so, like, 600 per second, roughly; very rough numbers.
G
Yeah, okay, yeah. So, very briefly, if I look at this, the done counter... sorry, I'm trying to find the one that would be, like, what is the write count here... and, reading this correctly, it looks like the majority of the writes that were done...
E
Yes, that could be. We just don't really test that configuration, I guess.
A
Yeah, interesting.
A
All right, well, anything else from anyone before we wrap up today?
A
Thanks for sticking around, everyone; I know we went a little over. All right, well, let's wrap it up, and we'll see everyone next week, then.