From YouTube: 2019-11-07 :: Ceph Performance Meeting
A: We changed it a couple of months ago, and Igor has actually been documenting some issues that users have been having with RocksDB corruption. We don't know that this is it, but it's one of the changes that we made, so right now I have a PR in flight marked do-not-merge, and it might be one of the things that we revert to see if that makes the problem go away.
A: It's measured in how we use it, and once we do use it, we should probably start thinking about issuing a memtable flush. We can issue a number of them, I think, before we actually need to flush, but if we have hundreds of these things happening quickly, it can very quickly start making everything very, very slow. So that's basically the high-level view of it. All right, updated PRs.
A: Let's see, this one, the RBD write log: I think there's still a request for Jason to review that. He had kind of given a general thumbs-up about splitting that away from a much bigger PR that they had previously submitted, but no specific review yet. I think Sage did review this enhanced OSD affinity PR, mostly with just some comments, and then some questions about how they're doing it and whether they need to do it that way.
A: The per-shard entry count guard during bucket listing: I don't think that got any real major updates, just rebasing to keep it working. I'm going to try to do some additional testing with that as time permits. And the objecter work is still, still desperately being worked on; there are some new updates, but testing, I think, was oddly not passing Jenkins or something, so he's working on that, and then "objection sustained" as well, so I'm glad. Okay, CID, any news on that?
A: Good deal, hopefully soon. Agreed. All right, so that's all I've got for updated stuff this week. I've got a couple of things myself in the no-movement list that I need to get back to and look at again, but I don't know when I'm going to get to them. I think that's the case for a lot of folks, but we do have a lot of outstanding performance PRs.
D: Right now the current settings that are not default are the scrub sleep at 0.1; nodeep-scrub, which I turned on; and then, also to try and affect the scheduling, I set the deep scrub interval to a month instead of a week. Those are all changes since we had the issue, not before.
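(For illustration, a minimal sketch of applying those non-default settings with the standard Ceph CLI; the option names are the usual OSD scrub knobs, and the values are the ones just mentioned.)

```python
import subprocess

for cmd in [
    # Sleep 0.1 s between scrub chunks to throttle scrub I/O.
    ["ceph", "config", "set", "osd", "osd_scrub_sleep", "0.1"],
    # Cluster-wide flag that should suppress deep scrubs.
    ["ceph", "osd", "set", "nodeep-scrub"],
    # Deep-scrub interval: roughly a month instead of the one-week default (seconds).
    ["ceph", "config", "set", "osd", "osd_deep_scrub_interval", str(28 * 24 * 3600)],
]:
    subprocess.run(cmd, check=True)
```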
E: Okay, okay, then that's something we definitely need to investigate, whether we've hit similar issues in our latest releases. The other issues that we did receive were in earlier releases, and they fixed a bunch of them. I think David, who was also on the call, can speak more about it. David, are you here? Yeah.
G: That's what we were most surprised by. We were hoping that once we set nodeep-scrub they would trickle down to zero, but we definitely saw, for sure, say we were at 62, then we definitely saw 63, and we...
G: I guess the question would be, in the meantime, with the bug as it is: are there any kind of thoughts? We obviously need to try to avoid this, so our thoughts were, like, maybe we just write a script that would kind of manually start scheduling these deep scrubs ourselves. We've...
G: Like, is the model basically to disable deep scrub and then schedule them yourself? Or we saw other people say, you know, make the interval very long, so long that it hopefully never actually gets scheduled, and you schedule them yourself. So we're just trying to get any type of pointers, because basically we wrote a script.
G: I was going to do something along the lines of: schedule the oldest five, but make sure there are no more than, say, 30 deep scrubs happening at a given time over the whole cluster, or something like that. I mean, that's what we were thinking about doing, but we're open to any suggestions of what we could do for the time being to avoid cluster chaos.
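(For illustration, a rough sketch of that kind of script. It assumes the pg dump JSON carries pgid, state, and last_deep_scrub_stamp per PG, which it does on recent releases, though the exact nesting varies; the batch size and cap are just the numbers mentioned above.)

```python
import json
import subprocess

BATCH = 5            # kick off the N oldest PGs per run
MAX_CONCURRENT = 30  # cluster-wide cap on in-flight deep scrubs

out = subprocess.run(["ceph", "pg", "dump", "--format", "json"],
                     capture_output=True, text=True, check=True)
dump = json.loads(out.stdout)
pg_stats = dump.get("pg_map", dump)["pg_stats"]  # nesting varies by release

# How many deep scrubs are already running?
active = sum(1 for pg in pg_stats if "deep" in pg["state"])

# Oldest deep-scrub timestamps first.
candidates = sorted((pg for pg in pg_stats if "deep" not in pg["state"]),
                    key=lambda pg: pg["last_deep_scrub_stamp"])

budget = min(BATCH, max(0, MAX_CONCURRENT - active))
for pg in candidates[:budget]:
    subprocess.run(["ceph", "pg", "deep-scrub", pg["pgid"]], check=True)
```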
D: I mean, we saw those settings, but we don't really have a nice off-hours window, just because when people aren't here during the day, we have other scripts running that are actually, a lot of the time, more demanding on the file system. So the off-hours scheduling like that, I don't think, really fits our use case. And I didn't see any high OSD loads, if that's what was causing the long deep scrubs or...
F: Actually, it's sort of a little subtle here. Every time a regular scrub happens, so that's normally daily, it checks to see whether it's now past the deep scrub interval. So if you set the regular scrub to a month and the deep scrub to a day, if you inverted them like that, it would still not deep scrub every day; it's only scheduling the regular scrubs.
F: It wouldn't make a difference as far as the flag goes. So with the nodeep-scrub flag set, every day it checks: okay, should I run the scrub? And then, even though it's been a week or two weeks, whatever your setting is, it's supposed to look at the deep scrub interval and the nodeep-scrub flag and then just say: okay, well, I'm not going to do it anyway. So that check only happens if you have regular scrubs.
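(For illustration, a simplified paraphrase of that decision in code; the names here are made up, and the real logic lives in the OSD scrub scheduler.)

```python
import time

def should_deep_scrub(last_deep_scrub, deep_scrub_interval, nodeep_scrub):
    """Paraphrase of the check described above: the deep-scrub decision is
    only evaluated when a regular scrub is being scheduled, and the
    nodeep-scrub flag vetoes it even if the interval has elapsed."""
    overdue = (time.time() - last_deep_scrub) > deep_scrub_interval
    return overdue and not nodeep_scrub
```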
I: The ticket, or whatever the card in the backlog was, was to fix the scheduler so that it gets rid of that minimum, so it's a minimum of one; that's smarter scheduling, so that any OSD is only participating in a single scrub at a time. But I think in reality, what you want, in order to reduce visible client impact...
G: If there were a way for us to set the max number of total scrubs, like percentage-wise, like you just said, like 20%, then I could basically dial that in to what's acceptable for my cluster, and that way I'd have the granularity to say: I never want more than this much of my cluster scrubbing, or something like that. If that was implemented, it would do exactly what we were kind of thinking of doing with the script anyway, yeah.
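(There is no single percentage-of-cluster knob like the one being asked for here, but a per-OSD concurrency cap does exist; a minimal illustration, with the value shown being the usual default.)

```python
import subprocess

# osd_max_scrubs caps concurrent scrub operations per OSD (default 1).
# It limits per-OSD overlap only, not a cluster-wide percentage.
subprocess.run(["ceph", "config", "set", "osd", "osd_max_scrubs", "1"],
               check=True)
```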
G: We're trying the sleep, and we're trying another one, whether that helps or not, and we will definitely put a ticket in about the deep scrubs still being scheduled. We'll send our emails, like you said, and see if we can actually get into the ticket system. All right, great; again, we always appreciate the help.
A: All right, should we move on to DeleteRange? Yeah, so I guess we went over the gist of this before, but the deal here is that when we started defaulting to using this, it doesn't impact us most of the time. But there's one test case where it shows up very easily, which is to create a large number of RGW buckets on one OSD and then delete all of them. And you can...
I: It's just weird that it's implemented that way, yeah. It seems we have no...
A: We probably at some point need something that will issue a flush, and not just, you know, wait around letting these pile up. Even the RocksDB folks have talked about implementing some kind of automatic flushing behaviour if there are enough range-delete tombstones; they've got the same idea.
A: How many is a really good question, right, and how big they should be. If you've got a million keys to delete and you do that over 200 range deletes, then is that the appropriate point to flush? Should you do it sooner and then have multiple flushes? I don't know that there's a clear answer to that, and it probably depends on the underlying hardware and how much of the DB you want to churn versus how much you want to be in compaction.
A: Yeah, basically everything that's in the current buffer for the write-ahead log that you're writing into gets flushed into level zero, and then you're issuing compaction. That's why it makes it fast, right: because now anything new coming in is coming into a new buffer. You're basically taking what you've got and moving it into, you know, the compaction workload, but now you've got a brand-new buffer that you get to write into quickly.
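(For reference, that flush-plus-compaction cycle can be triggered on an OSD's RocksDB by hand; a minimal sketch, with the OSD id purely as an example.)

```python
import subprocess

# Ask osd.0 to compact its RocksDB (BlueStore's metadata store). This
# flushes the active memtable to L0 and compacts, which is what clears
# piled-up range-delete tombstones, at the cost of a burst of I/O.
subprocess.run(["ceph", "tell", "osd.0", "compact"], check=True)
```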
I: Do you know if we actually have an API to check that? Can't we ask, so that every time we do a range delete, we could say: is this the same buffer as last time? If so, increment the counter; if not, reset it to zero or to one. And then, if it hits, you know, 20, or whatever we decide the magic number is, then force a flush.
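(A toy sketch of that heuristic, purely illustrative: the real change would live in BlueStore's KeyValueDB layer in C++, and both the buffer-identity query and the flush hook here are hypothetical stand-ins.)

```python
class RangeDeleteFlusher:
    """Count range deletes per write buffer; force a flush past a threshold."""

    def __init__(self, db, threshold=20):
        self.db = db                # stand-in exposing active_buffer_id()/flush()
        self.threshold = threshold  # the "magic number" discussed above
        self.last_buffer = None
        self.count = 0

    def on_range_delete(self):
        buf = self.db.active_buffer_id()  # hypothetical identity query
        if buf == self.last_buffer:
            self.count += 1
        else:
            self.last_buffer, self.count = buf, 1
        if self.count >= self.threshold:
            self.db.flush()               # memtable to L0; start a fresh buffer
            self.count = 0
```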
A: That's maybe why we want to set the threshold really high, right? Like, if you're deleting a million keys, maybe it doesn't matter that much that you're issuing flushes a lot; maybe you really want to be in front of it, I don't know. In the case where you've got an object that has 2,000 omap entries, we probably don't want to range-delete on it anyway, right? You probably...
I: Well, maybe. My suggestion is to look at the RocksDB API and see if there's a quick way to get the current number for whatever the thing is that you're writing into, because if so, then it's really straightforward to just count per write buffer and put the threshold that way.
A: Well, we've seen stuff. I mean, over the last three years, people keep talking on the mailing list about random RocksDB corruption issues, but it comes up like once every four or five months: someone has some kind of CRC checksum-type error in RocksDB, corruption in the API, and we...
B: Yeah, well, actually, I don't have any evidence of this either. Let's go over the part we changed on our side: we have some additional support in BlueFS for prefetching. And, well, actually, I can see two branches in the RocksDB code which call checksum verification: the first one is regular reading, and the second one is retrieving data from the prefetch buffer, which might actually not be accounting for prefetching.
B: But, well, the offsets that are reported in the logs are completely unaligned with the information in the SST file. Well, it doesn't complain about reads at page boundaries; the offsets are not aligned with pages, and actually the checksum values it's presenting do not correlate with the data that's located in the SST files.
B: But yeah, the checksums do not correlate with the actual data in the SST file. And, well, at least I have just a couple of such files, and there are zeros at those offsets in the SST files, but the retrieved value is definitely not zero. And what is very interesting about the retrieved value: it's always the same for all the different clusters that are reporting the issue. So I suppose, well, it doesn't look like a read error; I mean, it doesn't look like data corruption on persistent storage. It's... I don't...
I: Yeah, so the thing I think we need to be careful of is that when we have more PGs, it increases the probability that a double-OSD or triple-OSD failure will lose data, for example when you have OSDs randomly corrupting themselves because of a RocksDB bug. So in general, I think we want to try to keep that PG count as low as we can, provided it's not, you know, trading off against other issues, I guess.
A: So the very specific reason I brought this up this morning is because there was a QE case where they had a pool spread across 8 PGs, and they were trying to expand the amount of space in the cluster, and backfill was complaining that it didn't have enough space to finish backfill for a particular PG. I assumed it was because they were trying to backfill onto an existing drive that didn't have much space...
E: The fix should have that handled, though, but we should double-check. So, Mark, can you send me a link to the BZ or whatever you're looking at, and I can try that? Yeah.
A: If we can first change the PG log lengths and have more PGs up front, that will avoid rebalancing and, you know, all the backfill work and everything else. If we could start out with more PGs but kind of tweak the PG log lengths on a per-pool basis... essentially, that's what the autoscaler is really kind of doing, right, is you're...
A: No, the other way around. So right now we have, like, a total of 100 PGs on an OSD, right? So you can very easily end up in a situation where you've got a lot of pools and the autotuner is saying: all right, this pool gets 16, this pool gets 8, and this pool gets 4. What I'm saying is, well, maybe we can start out with a larger pool of PGs to work with, if we are willing to change the PG log lengths of the PGs in each pool.
A: You know, say 3,000 is what everything gets: instead of having the number of PGs shrink in that pool, maybe we say that the PG log length for the PGs in that pool shrinks instead. You still have the same number of entries overall; it's not like you're shrinking the total number of log entries you've got, you're just distributing them across more PGs instead.
I: I think the PG logs are sort of a red herring. Separately from whatever the autoscaler is doing, we should be choosing a PG log length per PG that's a little bit smarter than just a fixed value per PG. I think the real question is how many PGs we should have, and that's... you don't want a lot of empty PGs.
I: Now there's a hard-coded value, it's like 4, I think, by default, and we could make that higher or something. Or we could ask the user to say: this pool I expect to have high performance, and so therefore I'm going to set a minimum of something. Or we could try to make the cluster intelligently choose a minimum based on the size of the cluster; so if there are only a few OSDs, then there's no reason to have more than four PGs...
I: Anyway, I guess what I'm getting at is that there are cases where a pool is small and it's not performance-sensitive, and you really want, like, one PG or four PGs or something like that, right? And there are cases where a pool has data that you expect to get reasonable parallelism to. And we can do one of two things. We can either say that if it's a pool that you want performance on, even though it's small, then you set a minimum on it, set the min PGs to whatever you want, 128, something, whatever; or the other way:
I: If it's a pool that you don't expect performance on, then you set the min PGs to be a small number, and you make the default min PGs something larger. I don't really care which one we do. Right now we're basically defaulting to four, and so if you need more than that, you're expected to set it higher. But we could flip that around: we could make the min PGs default to 64 or 32, or something; 32 is probably sufficient.
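(The per-pool floor being discussed already exists as a pool property that the autoscaler honors; a minimal illustration, with the pool name and value as examples.)

```python
import subprocess

# Keep the autoscaler from shrinking this pool below 32 PGs.
subprocess.run(["ceph", "osd", "pool", "set", "mypool", "pg_num_min", "32"],
               check=True)
```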
I: That's the name of the game if you want to do it automatically, unless the user tells you what the pool is going to be for. So we should encourage users to set the target ratio or the target size for the pool, and then we can just start with a number that's reasonable. But if they give you no information, then you have no idea whether it's going to be an empty pool or a big pool.
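(Those hints are the standard autoscaler pool properties; a minimal illustration, with pool names and values as examples.)

```python
import subprocess

# Hint the autoscaler that this pool will hold ~20% of cluster capacity...
subprocess.run(["ceph", "osd", "pool", "set", "rbd_pool",
                "target_size_ratio", "0.2"], check=True)

# ...or give an absolute expected size in bytes instead.
subprocess.run(["ceph", "osd", "pool", "set", "small_pool",
                "target_size_bytes", str(100 * 1024**3)], check=True)
```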
I: I mean, I'm just thinking that every time there's a customer issue where they just have way too many PGs in the cluster, it makes things dicey. It's not the memory, well, it is the memory usage, I guess, at some level, but it's not necessarily the log; it's dealing with the past intervals and the peering, and when they get a cluster that's dug itself into a hole and can't climb out again, that's...
H: The question I had wanted to bring up is, because a lot of those things, like with heartbeats and interconnections, have been, I don't know, more of an issue with larger clusters. With a cluster of, like, three or six OSDs, does it matter that much if we have, like, 300 PGs or something? I don't know, I don't...
A: Yeah, I mean, I don't know when the last time I ran one was, but I've run tests where I've had, like, a thousand PGs on the OSD, on multi-OSD clusters. And, you know, this is small scale, it's like four OSDs or whatever it was, but it works just fine, there's no problem. But I'm not really stressing them in weird ways either, I guess.
H: What I wanted to bring up was that Ben had done some testing with the autoscaler and was seeing some effects on performance, due to, it looked like, primarily recovery in his case. But, Mark, you had mentioned that you'd in the past run tests with a single OSD and seen some impact from just plain splitting and merging, even with BlueStore. Yes...
I: It has to quiesce I/O to the two PGs that are going to be merged for, like, a full OSDMap epoch cycle, so there's enough known stable state to sandwich them together, and then they get unblocked. That's why the merging goes one PG at a time, so only one PG gets blocked for, like, three seconds or whatever, and then it releases. Okay.
I: Yeah, I mean, I'd have to go back and look at it to remember exactly what would be needed in order to get rid of that, but at the time it was so much simpler just to do the pause.
I: Oh, I mean, this should be rare, right? It's unusual that you're going to be merging, because it should only happen if you have a pool that had a lot of data and then you deleted it all, so the pool is shrinking, or the cluster is shrinking, or something, unless...
I: They don't make room for each other; pools size themselves based on their size relative to the total cluster. The autoscaler looks at each pool independently: how much data is this, what fraction of the total data in the cluster is it, how many PGs does it have relative to the total target number of PGs in the cluster, should it get bigger or smaller? And if it's off by more than 3x, then it will bump it up or down, yeah.
I: Or if you manually set it really high and you overshoot, then it will scale it down for you; or if you blow up a pool so that it gets really big and then you delete a bunch of data, it has to get really small. But again, it only makes a change if it's off by more than 3x, so you have to, like, quarter the amount of data or whatever before it's really going to... yeah.
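(A toy sketch of that 3x rule, simplified from the description above; the real autoscaler logic lives in the mgr module and handles more cases.)

```python
def autoscaler_decision(current_pg_num, ideal_pg_num, threshold=3.0):
    """Only change pg_num when the ideal is off by more than `threshold`
    in either direction; round down to a power of two for simplicity."""
    ratio = ideal_pg_num / current_pg_num
    if ratio > threshold or ratio < 1.0 / threshold:
        return 1 << max(0, ideal_pg_num.bit_length() - 1)
    return current_pg_num  # within 3x either way: leave it alone

# A pool at 128 PGs whose data share suggests 16 gets shrunk to 16,
# but one whose share suggests 64 (only 2x off) stays at 128.
```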
H: I think after he set the, uh, the sleep setting, then, yeah, it was working there, I think. The one thing we did notice, though, was that he's running on AWS, and the disks there, even though they'd show up as non-rotational, don't behave the same way as the regular physical SSDs that we'd see in a data center; they're burstier and a lot slower, yeah.
A: Well, it's kind of like... I don't exactly know how AWS works, but my understanding is that you kind of use up your credits and then you regain them slowly over time. So if you invoke a background workload all of a sudden, you burn through all the credits that you have right now, and then everything else is slow, and then you slowly get them back. But, you know, if you have something you need to do right now, it doesn't help you, right?
I: I have to run, I've got another meeting, but yeah, I think I have two main things. One is that if we want to change the behavior so that, instead of setting a high minimum floor for high-performance pools, we default to a higher number and you have to set it lower for a non-performance pool, if we want to make that change, we can do that; I don't really care either way, and people should weigh in. And then the other thing is...