From YouTube: Ceph Performance Meeting 2022-04-07
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A: Okay, max 29k. Okay, so yeah, once I've got some of this Quincy stuff taken care of, I can maybe try to take a look and help debug what's going on there.
A: It is interesting that you saw it with a second NVMe drive as well. Who knows, maybe master is messed up or something, but I'll see, if I can.
A: All right, so what Gabby's talking about is that he's had trouble getting good test results out of one of our fast NVMe test nodes internally with master for the last couple of weeks.
A: In the past we've seen around 70 to 80 thousand small random write IOPS, and he's seeing more like 20 to 30 thousand right now. I've tested something fairly recently that I thought was based on master on the Mako nodes and still saw high performance, but I should go back and verify that it was actually on master. I think it was, but anyway, things to figure out and look at. Okay.
A: So this week I didn't quite get through all the old PRs, but I don't think there have been a ton of updates on them, so I'm not going to worry about it too much. We do have two new PRs that I made this week, and these both relate to the AVL allocator topic that we've been discussing. So, a reminder:
A: Last summer we changed the way that we determine when to go into best-fit mode in the AVL allocator, instead of continuing on in near-fit mode. The changes we made basically set limits based on the number of bytes of distance we have to search, and also on the number of iterations that we search.
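As a rough illustration of the idea (a minimal sketch with made-up names and limits, not the actual Ceph allocator code), the capped search gives up once it has either traveled too far through the free extents or examined too many candidates:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical sketch of a near-fit search with byte-distance and
// iteration caps, loosely modeled on the change described above.
// Assumes free_extents is sorted by offset. None of these names
// come from the Ceph source.
struct Extent { uint64_t offset; uint64_t length; };

std::optional<Extent> near_fit_search(const std::vector<Extent>& free_extents,
                                      uint64_t want, uint64_t start_offset,
                                      uint64_t max_byte_distance,
                                      uint64_t max_iterations) {
  uint64_t iterations = 0;
  for (const auto& e : free_extents) {
    if (e.offset < start_offset) continue;        // search forward from the hint
    if (e.offset - start_offset > max_byte_distance)
      break;                                      // traveled too far: give up
    if (++iterations > max_iterations)
      break;                                      // too many tries: give up
    if (e.length >= want)
      return Extent{e.offset, want};              // first (near) fit wins
  }
  return std::nullopt;  // caller falls back to best-fit mode
}
```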
A: The gist of it right now (sorry, you should look at the "avl adaptive enough" tab) is that on these Samsung drives, whenever we use these limitations, we see a fairly big slowdown in large sequential writes. It appears to be because the allocation pattern changes dramatically. Instead of doing really linear allocations straight across the drive like we did previously, we now see this kind of flurry of I/O spread across the disk, all 64K. We're not fragmenting smaller than that; we're really consistently writing 64K I/Os. But this pattern of spraying the entire block device with I/Os makes the Samsung drives unhappy.
A: I did try increasing the parameters that we'll have in that PR, kind of 4x and 8x. Those are on lines 8 and 14 in that first Mako set of columns. That helps, but it doesn't eliminate the issue.
A: The second PR makes it so that, instead of deciding when to switch into best-fit mode based on, you know, the cycles and bytes, it just does it based on the amount of time you've spent in near-fit mode; when you exceed that time, then it switches. The default I've got right now is one millisecond. That was enough to keep the behavior in the fast mode, so that seemed to work.
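A minimal sketch of the time-budget idea (hypothetical names; the PR's actual implementation may differ): the allocator records when the near-fit search starts and bails out to best-fit once the elapsed time passes a configurable budget, with the 1 ms default mentioned above.

```cpp
#include <chrono>

// Hypothetical sketch: switch from near-fit to best-fit based on elapsed
// search time instead of bytes/iterations traveled. The 1 ms default
// matches the value mentioned in the discussion; everything else is
// illustrative.
class SearchBudget {
  std::chrono::steady_clock::time_point start_;
  std::chrono::microseconds budget_;
public:
  explicit SearchBudget(std::chrono::microseconds budget =
                            std::chrono::milliseconds(1))
      : start_(std::chrono::steady_clock::now()), budget_(budget) {}

  // True once the near-fit search has used up its time budget and the
  // allocator should fall back to best-fit mode.
  bool exhausted() const {
    return std::chrono::steady_clock::now() - start_ >= budget_;
  }
};

// Usage inside a search loop (pseudostructure):
//   SearchBudget budget;
//   while (more_extents_to_scan()) {
//     if (budget.exhausted()) return best_fit_fallback();
//     ...
//   }
```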
A: So yesterday, at the CLT meeting, we kind of decided that we want to look at more tests, especially on hard drives, and see what the impact is.
A: David Galloway very kindly got one of the older test nodes set up with CentOS Stream for me really quickly, and so I did start getting results from those. They have Intel P3700 NVMe drives and they have hard drives. I looked at the P3700s bare, and I don't have results yet for my PR, but I do have results for Quincy with the kind of default current behavior, and then with it reverted back to the older Pacific behavior, and the effect is minimal there.
A: There seems to maybe be a slight effect, because it's always a little slower with the Quincy defaults. It's not dramatic, maybe like one percent, but it does seem to be consistently a little slower in that mode. Whereas the hard drive case is really interesting. The tests that I run are really short; this was kind of the minimal set of tests I could run to showcase the behavior on the Samsung drives.
A: It looks almost identical; there's very, very little difference. But there haven't been a lot of writes, like small random writes, to the drive, and I can verify that I still see the same patterns I see on the NVMe drives, though it's a little different in this case. We still see these 64K I/Os, but they're lumped together in groups of 64. In the Pacific case, the old behavior, we're in reality writing 64K writes completely sequentially.
A: Then in the current default Quincy behavior, we see 64K writes grouped into blocks of 64 that are written sequentially, and then those groups are scattered around the disk. And when I looked at longer-running tests, where instead of running these tests for like 30 seconds they're running for 30 minutes, it actually looks like this change was maybe slightly better.
A: When I disabled it, I saw lower numbers in some of the tests, at least initially; by the third iteration maybe less so, but it was definitely different. So it's possible that the current behavior in Quincy is actually doing better on hard drives. Maybe; it's hard to say. That's where I'm at right now. I probably need to do a lot more testing on this, but we do need to make a decision on what to do for Quincy.
A: So, really consistently, we've seen that this behavior on the Samsung drives is bad. These drives hate the change that we made, and we're able to get into the mode pretty easily where they're showing fairly significantly degraded performance. Whereas on other NVMe drives, the Intel drives, there's very little difference; they don't seem to care one way or another. Now, on hard drives, we'll have to see how this plays out. But yep, that's basically it.
A: So, one question, since you and I have been discussing this quite a bit. I wanted to ask you about the test that you've been running, that workload test. Do you have any idea what's happening in that test where you're showing the current behavior giving, you know, faster allocations? It makes sense, and I agree with you that it would do that, but do you know what the workload is there?
B: Well, I think the issue is not the payload itself but the fragmentation of the disk space. The replay payload is pretty trivial; I'm trying to replicate the payload coming during DB compaction.
B: The user allocation unit is 16K, and it's a single volume that the DB shares, and hence, in highly fragmented space, it might be tricky for BlueFS to get 64K contiguous blocks. And it looks like, without these limits, the AVL allocator might take a pretty long time to search for such contiguous blocks.
B: Well, again, it's not exactly regular operation; it was compaction. For regular operation, I can share some latency graphs from before, with the default hybrid allocator, and then after switching to the stupid one, and the difference is crazy, yeah. And on the same cluster I performed DB compaction, and for the hybrid allocator it took something like 50 minutes, versus 5 minutes on the stupid allocator, yeah.
B: Right, right. But what compaction requires from the allocator is the repetitive allocation of these 500K chunks of 64K contiguous blocks. So it's probably the worst pattern in the case of fragmented space. So yeah, every allocation
B: needs to look up a contiguous block, which is a tricky scenario. But again, the regular operation was crazy as well. And actually, if I compare allocation durations for the stupid and hybrid allocators, I can see something like two microseconds versus one or two milliseconds.
B: So that's actually a great difference as well, and it might have a pretty significant impact on the overall performance, because while the allocator searches for this contiguous block for two milliseconds, it's locked, so other operations are not able to proceed with allocation. So, in fact, with a two-millisecond allocation duration, you can get something like 500 allocations per second.
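The back-of-the-envelope here is just the reciprocal of the per-allocation latency, since the allocator lock serializes the calls:

```latex
% Throughput ceiling when a lock serializes allocations:
% at most one allocation at a time, each taking t seconds.
\[
  \text{allocations/s} \le \frac{1}{t}:\qquad
  \frac{1}{2\,\text{ms}} = 500/\text{s}
  \quad\text{vs.}\quad
  \frac{1}{2\,\mu\text{s}} = 500{,}000/\text{s},
\]
```

which is the contrast between the hybrid and stupid allocator durations just quoted.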
B: So if you have pretty fragmented space, again, it might be tricky to find contiguous blocks. That's the key point here.
A: Yeah, but the reason I was asking is, when you're seeing some of these big numbers come back, that's for a 512K allocation?
B: Well, this compares the new patched hybrid allocator and the original one. The original behavior is much worse than the stupid one or this patch; I mean the patch which introduces these search limits, not your new patch.
B: So you can see, yeah, you can see the duration, then want, unit, max, and hint in hex. So the first number is the requested block size, then the allocation unit, and then some additional parameters. So it was 128K chunks.
A: Okay, okay. So in this case we could either get a 128K contiguous chunk back, or two 64K chunks, or, were you saying, 16K? Yeah.
B: In your case, it might get even 4K chunks. Since we now have a 4K allocation unit for both SSDs and HDDs, potentially you can get down to 4K blocks
A: Blocks back from the allocator, mm-hmm. I don't think we are, because it looks like I'm always writing out in units of 64K.
A: Well, yeah, the fact that you see this huge difference when you switch... basically, I think what it means is that we're only spending a very small amount of time in best-fit, sorry, near-fit, and then switching to best-fit right away. Almost to the point where I wonder how often we're even using near-fit in these tests that you used.
A: Yeah, in my tests it's looking like we're really quickly going into best-fit as well, almost to the point where I wonder if you'd be better off just not even bothering doing the near-fit search at all.
B: Because, as far as I remember, there are some tunings in the allocator which enforce switching to that second mode in the case of a pretty high full ratio, like when ninety percent of the disk is full.
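As a toy version of the kind of tuning being recalled here (illustrative only; the real thresholds, names, and switching logic live in the Ceph allocator code), the mode choice might hinge on a full-ratio check:

```cpp
#include <cstdint>

// Illustrative only: force best-fit when the device is nearly full,
// since near-fit over mostly-allocated space wastes time. The 0.90
// threshold echoes the "ninety percent" figure from the discussion.
enum class FitMode { near_fit, best_fit };

inline FitMode choose_mode(uint64_t free_bytes, uint64_t capacity_bytes) {
  double full_ratio = 1.0 - static_cast<double>(free_bytes) / capacity_bytes;
  return full_ratio >= 0.90 ? FitMode::best_fit : FitMode::near_fit;
}
```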
B: But you're saying that the full ratio for the disk in your case is pretty low.
B: So we need to revise this logic for mode switching, and maybe take something from there, or something.
A: Yeah. So, in terms of the performance problem, I actually recorded the blktrace data so that I could then play back the same workload using fio, ignoring timestamps and that kind of thing. And if I just replay that write workload on the drive in isolation, it does it fine.
B: Well, for the replay I don't need the drive at all. The replay tool just tries to call the allocator and then measures how long it takes, and those are the results. Well, we can extend it, but no, no, there's no need for specific devices for that; everything happens in memory.
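A minimal sketch of that kind of in-memory replay (a hypothetical stand-in interface, not the actual tool): feed the recorded allocation requests straight to an allocator object and time each call, logging the want/unit/max/hint fields in hex as described earlier.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical sketch of the replay idea: no block device involved,
// just allocator calls timed in memory. `Allocator` is a stand-in
// interface, not Ceph's actual class.
struct Request { uint64_t want, unit, max, hint; };

struct Allocator {
  virtual int64_t allocate(const Request& r) = 0;  // returns bytes allocated
  virtual ~Allocator() = default;
};

void replay(Allocator& alloc, const Request* reqs, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    auto t0 = std::chrono::steady_clock::now();
    alloc.allocate(reqs[i]);
    auto t1 = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0);
    // Log duration plus the want/unit/max/hint fields mentioned above.
    std::printf("dur=%lldus want=0x%llx unit=0x%llx max=0x%llx hint=0x%llx\n",
                (long long)us.count(),
                (unsigned long long)reqs[i].want,
                (unsigned long long)reqs[i].unit,
                (unsigned long long)reqs[i].max,
                (unsigned long long)reqs[i].hint);
  }
}
```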
A: Since I've got this window shared for people, I'll show what these patterns look like. So this is basically what we're seeing with the current defaults. This looks a little different depending on the drive that you're running on, because some are faster than others, so it might show slightly different oscillation behavior, but this is kind of what we're seeing in master right now. And it's this kind of behavior, where it's just going straight.
A: And actually, I'll change to the other PR. Let's see.
A: Doing a little bit better, though. Same thing with an even bigger search space, eight times what the default is: still seeing this pattern. And then this is the change I made to do it based on time, with a small, well, sort of small, 500-microsecond search time allowance. And then when we move up to one millisecond, that's when we go back to this behavior, where we stay in near-fit instead of falling back to best-fit.
A: We see this linear allocation and high throughput on the Samsung drive. And I did two milliseconds as well, but it's the same behavior. So yeah, Igor, what perplexes me is that, on a workload like this with a very, very lightly filled disk, we could have near-fit take a millisecond, right? I mean, doesn't that seem crazy?
B: To get to allocate a four-megabyte chunk.
A: Yeah. So anyway, hey, Casey mentioned he's gotta go soon. Igor, would you have time to continue this offline? Maybe we can talk more about it.
A: Okay, cool. Maybe we can continue this; we're really in the weeds here. I don't know if people care about this that much or not, but...
A: Maybe we can wrap this one up for now. So, Casey, the tracer PR: did you have anything quickly you wanted to mention about it?
C: Cool. I think that with Ombre, initially he was running the test for like 30 seconds, so what he really observed was whatever was happening at the beginning, and it was really fluctuating. So now he's running it for two or three minutes, and it's much more stable.
A: Okay. So, honestly, I probably won't look at it until this is all done; we're trying to do that specifically, yeah.
A: Maybe moving on. Gabby, we were talking before about the performance issues you were seeing.
A: Yeah, I don't know if it's master that's slow. I don't think it is, but it could be. Maybe... have you tried testing Pacific, out of curiosity?
A: Yeah, that should do it. I just linked it, or put it in chat, but here, I can give you a link to it too.
A: There you go. So yeah, just check that out, and it'd be really interesting to see if that fixes it or not. If it doesn't, then yeah, we definitely need to figure out why that node is going slow now.
E: Yes, so when I tried to change from... to nvme1, the thing just stayed.
E: And BlueStore and so on, so I tried to change all of them from...
A: Interesting. There appear to be writes in this test that I did; I got this when I was on it earlier with you, and it appears that there are writes both to nvme0 and nvme1 in that test.
A: Okay, Gabby, my thought is maybe we take this out of the performance meeting, and I think we'll wallclock-profile the OSD while it's running, take a look, and find out what it's doing. Let's start our own chat after the meeting here and then we'll go.
A: Okay, sounds good. All right, anything else that anyone wants to bring up this week before we wrap up?
A: All right, well then, thank you everyone for coming. It might have been a little dry this week, sorry about that, but thanks for attending, and we'll see you next week. Bye.