From YouTube: CDS Reef: Performance Meeting
Description
The Ceph Developer Summit for Reef is a series of planning meetings around the next release and some community planning.
Schedule: https://ceph.io/en/news/blog/2022/ceph-developer-summit-reef/
A
All right, well, that's good. I didn't miss anything, but I think everyone's still trying to wrap things up from Quincy and now the Pacific 16.2.8 release, so that's not totally surprising.
A
There was one closed PR this week. This is Adam's great PR, "I've got the world on a string, sitting on a rainbow." Let's see, is Adam here? I don't think so right now. The gist of this one was that a lot of it got moved out into other PRs and other work, and the stuff that remained wasn't really necessary anymore or is being added to something else.
A
So that got closed by Adam. We did have a couple that were updated this week. Corey's excellent PR has had a lot of discussion; both Casey and Igor have reviewed it, and it looks pretty good. The only comment Igor made was that maybe we should have a configuration parameter to switch it back to the old behavior, in case we cause some kind of odd corner case, which is probably not a bad idea.
A
You know, it's a little crazy to be adding options to revert performance improvements or bug fixes, really, in this case, I think, but this one might be warranted. Igor, otherwise, did you feel pretty good about it? Or you, Casey, either of you? I've only briefly looked over it and know the general idea of what it does. Casey, I know you had a couple of things you wanted to change earlier.
A
Yeah, I can understand your concern there, Igor; sometimes RocksDB can be really temperamental. I don't think this should cause issues, just given what I understand about how it works, but I do get the concern. And Casey, are you satisfied now with the PR?
C
Yeah, code-wise I like it, and it does what it says it does. But I just don't understand BlueStore well enough to know how it fits in and what the repercussions could be.
A
Yeah, I mean... oh good, Corey, you're here. You know what, let's talk about it more once we get through the PRs, because then we can have a bigger discussion. So, let's see, other updated PRs: my time-based algorithm for the AVL allocator. The big win was one we already did.
A
That was to fix the repeated searches from the same cursor position: every time you try to do an allocation of the same size, you start at the same offset, you fail over and over and over again, until you actually do a different search for a different size. You just waste time in the fit search.
A
So that was the big win. This is still a win, I think, and it makes it easier to understand what's going on, but we can be a little more lazy about it now; we don't need it immediately for Quincy. So Igor and I have been having a discussion in there, mostly about getting rid of the debug code, but it's marked do-not-merge right now, so we should be okay for the moment.
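The repeated-search problem described above can be sketched with a toy free-list allocator. This is illustrative Python only, not Ceph's actual AVL allocator; the class and field names are invented. The idea is simply that once a search for a given size has failed, the identical retry is skipped until the free space actually changes.

```python
# Toy illustration (not Ceph's AVL allocator): remember sizes whose search
# already failed, so repeated same-size requests skip the futile rescan.
class CursorAllocator:
    def __init__(self, free_extents):
        # free_extents: list of (offset, length) pairs, kept sorted by offset
        self.free = sorted(free_extents)
        self.failed_sizes = set()   # sizes known to have no fit right now
        self.scans = 0              # extents examined, just for demonstration

    def allocate(self, size):
        if size in self.failed_sizes:
            return None             # skip the search we know will fail
        for i, (off, length) in enumerate(self.free):
            self.scans += 1
            if length >= size:
                # carve the allocation out of the front of the extent
                if length == size:
                    del self.free[i]
                else:
                    self.free[i] = (off + size, length - size)
                return off
        self.failed_sizes.add(size)
        return None

    def release(self, off, size):
        self.free.append((off, size))
        self.free.sort()
        self.failed_sizes.clear()   # free space changed; retrying is worthwhile

alloc = CursorAllocator([(0, 4), (8, 4)])
assert alloc.allocate(16) is None      # a full scan fails once...
scans_after_first = alloc.scans
assert alloc.allocate(16) is None      # ...and the repeat is skipped entirely
assert alloc.scans == scans_after_first
assert alloc.allocate(4) == 0          # a different size still searches
```

The real fix tracks search cursors rather than a set of sizes, but the effect is the same: the allocator stops re-walking the same region for a request shape it has already proven cannot be satisfied.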
A
Let's see. In terms of allocators, I think we still have a lot of work to go back and look at behavior, but not for today. Okay, so let's see, next is "maintain free list type in NCB mode." I asked Gabby about this a little while ago, and he said that this is more or less now just a refactoring PR.
A
It shouldn't have any performance impact at all, so it probably could be removed from our list here, but I'll just leave it until it merges. Okay, and then finally, the doc PR for rewriting the hardware docs; there's been a little bit more discussion on that.
A
A couple more suggestions about wording, and I think the latest was looking at clock speed versus number of cores in terms of the monitor, the manager, or something. So anyway, people are looking at that and talking about getting that settled. All right, so that was it; that was all I had. I didn't quite make it through the whole file; I've been trying to do too many things at once, but usually the stuff at the end doesn't see a whole lot of updates.
A
All right, well then, Corey, there has been a lot of work on your PR fixing our slow iteration issue. Would you be able to talk a little bit about the latest and what you guys have been talking about and doing?
D
All right, so yeah, I guess let me just start with where we ran into this issue, so that everybody has the context, because I think that's important, especially in terms of deciding how we push it into Pacific, behind a feature flag or not, and so on.
D
So basically, we had a customer on a pretty new cluster. We didn't have too much going on there yet, but we had one customer using Veeam that started hitting our cluster really hard, doing like 50 to 100 megabytes per second of writes for a few weeks. And then we started noticing that it was pausing like every six minutes and then restarting again, and we ended up finding out why.
D
It was because the reshard operation, the dynamic resharding, was trying to reshard every six minutes, and it kept failing due to one of the bugs that has existed in Pacific for versioned buckets.
D
And so we saw this in our monitoring, and we looked into it, and we ended up finding that the issue was related to a bug that had been introduced earlier, about non-ASCII keys in omap causing issues, and there was a fix for that. So we basically patched our production cluster with that fix, and right afterwards the bucket resharded successfully from 11 to 397 shards, and then performance throughput completely tanked.
D
So the fix that we ended up finding, and that is in that PR, is basically that the omap iterator at the RocksDB level was trying to search for a key in all three column family shards.
D
Even though, based on the hashing paradigm, it would only ever exist in one of the column families. And when it tries to search for it in one of the column families the key doesn't actually exist in, it may come up against one of these delete-range tombstones, and the way that works in RocksDB is that it essentially has to iterate over every single key in that range, which in our case was like millions of keys, and do the decoding and all that, which took a significant amount of time.
D
So the PR basically avoids that: first, by checking whether the range passed down from BlueStore can only be on a single column family shard, due to the hashing paradigm; and second, it uses the upper-bound setting in RocksDB to prevent iterating over anything above the omap of that particular object.
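Those two fixes can be sketched in miniature. This is a toy Python model, not Ceph or RocksDB code; the key layout, helper names, and `NUM_SHARDS` value are invented for illustration. The essential points match what Corey describes: the hash placement means one object's omap keys live in exactly one column family shard, and an upper bound lets iteration stop at the end of that object's key range instead of wading into a neighboring range.

```python
import hashlib

# Toy model of the two fixes: omap keys are hash-sharded across column
# families, so a lookup for one object's omap only ever needs one shard,
# and an upper bound ends iteration at that object's key range.
NUM_SHARDS = 3

def shard_of(object_key: str) -> int:
    # Stand-in for the hashing paradigm that places an object's omap keys.
    return int(hashlib.sha1(object_key.encode()).hexdigest(), 16) % NUM_SHARDS

def list_omap(shards, object_key):
    """shards: list of dicts {full_key: value}; full keys look like 'obj.k'."""
    prefix = object_key + "."
    upper = object_key + "/"          # first key past the '.'-prefixed range
    out = {}
    s = shards[shard_of(object_key)]  # fix 1: consult only the owning shard
    for k in sorted(s):               # sorted() stands in for ordered iteration
        if k >= upper:                # fix 2: the upper bound ends the scan
            break
        if k.startswith(prefix):
            out[k[len(prefix):]] = s[k]
    return out

shards = [{} for _ in range(NUM_SHARDS)]
for obj, key, val in [("a", "k1", 1), ("a", "k2", 2), ("b", "k1", 3)]:
    shards[shard_of(obj)][f"{obj}.{key}"] = val

assert list_omap(shards, "a") == {"k1": 1, "k2": 2}
assert list_omap(shards, "b") == {"k1": 3}
```

In the real PR the bound is expressed through RocksDB's read options on the iterator, which is what keeps the iterator from ever touching the millions of tombstoned keys past the object's own range.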
D
I think there's a very real concern that somebody else is going to run into the same problem that we did and have a really bad time, if they happen to have a client that's doing a lot of bucket listings and a bucket that is up against one of these ranged tombstones, and they don't know about the manual compactions or about this flag specifically and enabling it.
A
Corey, I think if we were to put in a configuration option, this would be the default behavior, and it would only be a short circuit for people to be able to turn it off if it broke something. Would that be reasonable, do you think, or do you think we just have it blanket on?
D
I certainly understand the sentiment there, of putting it behind a configuration flag when we're trying to put this in rather quickly at the end of this release cycle and trying to get it out. That makes sense to me. I don't know what the extent of the testing is that gets done prior to release anyway, and whether it may or may not catch it. That's how I feel, from my perspective, in terms of the risks of things that could go wrong.
A
Yeah, I mean, on one hand it feels ridiculous to put a configuration option in for what to me feels like a bug fix, right? This is a major issue that you've figured out how to solve. But there is this nagging voice in the back of my head, because we've seen RocksDB do things we really didn't expect, and I don't know if your PR could result in that, or break something else.
A
I
don't
really
think
so,
but
that's
and
just
on
the
surface-
I
don't
think
so,
but
I
guess
that
would
be
the
the
question
or
the
concern.
B
Yeah, I completely agree with that. It should be enabled by default, but I have hit the case where I wanted to disable some stuff in the field multiple times, so yeah, I definitely want a switch for such a new feature.
D
Yeah, I'm just thinking through that right now. From the upper-bound and lower-bound standpoint, that's no problem for sure, because I can just not set those options on the RocksDB read options; a simple condition around that seems very trivial.
I'm trying to think now; the idea of isolating it to a single column family should also be easy to base on a configuration flag too, because I can just gate that condition as well. So it should be a minor lift.
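The gating Corey describes can be sketched as a small planning function. This is illustrative Python, not the actual PR's code, and the flag, function, and dictionary names are all hypothetical: with the flag off, the scan falls back to the old behavior of consulting every shard with no iterator bounds.

```python
# Sketch of config-flag gating (all names hypothetical, not a real Ceph
# option): when disabled, fall back to the old unbounded, all-shard scan.
def plan_omap_scan(enabled: bool, owning_shard: int, num_shards: int,
                   lower: str, upper: str):
    if not enabled:
        # old behavior: no bounds, consult every column family shard
        return {"shards": list(range(num_shards)), "bounds": None}
    # new behavior: single owning shard, bounded iterator
    return {"shards": [owning_shard], "bounds": (lower, upper)}

assert plan_omap_scan(False, 1, 3, "a.", "a/") == \
    {"shards": [0, 1, 2], "bounds": None}
assert plan_omap_scan(True, 1, 3, "a.", "a/") == \
    {"shards": [1], "bounds": ("a.", "a/")}
```

Structuring it this way is what makes the revert switch cheap: both code paths stay intact, and the flag only chooses which plan the iterator is built from.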
A
Yeah, and it might be something like a dev option: if this is working well in, you know, six months, maybe we don't even keep it around. Maybe it's something that we keep in the back of our heads; we don't really need it in the long run.
E
I just would like to say that I love that PR, and I'm sorry that it's not been there from the beginning; that kind of improvement really should have been, Corey. Thank you for making that PR.
D
Okay, well, I'll go ahead. I can certainly add configuration flags; I don't think that's a big lift at all, just after briefly thinking through it. So why not? I guess, do I just add them like any standard configuration, or do we do anything special for this kind of option?
D
I'll do that here in the next few hours then and update the PR. Awesome.
A
Yeah, do you want to run this through master QA first, before we do the backport? What's your timeline looking like for 16.2.8?
F
Yeah, I think this issue is urgent enough for us to fix, or, you know, block 16.2.8. I would like to get the test run started for master as soon as possible. Given that the new behavior is on by default, the config option change should be minor, and it doesn't hurt to run it through a first round of testing.
A
Yeah, I just put my approval on it, just saying let's get the option to disable the behavior, default on, and test it and merge it. So yeah, it looks good. Thank you, Corey; that was excellent, excellent work, and I can't highlight that enough. I suspect that this is going to reduce the workload on our consultant folks quite a bit, so this is a big win. Thank you.
D
Yeah, by the way, thanks to Casey also for giving me some pointers on the C++ stuff. It's been a while since I've used C++ a lot, so I'm just kind of getting back into it, and I appreciate the patience with some of the details there.
A
All right, well, this has been a success, so fantastic. I don't have anything else for this week. Last week we were talking about pg_log and some other things; Gabby's not here, so maybe we'll just wait on continuing that discussion.
A
I think the only other thing I had is that I'm looking at Crimson again and running into some issues. I've got a pull request that I'll finish later today: all the Seastar command-line options were no longer being parsed correctly, due to a change in Seastar itself, which we were passing stuff to through kind of a janky thing we were abusing.
A
So I'm fixing that, and I've got a way that I think will work fairly well. But beyond that, it seemed like I was seeing a memory leak in Seastar, or sorry, in Crimson. Mostly I saw memory growth during writes, so I think it's on the write path. Radek made a good observation this morning that it potentially was the way the alien store works: we may be leaking memory when we issue new threads through AlienStore's thread pool. Radek, do you want to repeat what you were saying earlier?
G
It's not even a hypothesis, it's just a long shot, based on my experiments from when we were trying to run AlienStore and BlueStore with the Seastar allocator. At the moment the situation has changed significantly, because Seastar got a patch that basically bypasses the Seastar allocator for alien threads, like the bunch of threads we have in AlienStore.
G
This should work, but there is some gap between "should" and "does." From my personal experience, the Seastar allocator is really speedy, it's really fast, but there are a lot of limitations.
G
The most important one is that there is a limit on the number of shards, and a shard is created when the allocator sees a new thread for the first time.
G
So we could be creating new shards if the bypass somehow stopped working, and actually in BlueStore, in RocksDB, I have sometimes seen the creation of short-lived threads.
G
So it turned out that running BlueStore directly with the very limited Seastar allocator is not our way to go; it's not a good idea. Since then, of course, as I said before, Seastar got a patch that basically bypasses the Seastar allocator and goes directly to the system one. In short it should work, but basically this is just a hint on what might be worth checking first.
A
Yeah, and related to this topic, I want to go back and see if we can again try to use tcmalloc for all the system allocation.
G
That would be super cool, because at the moment Crimson is not linked with tcmalloc, so it uses the libc-provided allocator, which is called ptmalloc2 or something like that, and it's always lock-ish.
G
There is no notion of lock-free, perfect allocation; every single time you need to go through some locking, and we have multiple POSIX threads in the AlienStore pool. So it might really be worth it; first of all, of course, profiling.
G
But if I recall correctly, we were doing that, and I will try to find it soon. The difference was pretty big between using the Seastar allocator and the system allocator, because initially, before Seastar got the bypass for the system allocator, we were disabling the Seastar allocator entirely and, just for the sake of operating BlueStore, switching back to the libc-provided one, and the difference...
A
Sorry. No worries, no worries, I believe you. And also, with the libc allocator we see significantly more memory fragmentation and higher memory usage overall. So it's probably a big win, if we're using AlienStore, to use tcmalloc.
A
Yeah, we'll also be able to then use all the priority cache work for BlueStore as well, which would be nice.
A
So in the last communications I had with them, no one knew what firmware we should use. They asked me which one I wanted to use, and I told them I had no idea which ones I should use, because I don't know what the differences are. So they were trying to figure out which firmware they thought I should install, and that's kind of where we left it.
A
Hopefully I can talk to someone a little bit more and figure it out. I think he was of the opinion that we should get the latest, just whatever the newest was, and put that on there, but I think they were trying to wait to talk to some of the engineers in South Korea to verify that that was the right way to go.
A
The good news, though, is that the cursor change basically got us back to where we were before.
A
The one thing that's kind of interesting is that their drives, which are basically the same things we have, only with newer firmware and a slightly different driver revision, are still like 10 to 15 percent faster than ours. So I don't know if that's due to the firmware or if there was anything else that changed, but they just consistently get slightly higher results than we see, even with our fixes, and they weren't suffering the same issue that we were. So definitely it seems like something is different.
A
Oh, I see your guess here, Radek.
G
Yeah; first of all, it's not about BlueStore, actually, it's about Seastar. But well, the difference is pretty significant.
A
So yeah, I think it'd be really interesting to see how tcmalloc compares. I assume the Seastar memory allocator is what we really want anywhere we can use it.
G
I'm afraid we cannot; it's simply too limited. It's not a POSIX-compliant memory allocator. The most important limitation is that it can support only a limited, constant, compile-time-defined number of threads, and it doesn't even recycle the resources, the shards. So it boils down to the number of all the threads it has ever seen.
G
So if you have, let's say, an HTTP server that would like to use POSIX threads with this allocator, it will be able to spawn, if I recall correctly, 256 threads, and that's all.
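The limitation Radek describes can be modeled in a few lines. This is a toy Python simulation, not Seastar's actual internals, and the cap of 4 is illustrative (he recalls the real limit being around 256): one shard is created per thread ever seen, the cap is fixed, and shards are never recycled, so short-lived threads permanently consume them.

```python
# Toy model of a per-thread-sharded allocator with a fixed, never-recycled
# shard table (numbers illustrative; not Seastar's real implementation).
MAX_SHARDS = 4   # stands in for the compile-time limit (e.g. 256)

class ShardedAllocator:
    def __init__(self):
        self.shard_by_thread = {}

    def shard_for(self, thread_id):
        # A shard is assigned the first time a thread is seen.
        if thread_id not in self.shard_by_thread:
            if len(self.shard_by_thread) >= MAX_SHARDS:
                raise RuntimeError("out of allocator shards")
            self.shard_by_thread[thread_id] = len(self.shard_by_thread)
        return self.shard_by_thread[thread_id]

    def thread_exited(self, thread_id):
        pass  # shards are never recycled: short-lived threads leak them

a = ShardedAllocator()
for t in range(MAX_SHARDS):   # short-lived threads, e.g. spawned by RocksDB
    a.shard_for(t)
    a.thread_exited(t)
exceeded = False
try:
    a.shard_for(MAX_SHARDS)   # the next new thread pushes past the cap
except RuntimeError:
    exceeded = True
assert exceeded
```

This is why the bypass patch matters: if alien threads ever stop being routed around this allocator, every short-lived thread eats one of a fixed pool of shards until allocation fails outright.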
A
Braddock,
I
thought
that
we
could
use
it
for
the
any
memory
that
we're
allocating
on
the
the
c-stars
side,
though
not
on
the
alien
side.
Right,
isn't
that
the
whole
idea
is
that
we
can
use
the.
G
At
the
moment,
after
the
changes
in
systar,
we
can-
and
we
do
use
of
the
sister
allocator
in
the
sister
part
of
of
of
crimson
yeah
before,
if
we
had
to,
we
were
enforced
to
research,
the
ellipse
one
only
because
only
for
the
sake
of
hosting
bluestar.
Even
if
actually
somebody
picked
up
seanster
via
the
configuration
the
runtime
configuration
machinery
we
we
had,
we
were
using
since
the
the
ellipse
allocator
for
entire
process
for
all
parts,
also
the
sister
written
ones
after
the
change.
A
So right now, if we compile without the Seastar default allocator, then we'll use the Seastar allocator for everything, right?
G
If we compile without the Seastar allocator, it will use libc for everything. And we could switch the libc allocator, overriding it with, let's say, tcmalloc.
G
My impression is that the performance of tcmalloc will be somewhere between the Seastar allocator and ptmalloc2, the default libc one. So there's still a benefit in not disabling the Seastar allocator: running all the Seastar work with the Seastar allocator, while letting the alien threads in AlienStore use tcmalloc, just for the sake of BlueStore.
G
I prefer not to rely on my memory about the Seastar default allocator; let's grab the Seastar repo. Okay, core/memory.cc. Yes, there is that default allocator.
A
Yeah, mostly I was just trying, Radek, to figure out whether the behavior has now changed with that flag. At one point was it using the default allocator for everything, and now it uses it only for alien threads? That's what I was trying to work out. Okay.
A
Yeah, once I get done wrapping up this other PR, I'll try to start looking at the memory usage again and see if I can dig into exactly what's going on here, and what we can do to switch over to tcmalloc, because that would be, I think, a low-hanging-fruit big win.
H
You cut out in the middle. Could you repeat the last part?
A
Sure. You're thinking that it's fine to use the Seastar allocator for CyanStore and for SeaStore, and...
A
All right, well, that's all I've got, so thanks everyone, thank you for coming. And I think we have a RADOS meeting coming up in 15 minutes, right? Yeah.
A
Yes, yes, exactly. So why don't we wrap this up? Everyone can get a 20-minute break, and then I'll see a number of you guys in about 20 minutes.