From YouTube: Ceph Performance Meeting 2022-01-13
A: All right, so this week has been a busy week. No new PRs that I saw, but a whole lot of closed PRs, and updated PRs for that matter. So everybody's trying to get stuff in for Quincy. Let's go through the list.
First PR, RGW zipper: this is a bug fix PR. We noticed that RGW in master was very, very slow in some age binning tests I was running for the age binning PR that also just merged. It turns out that was due to an older PR that was mostly cosmetic, but did have a change that ended up loading stats for every bucket load in RGW. It turned out to be a fairly easy fix, so we got it in quick, which is good because we won't have that issue going into Quincy now. The fix basically returned us to previous performance levels, so that was really good.
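A minimal sketch of the kind of bug described here, assuming the problem was eagerly loading stats on every bucket load. The class and method names below are invented for illustration; this is not the actual RGW zipper code:

```python
# Hypothetical sketch of the regression: loading stats on every bucket
# load vs. fetching them lazily. Names are invented, not RGW zipper code.
class Bucket:
    def __init__(self, name, backend):
        self.name = name
        self._backend = backend
        self._stats = None  # not fetched yet

    def load(self):
        # The regression did the equivalent of also calling
        # self._backend.read_stats() here, on *every* bucket load.
        self._backend.read_metadata(self.name)

    @property
    def stats(self):
        # The fix: only pay for the stats query when stats are asked for.
        if self._stats is None:
            self._stats = self._backend.read_stats(self.name)
        return self._stats
```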
A: The TTL cache implementation PR merged. I didn't see a whole lot of further discussion on it; I think it just needed to get through some testing. Josh Solomon's primary balancer PR merged. Laura, I think you merged that? Anything new to report on that one?
B: Nothing new. I guess the main thing is, we decided the original PR had a documentation commit on it too, where Josh had written a new document in the developer guide that explains the primary balancer feature. But since that part hasn't been implemented yet, we moved that over to a different PR that will be merged after Quincy branches off. The refactoring code was merged.
A: Anything new on that one? I think we saw some performance advantage with it, but I have trouble remembering.
C: Yes, there were performance advantages, but none of the tests we are running as a quick performance check are showing that, so I'm waiting to see them in more elaborate Quincy testing.
A: Cool, cool. Let's see, next: the cache age binning, also a multi-year PR that finally merged. Like Adam's tests for the fine-grained locking, we don't actually see a huge performance gain with this, which was very disappointing; in fact, though, the rados bench perf test that we run through Jenkins actually did show a little improvement with it, which was nice. So the bigger benefits we get from this are more fine-grained control over the caches in BlueStore, and also much better information regarding the relative ages of the items in our different caches, which we present through the performance counters. So overall, I think it's still a good win, just not the big win that I was hoping for.
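Roughly, age binning groups cache items into coarse age buckets so relative ages can be tracked and reported cheaply. A toy sketch of the idea, with invented bin boundaries and names; this is not BlueStore's actual implementation:

```python
import time
from collections import defaultdict

# Toy age binning: instead of tracking exact per-item ages, items are
# grouped into coarse age bins, which makes "how old is what's in this
# cache?" cheap to report via perf counters.
BIN_EDGES = [1, 5, 30, 120, 600]  # seconds; invented bin boundaries


def age_bin(inserted_at, now=None):
    """Return the index of the age bin an item falls into."""
    age = (now or time.time()) - inserted_at
    for i, edge in enumerate(BIN_EDGES):
        if age < edge:
            return i
    return len(BIN_EDGES)  # oldest bin


def bin_histogram(cache):
    """cache: dict of key -> insertion timestamp. Returns items per bin."""
    now = time.time()
    hist = defaultdict(int)
    for ts in cache.values():
        hist[age_bin(ts, now)] += 1
    return dict(hist)
```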
A: Let's see, next: Igor's PR on handling onode pinning and shard trimming, and also Adam's PR about refactoring the onode reference counter and pinning. Both of those were closed by the stalebot. So these are two different paths we could have gone down. I know there have been some other pinning code changes recently.
C: Yeah, they are outdated and it's fine, because we have a new one in the updated category, so they are clearly closed and they will remain closed forever now.
A: Excellent, all right. So we've got a path for that, which is very good. Then the other one that was closed by the stalebot: auto-tuning of the MDS cache memory based on RSS usage. I have not seen updates to that one in a very, very long time.
A: Personally, my feeling on this is that we should be using the priority cache for this anyway. The problem you hit when you try to use RSS memory to judge how much memory to use for caches is that it can result in very, very nasty swings that you really don't want to deal with. The priority cache kind of gets around that.
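A toy model of that failure mode, under the assumption that the swings come from the lag between trimming a cache and RSS actually dropping. All names and numbers below are invented for illustration:

```python
# Toy model: RSS reflects the allocator's view of *past* decisions (freed
# pages are returned late), so a naive RSS feedback loop keeps over- and
# under-correcting. A priority-cache style approach instead divides a
# fixed memory target among caches, with no RSS in the loop.

def rss_driven_resize(cache_size, rss, rss_limit, step=0.25):
    # Naive feedback: grow when RSS looks low, shrink when it looks high.
    # The lag between cache trims and RSS dropping causes big swings.
    if rss > rss_limit:
        return cache_size * (1 - step)
    return cache_size * (1 + step)


def priority_cache_assign(memory_target, requests):
    """requests: list of (priority, wanted_bytes); lower number means
    higher priority. Hand out the fixed budget in priority order."""
    remaining = memory_target
    grants = []
    for _prio, wanted in sorted(requests, key=lambda r: r[0]):
        grant = min(wanted, remaining)
        grants.append(grant)
        remaining -= grant
    return grants
```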
A: Okay, next, updated PRs. Use thread-local pointer variables to save the shard pointer: that was approved by Ronen. So, Ronen, are you here? I don't see Ronen today. All right, I didn't look closely at that, so I guess Ronen was happy with it. Next: make shared blob fsck much less RAM-greedy. Igor is not here. Adam, you looked at it?
C: I looked at it a little bit. I think, assuming it tests out fine, we should just get it in.
C: There was a question of hashing, and of using two separate bucket lists for hashing. It's just a matter of performance and of the granularity of checking and verification, which I waved through because it will work as it is. It's just an improvement thing, so, not thinking it was that important, I just accepted it and left that discussion for the future.
C: Basically, osd_memory_target takes care of our caches, and the data structure — the bitmap data structure for shared blobs — is a fixed allocation of fixed size, so it really stays within the memory target. I'm pretty satisfied with that. Of course, we might pay for that with extensive unneeded blob rebuilding, but that's what we have to do; there is no other way around it.
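A sketch of that trade-off, assuming the fixed structure works roughly like a one-bit-per-slot table where hash collisions can force redundant rebuilds. This is purely illustrative, not the actual BlueStore fsck code:

```python
# Track shared blobs during fsck with a fixed-size bitmap instead of an
# unbounded per-blob table. Collisions are possible, so "maybe seen" can
# trigger an unneeded blob rebuild, but memory use stays constant.
class FixedBitmap:
    def __init__(self, nbits):
        self.nbits = nbits
        self.bits = bytearray(nbits // 8 + 1)  # fixed allocation up front

    def _slot(self, blob_id):
        return hash(blob_id) % self.nbits      # many blobs may share a slot

    def mark(self, blob_id):
        slot = self._slot(blob_id)
        self.bits[slot // 8] |= 1 << (slot % 8)

    def maybe_seen(self, blob_id):
        # False positives possible (collision), false negatives are not.
        slot = self._slot(blob_id)
        return bool(self.bits[slot // 8] & (1 << (slot % 8)))
```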
C: So if there are no errors, there is no need to give it more memory — I would even say the less memory, the faster it will work. But if you had a very broken state, with a lot of shared blobs requiring repair, then having more memory would help.
A: Yeah, that was one of the reasons I was advocating for using the priority cache for this, because we could then basically inform fsck of up to how much memory it should try to target. But it's a minor point. You know, let's get this merged and then we can hash that kind of thing out later.
A: Skip onode with caps iteration for empty directories — oh no, sorry, this is MDS. I got myself way off there. Where am I? Okay, I think, next... did we talk about — oh, this is the pinning logic. This is the other pinning logic PR from you, Adam. Yes?
C: I mean, that's actually both our work, but in my PR this time, and that's why the other solutions were closed and will stay closed. Okay, okay — it's a simplification after we merged Igor's fix to onode pinning, so now we could make some more simplifications.
A: Nice, nice. Yeah, I am very happy with what you guys came up with compared to what we were doing earlier, especially the one I tried to do. This is excellent.

Okay, next: Radek's PR to introduce huge-page-based read buffers. Is Radek here? No, I don't think so. Igor reviewed that and I think he approved it, so that looks good. Next: optimize PG removal, by Igor, which previously was failing tests. It's had more reviews and discussion and updates.
A: Yeah, yeah. I understand it did have an approval by someone early on, I think, but it would probably need someone in addition to approve it. Adam, are you planning to look at that?
A: Last one: the test/objectstore store_test PR. Oh, this is my old omap thing. Neha just marked this as not stale, because we do still want it in some form — probably not in the format it's in, though. Basically, these tests take a while, and we probably don't want to make store_test take that long. Also, we can't easily change the parameters of the tests that way with the gtest suite.
A: So maybe this becomes its own separate benchmark or something, and maybe we don't tie it directly to the ObjectStore but try to go through an actual OSD. But anyway, those are big changes — lots of work, not going to happen for Quincy.
A: So I think that's all I had in the updated categories. Did I miss anything from anybody?
A: Nothing anyone can think of? Okay, sounds good, then moving on. The only real discussion topic I wanted to bring up this week: we've talked previously a little bit about Quincy performance tests — Neha brought it up, I think, last week. I think for NVMe tests I can probably take this on; we've got quite a few templates for different tests that we want to run. So my thought is: let's use Mako for this.
A: We can actually do a fairly decent-sized cluster on that hardware for testing. We've already got some fairly straightforward fio tests that we can run against RBD and CephFS, and possibly also look at tcmu-runner and NBD — so, basically, iSCSI and NBD. We'll see if that still works; we did have it working at one point through CBT, so theoretically it may still. And then hsbench for RGW. You know, cosbench is kind of bit-rotting and has been for some time.
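For reference, one of those straightforward fio tests might be driven roughly like this. The fio rbd ioengine and its options are real, but the pool and image names and job parameters here are made up, and CBT's actual wrapper does considerably more:

```python
import subprocess

# Rough sketch of an fio run against RBD, similar in spirit to what the
# CBT fio wrapper drives. Pool/image names are hypothetical.
cmd = [
    "fio",
    "--name=rbd-randwrite",
    "--ioengine=rbd",
    "--pool=rbd",             # hypothetical pool
    "--rbdname=bench-image",  # hypothetical image
    "--rw=randwrite",
    "--bs=4k",
    "--iodepth=32",
    "--runtime=300",
    "--time_based",
]
subprocess.run(cmd, check=True)
```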
A: So that's by far the easier route to go. We don't have any real tooling yet for IO500 MDS performance. I have some stuff to make that sort of automated, but it's not real straightforward, and the results we get from it are usually really chaotic for a variety of different reasons. We'd probably have to run a lot of IO500 tests to see what kind of trends we'd see between Pacific and Quincy.
A: So I'm thinking probably we should just leave that out; it's also very time-intensive. Oh, omap bench — it might not be a bad idea to run it. It doesn't take too long, and there have been some fairly significant differences we've seen between releases in the past, so it might be worth looking at. And then they have said that the DFG team can do their own hard drive tests, so I'm thinking let's let them do that.
A: I also really like having multiple people running tests, because sometimes it shows things that one group overlooks and the other group happens to catch. So, what else? Neha, you brought this up — how does this sound?
A: Cool. The good news is that I've been doing a lot of testing as part of the age binning PR, and I'm not seeing anything right now on the RGW or the RBD side that shows any kind of major regression versus Pacific, based on previous numbers I've gotten. So I don't think we're going to see any real surprises for kind of the common-case workloads. It's possible we could still see some things in other areas, but at least based on what we've got in master.
D: That's good to know. I'm just curious — for the tests that you plan to run, how many OSDs are you using?
A: What I've liked to do in the past, when I have the resources to do it, is both kind of single-OSD tests, because you're able to push an individual OSD much more aggressively that way, and then also a big cluster test. On Mako, if we target just a single OSD per NVMe drive, we can do 60 OSDs. Otherwise, if we go with, like, four, we could do 240.
A: So, you know, I'm open either way. If you want to just try to go for, like, the maximum number of OSDs, we could probably do 240 — memory gets a little tight, but we can do it. Otherwise, we could give the OSDs a very, very comfortable amount of memory by just using one OSD per drive.
D: Yeah, in any case, I think we can go on the safer side and give the OSDs a good amount of memory.
A: Yeah, we could also very easily do two of these per NVMe drive, and then you could easily have four gigs for each one and plenty of space for other daemons and clients at the same time. That would be like 120 OSDs. Yeah, that sounds good.
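The sizing arithmetic here, spelled out, assuming Mako has 60 NVMe drives total (which is what the 120 and 240 figures imply):

```python
# Assumed: 60 NVMe drives total, implied by the 120 and 240 OSD figures.
drives = 60
for osds_per_drive in (1, 2, 4):
    print(osds_per_drive, "OSD(s) per drive ->",
          drives * osds_per_drive, "OSDs")
# -> 60, 120, 240 OSDs. With 2 OSDs per drive at a 4 GiB osd_memory_target
#    each, the OSDs alone budget 120 * 4 GiB = 480 GiB across the cluster.
```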
D: So you're going to be using CBT. I was just wondering, maybe if we have time we could do the recovery testing as well. I know there were some improvements that Sridhar made to the recovery tests, so maybe we can do that as well.
A: Yeah, yeah, certainly we could. I'm not sure — I haven't looked closely at the changes that they made beyond, you know, reviewing that. So the one, I think—
E: Yeah, can you hear me now? Yes? Yeah, there have been quite a few call drops on my end, sorry about that. So, as far as the new recovery test in CBT goes: it essentially creates a couple of pools, one pool dedicated to the client and the other pool dedicated to recovery-specific operations. The way the test works is to basically populate the client pool with a bunch of objects.
E: All this time, the client I/Os are going on. That way, we measure how the recovery proceeds against the client I/Os that are running in parallel, and the test basically collects all the stats related to client IOPS and the recovery items. That helps — and I've also written a simple tool to graph the statistics, the client IOPS and how the recovery proceeds to completion in that time frame. So, in a nutshell, that's what the test does.
E: In this case, the recovery pool is created much later compared to the client pool. The initial test already existed — I call it the blocking recovery test — and the other one that I created now essentially helps in creating background recovery.
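Putting the described flow together, a rough harness sketch. The cluster methods below are placeholders for whatever CBT actually calls; this illustrates the shape of the test, not the real implementation:

```python
import threading
import time

# Sketch of the two-pool recovery test flow: client I/O runs against one
# pool while recovery is induced via a second pool created later, and both
# client IOPS and recovery progress are sampled over time.
def run_recovery_test(cluster, duration_s=600, sample_s=5):
    client_pool = cluster.create_pool("client")
    cluster.populate(client_pool)                  # pre-fill with objects

    stop = threading.Event()
    io_thread = threading.Thread(target=cluster.run_client_io,
                                 args=(client_pool, stop))
    io_thread.start()                              # client I/O in parallel

    recovery_pool = cluster.create_pool("recovery")  # created much later
    cluster.trigger_recovery(recovery_pool)        # e.g. mark OSDs down

    samples, start = [], time.time()
    while time.time() - start < duration_s:
        samples.append({
            "t": time.time() - start,
            "client_iops": cluster.client_iops(),
            "recovery_ops": cluster.recovery_ops(),
        })
        time.sleep(sample_s)

    stop.set()
    io_thread.join()
    return samples  # graphed afterwards by a separate plotting tool
```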
A: Well, yeah, if you want to, we can. We'll just have to see how long it takes to get through all these others, but we could definitely try to see if we could get that in for Quincy too, if you want to try a much larger cluster.
D: Cool. I mean, ideally, creating a separate recovery pool and all that may be fine, but maybe we can also just have a regular setup, like an RGW cluster, and have the recovery test do some kind of bringing nodes down and capturing recovery stats, etc., in the existing cluster — instead of having to create a separate pool for recovery and all that stuff, which was mostly for QoS performance purposes.
A: Yeah, that's how I think the original ones worked. Basically, you pre-filled in some data — you know, you do maybe like an RGW put workload — and then at some point you have the recovery triggered, where it marks down a bunch of OSDs, and then CBT's monitoring the behavior while the cluster is healing.
D: Okay, I don't really recall what you used to do, but in general, I think it's a good time to revisit it.
E: Yeah, if I recall, the earlier test used a single RBD image, whereas the one that I have introduced essentially creates two separate images with two separate pools. That's the basic difference, I think. Okay.
A
I've,
never,
I
don't
think
I've
ever
tried
a
recovery
test
with
rgw.
If
I
remember
right,
the
the
aegis
bench
wrapper
supports
it
in
the
same
way
that
the
fio
wrapper
does,
but
I
don't
think
I've
ever
tried
using
it.
A: Cool. So, Neha, I want to ask you: what do you think about iSCSI, tcmu-runner, and NBD? We might be able to get them for free, essentially, other than the time it takes to run them. Previously we've done that through CBT and it's worked.
D: I would s— okay, I am biased. I would say lower, because I would really like to prioritize recovery because of the mClock changes — making sure recovery works fine, and we have the high-recovery profile with the mClock changes; I'll try that out as well. So that's my personal agenda, but yeah, as a global project, we can talk to Ilya and the RBD folks to figure out, you know, what's the priority in their minds.
A: Shoot, I'm blanking on his name — he left for Oracle a little while ago, on the RBD team.
A: But anyway, I don't think it's changed much in, like, the last year, just from talking to different people. We can test it, but I don't think it's probably going to look a whole lot different than it did a year or two ago, when we last looked at it.
A: He would probably know for sure, but that's the impression I get.
A: All right, well, I think we've got a plan then for Quincy testing here. That was the only topic I had for this week. I don't have anything else — does anyone have anything they want to talk about?
A: All right, well, if not, then — I know everyone's really busy trying to get last-minute PRs in here. So, yes, we'll wrap this up.