From YouTube: Ceph Performance Meeting 2022-03-31
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A: Not a whole lot going on with pull requests this week, which is not super surprising since everyone's desperately trying to get fixes into Quincy, and fix the fixes in Quincy, so not too much there. Casey, I did see there was this RGW multisite resharding PR. It looks like they actually merged it into another branch, from what I saw. Is that going into master eventually?
A: Oh okay, so yeah, I guess I'm sort of documenting something here that will get merged into master later, but it doesn't matter. Very cool. Okay, two updated PRs. One, this one from Gabi, changes some code in the no-column-B (NCB) code. Gabi, do we still need this? Is it still relevant after the discussion from last week?
A: Okay, cool. And then the other one is this tracer work that's been kind of ongoing for a while. I think there's some more testing; when I was glancing at it, it looked like they were saying that maybe it was a little more chaotic than it had been previously, but I confess I didn't look super closely at it. In any event, there's more testing going on there, more work to make sure it looks okay, and that was all I saw. Oh, go ahead — hey, Mark.
D: Hey, I have some updates on that PR. At the beginning, I ran an s3 test on that PR, and we don't suffer from any performance degradation. Then I ran the rados bench tool, and we can see like a two percent degradation in performance. Also, sometimes I'm getting unstable results with the rados bench tool on that PR when I'm trying to compare that PR to master.
A: Oh, no, okay. Well, good luck with diagnosing it! How long does the unstable period last — is it just a blip, or is it consistent over time?

D: No, wait — it's consistent. I mean, sometimes it goes around 200 above or below what I posted in the PR, but I think we can see that, even with those unstable results, at any time we're getting like two percent less performance than master.
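A quick way to sanity-check whether a roughly two percent delta is real, given the unstable runs described above, is to compare each branch's mean throughput against its run-to-run spread. A minimal sketch — the throughput numbers below are made up for illustration, not real bench data:

```python
import statistics

def summarize(label, runs_mbps):
    """Mean and run-to-run relative spread for repeated bench runs."""
    mean = statistics.mean(runs_mbps)
    rel_spread = statistics.stdev(runs_mbps) / mean * 100
    print(f"{label}: mean={mean:.1f} MB/s, spread={rel_spread:.2f}%")
    return mean

# Hypothetical repeated `rados bench` throughputs (MB/s).
master = [1012, 998, 1005, 1021, 994]
pr     = [ 991, 979,  985,  999, 972]

m = summarize("master", master)
p = summarize("pr", pr)
print(f"delta: {(p - m) / m * 100:.2f}%")
```

If the delta is consistently larger than the spread of either series, the regression is likely real despite the noise; if the spread swamps it, more runs are needed.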
A: You could try looking at it with perf and look at the cache hit/miss statistics you can get out of it, and see if maybe you're causing worse cache behavior on the CPU when the PR is applied. It might be a low-impact thing to look at, or you could get a full call graph with perf record.
A: Yeah, it's possible you might not have the resolution to be able to see it. With a call graph, if it's only like two percent, it's pretty tough to spot. You could try — maybe not, but maybe you'd see it. If you saw that your cache hit rate was lower or something, you know, those are still good statistics to have.
A: All right, well, good luck! I hope you can find it. Were there any — yeah, no problem — were there any other PRs that people had that they want to talk about this week?
A: All right, if not, then the big topic I have for today is that we've been trying to track down a write regression in Quincy that we talked about a little bit last week.
A: I was fairly convinced, incorrectly, that it was caused by the no-column-B code, and it's not, but it's possible that no-column-B was somehow triggering this a little bit. The past week, Adam and I have both been running a bunch of tests trying to narrow it down. Adam's not seeing this, but I'm fairly consistently seeing it on Mako, which is our AMD nodes that have Samsung drives in them. I was able to do a bisect and get down to about 10 commits, and in those 10 commits there was a change that we made to the AVL allocator.

That change is in pull request 41615 — I'll link it in the chat window here. This morning I went back and took the Quincy release commit that we're working from right now, reverted that PR against that commit, and I'm testing that branch now. I don't have a lot of results for it yet.
A: There may be something else that we need to track down, possibly, but so far, at least on these nodes, this seems to be maybe the smoking gun. So I tried to read through that PR a little bit, and there was a lot of discussion. Igor and Adam, I think you both looked at it fairly closely — do either of you remember, or have a sense for, how that first-fit strategy was actually changing in the PR, or if it was? It seemed like there was maybe some disagreement regarding some of the things you were originally thinking, Igor.
F: Well, I don't remember much from that PR, but I clearly remember it was about making the AVL allocator work faster. It was just limiting the search so as not to overburden the CPU. So, if anything, from a CPU point of view we should now be faster. The only thing that was never actually tested is what new allocation patterns may appear that previously were not anticipated. I mean, neither the previous patterns nor the new ones were analyzed, so it was both okay in the sense that they were both not tested. So, yeah.
G: And so, yes — Pacific, with that, to see if...
A: Yes, that's on the docket, but I'm trying to get through these other tests first before I go back and test that one. That probably will be next, though.
H: But just to mention how long it could take to perform a single allocation — it was like 60–70 milliseconds per single allocation. The drop was pretty dramatic.
A: These new restrictions — am I thinking about this right that, basically, before, we didn't do anything, we just had no limit, as if these were both zero, but when this code was added, we now impose those limits on how long we do first fit? Is that correct?
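My reading of the change as described here, sketched in a few lines (this is illustrative pseudocode, not the actual BlueStore allocator): first fit walks the free extents from the front, and the new config imposes a cap on how many extents it will probe before giving up, with zero meaning unbounded — which would match the "before" behavior:

```python
def first_fit_capped(free_extents, want, max_probes=0):
    """Return the index of the first extent >= want, scanning from the front.

    free_extents: address-ordered list of (offset, length).
    max_probes=0 means an unbounded scan (the old behavior); a positive
    value bounds the scan, after which the caller would fall back to a
    best-fit style search instead.
    """
    limit = len(free_extents) if max_probes == 0 else min(max_probes, len(free_extents))
    for i in range(limit):
        if free_extents[i][1] >= want:
            return i
    return None  # not found within the cap: caller falls back

free = [(0, 4096), (8192, 8192), (32768, 65536), (131072, 1 << 20)]
print(first_fit_capped(free, 16384))                 # unbounded: index 2
print(first_fit_capped(free, 16384, max_probes=2))   # capped: None -> fall back
```

The trade-off being discussed: the cap bounds worst-case CPU per allocation, but requests that fall back land somewhere other than the lowest suitable address, which can change the on-disk layout over time.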
H: So it would be very interesting to assess how our stack behaves: what would be the difference in performance between a single-chunk write versus tons of smaller writes, and how our I/O stack handles that?
A: Well, I haven't dropped it down yet. I've just been doing the 60-OSD configuration, because I started consistently hitting it with this other set of tests that I have — that's what I've been running. But now that we know what it is, and we have a target, I can start trying with, like, a single OSD.
F: Yeah, that would be cool, although I'm not hopeful, because my intuition is that the most important part of the degradation we're getting is that we actually change the shape of our objects — files, in fio terms — on the disk when we modify them. If we just have files under fio, our overwrites will just overwrite in place and not put data somewhere else, leaving the unmodified data in the old place.
H: There was maybe a similar performance degradation reported upstream recently, and yeah, it was like some additional fragmentation happened to his OSDs. They were all flash — or maybe some drives handled it badly.
A: Yeah, it's possible, and unfortunately for us these are probably not very uncommon drives. These are Samsung PM983s. They're fairly reasonably priced — not super low, but not super high write endurance — so, you know, reasonably priced.
H: ...that, synthetically, using fio, and hence we will be able to assess that performance degradation, if any, versus different drives.
A: Yeah, agreed. Now that it seems like I'm coalescing around this PR — assuming that continues to be the case — my thought is that I at least have a target, so I can try to shrink the test setup to hit it, right? Before, the space was too big to figure out exactly what it was, but now I think we're narrowing in, so it should be easier to attempt to reproduce this, maybe with fio.
A: Igor or Adam, do you know what the nuance is between first-fit versus best-fit mode? How does the behavior change in those two different modes?
F: I mean, that's the strange thing: best fit just tries to search throughout all the chunks to find free space that's exactly what you requested. But that mode, while very good for memory, usually tends to force your writes to go to very random places on the device. First-fit mode just tries not to do that, and to go as close as possible with a reasonably sized element.

So they're totally different modes of work.
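Igor's distinction can be shown with a toy free list (hypothetical extents; the real AVL allocator keeps these in trees, not lists): best fit picks the extent whose size matches the request most exactly, wherever it lives on the device, while first fit takes the lowest-addressed extent that is large enough, keeping writes clustered:

```python
def first_fit(free, want):
    """Lowest-addressed extent that is big enough."""
    for off, length in sorted(free):
        if length >= want:
            return off
    return None

def best_fit(free, want):
    """Extent whose size most exactly matches the request."""
    candidates = [(length, off) for off, length in free if length >= want]
    return min(candidates)[1] if candidates else None

# Hypothetical free extents: (offset, length).
free = [(0, 1 << 20), (900 << 20, 64 << 10), (100 << 20, 256 << 10)]

print(hex(first_fit(free, 64 << 10)))  # low on the device, near other data
print(hex(best_fit(free, 64 << 10)))   # the exact-size chunk, far away
```

For a 64 KiB request, first fit carves it out of the 1 MiB extent at offset 0, while best fit jumps to the exactly-64-KiB hole at 900 MiB — which is the "very random places" behavior described above.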
F: Given results from two different runs, we should be able to at least compare in bulk what we are getting from the allocator, because at the very least the number of chunks coming out of the allocator should be an indicator.
H: And again, I'd like to look at the latencies registered now.
H: Well, it might be an interesting experiment to try to switch allocators at this point, but this requires an OSD restart, and you'd want to make sure that the restart itself does not affect the picture. So if an OSD restart preserves the performance drop, then you can try to switch the allocator, just to make sure that's exactly the case.
A: Sure. There are a lot of things to test here, so I'm not sure what order I'm going to do it in yet, but I can see if that might be something we could try doing. It'll be a little tricky with the way CBT works, but possible.
A: One thing I did want to bring up — Igor, Adam — if you look at the allocator test tab (that's the second-from-last one, right before Adam's tab in that spreadsheet), in those tests, looking at Pacific and Quincy — and this isn't with the revert or anything — it looks like the AVL allocator actually didn't do that much differently. It was a little different, but not that much different between Pacific and Quincy. But the hybrid allocator looked much, much worse.
H: But, on the other hand, the performance drop might be caused by the switch to the bitmap allocator, or by a sort of duplicate allocation attempt: you need a four-megabyte chunk, you first go to the AVL allocator to get the required space, and you get no space since there are no large-enough chunks, and then the hybrid allocator falls back to the bitmap allocator.
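The fallback path just described can be sketched like this (the names and the stand-in allocators are illustrative, not the actual Ceph classes):

```python
def hybrid_allocate(avl_alloc, bitmap_alloc, want):
    """Try the AVL allocator first; fall back to the bitmap allocator
    when no single large-enough extent exists, in which case the space
    may come back as many smaller chunks."""
    extents = avl_alloc(want)
    if extents:
        return ("avl", extents)
    return ("bitmap", bitmap_alloc(want))

# Stand-ins: the AVL side has no contiguous 4 MiB extent, while the
# bitmap side satisfies the request out of scattered 64 KiB fragments.
avl = lambda want: []
bitmap = lambda want: [(i * (128 << 10), 64 << 10)
                       for i in range(want // (64 << 10))]

src, chunks = hybrid_allocate(avl, bitmap, 4 << 20)
print(src, len(chunks))  # the 4 MiB request becomes 64 separate 64 KiB chunks
```

This is the scenario where a single large write turns into many small extents on disk — both a CPU cost (two searches per allocation) and a fragmentation cost.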
H: So again — well, by the way, it would be interesting to get a free dump, a free-chunk dump, from one of the degraded OSDs and see what the free-chunk layout is in this state. I mean, we could even run some allocator simulation against this dump and see what the latencies for allocation are.
A: Is that something that we could see even if we don't have a performance drop? Would we still be able to see the fragmentation if we were to go digging and look at that?
H: Well, potentially we can see that sort of fragmentation. We have a couple of very simple assessments to learn how fragmented the state is, but honestly they are pretty straightforward and trivial.
H: In other words, we don't have good enough tooling to learn the fragmentation. So it's definitely doable once you have this free dump, but there's no tooling for that.
F: Well, I think there's a problem with assessing quality from allocator fragmentation, because I do not see a relation between how fragmented our free space is and how many fragments the allocator is providing to objects that are just being written. Certainly the info of how many allocations...
A: So I'll definitely see if I can get data out of that. Actually, I'm running right now with Quincy — well, this is the revert, so I don't know... I guess it's sort of useful, because it will tell us the case where it's not problematic. How do I look at the histogram?
F: Well, I'll send you the command offline.

A: Yes, please — include me on that one. I was about to ask the same.
F: I guess we have two investigation paths. One is that the allocator is somehow consuming more CPU now, and that is affecting performance. The second investigation path is that the quality of the data provided by the allocator makes it more difficult for other parts of the software stack to efficiently use it to write to the disk.
H: So it's pretty similar to dumping performance counters, but you should use the histogram keyword — "perf histogram dump" instead of "perf dump", something like that.
A: Okay, I got something. I can narrow this down to allocate-from.
A: We didn't backport that to Quincy, though — which version could you see... oh, this isn't that old, okay. Or this is old, okay, so...
H: So actually, my PR is built on this. There's a visualization tool — it was just checked, yeah.
L: Yeah, I've never used it successfully before, so I can't say, but it might be something that has to be figured out later. I just know it exists.
A: I think it's trying to use stuff in the PATH which, on the system you're running on, isn't the case, because this stuff is in /usr/local/bin. Does this give me the ability to change the commands' path?
H: Oh, okay, yeah, but it's not merged yet. What's good about this visualization tool is that it should be available on the machine, so it might be available for you right now once you make...
A: Sure, okay. Well, since we don't have that right now, would anyone care to double-check my numbers? I'm looking at the paste on line 158. It looks to me like we've got like 23 million entries in, I believe, the 4-to-8k range, based on this.
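For double-checking numbers like this, the 2D histogram from the admin socket can be reduced in a few lines. The sketch below runs against a made-up miniature payload; the real dump's axis names and bucket ranges should be read from its metadata rather than assumed, as they are here:

```python
import json

# Made-up miniature of a 2D perf-histogram payload: values[i][j] counts
# operations whose requested size falls in row bucket i and whose
# resulting size falls in column bucket j.
dump = json.loads("""
{"values": [[10, 0, 0],
            [0, 23000000, 0],
            [0, 2, 5]]}
""")

vals = dump["values"]
row_totals = [sum(row) for row in vals]                 # entries per request-size bucket
diagonal = sum(vals[i][i] for i in range(len(vals)))    # request size == result size
total = sum(row_totals)

print("per-row totals:", row_totals)
print(f"diagonal share: {diagonal / total:.4%}")
```

A diagonal share near 100% would mean requests were almost always satisfied at exactly the requested size — the pattern Igor is asked about just below.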
H: Well, I believe it doesn't make much sense to analyze a single dump. It's better to compare good numbers versus bad ones — that would be much more helpful.
F: But Igor, I have a question here. I see values only on the diagonal of this array. Does that mean that, basically, we were always given the size that we requested from the allocator?
A: But I wonder if what this is really telling us is that we were fragmenting really quickly because of this other PR, and so we dropped down into this really low performance state.
H: And so it might be the drive.
F: We can have totally different software in the flash translation layer. I could even imagine a drive that very dynamically relocates even single sectors, so that whatever order I write in, I basically always write into a contiguous region, while another drive doesn't do that. I mean, that's just speculation, but it could be implemented that way.
A: Yeah. Unfortunately, that means this PR may be an optimization in your case and a major regression in my case on Mako.
F: I will do that. Unfortunately, my last version was Pacific, so I have to recreate everything for Quincy again.
A: All right, well, I think that's it for now, then. Does anyone have anything else they want to talk about this week before we leave?
A: All right, well then, thank you for coming, and have a great week. Talk to you later.