From YouTube: Ceph Performance Meeting 2021-10-21
Description
Find more Ceph meetings videos: https://ceph.io/en/community/meetups/
A: So one new PR this week that I saw is actually really interesting, for people who are interested in backfill and recovery performance. I don't know how to say this user's name. They're basically looking at a very narrow situation where you have a new OSD that requires a full copy from a primary OSD, and if that primary OSD's PG log entry count is smaller than osd_min_pg_log_entries, then it thinks that the new OSD can recover by PG log. It turns out that it's much faster if you have it handled via backfill rather than via recovery, just because of all the extra work that goes into recovery. So it's super interesting. They provide a whole lot of data and analysis in there.
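The decision being discussed can be sketched roughly like this. This is a toy model of the described heuristic, not the actual Ceph peering code; the function names and the `target_is_empty` flag are illustrative:

```cpp
#include <cassert>
#include <cstdint>

enum class Strategy { LogRecovery, Backfill };

// As described above: when the primary's PG log has fewer entries than
// osd_min_pg_log_entries, peering decides the new OSD can be brought up
// to date via log-based recovery.
Strategy current_choice(uint64_t primary_log_entries,
                        uint64_t osd_min_pg_log_entries) {
    if (primary_log_entries < osd_min_pg_log_entries)
        return Strategy::LogRecovery;
    return Strategy::Backfill;
}

// The PR's observation: a brand-new OSD needs full object copies either
// way, and backfill does that with much less per-object overhead, so an
// empty target should prefer backfill even when the log is short.
Strategy proposed_choice(uint64_t primary_log_entries,
                         uint64_t osd_min_pg_log_entries,
                         bool target_is_empty) {
    if (target_is_empty)
        return Strategy::Backfill;
    return current_choice(primary_log_entries, osd_min_pg_log_entries);
}
```

The point of the PR, as summarized here, is exactly the gap between the two functions: a short log makes the current logic pick the slower path for a target that is going to get full copies anyway.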
A: So I think Mia is going to take a look at it, but yeah, it's pretty neat. Let's see, there are four that I've got that were closed. I think this TTL cache for the manager got merged into another feature, which I think I need to go track down, but that was the only one that was closed and merged into something else.
A: This OSD compression bypass in favor of RGW compression got merged by Casey, and then Adam's BlueFS fine-grained locking PR got merged by Kefu, but it looks like it's causing lockdep failures in testing. I think Adam said in another PR that it was like a weird interaction or something that isn't maybe really a failure. I don't remember the details, but he's working on it. Okay, let's see, three updated PRs.
A: Let's see, for the MDS, there was a request to provide kind of an overview of that subtree removal PR, kind of like high-level documentation, and that was just provided recently, so that's there now. I think there's just some concern about how big that PR is. There are two librbd optimization PRs that were recently updated.
A: I think those both just needed a rebase, and there was at least one bug fix that went in there too, I think. So basically we'll have to re-review those, and that's about it. Did I miss any PRs, guys?
A: All right, the only things I've got today are a quick note that Jeff Layton submitted some fixes for the kernel client that we think may resolve the three-gigabyte-per-second bottleneck that we've observed previously. It looks like we were grabbing a mutex that we didn't really need to grab, and he's worked it out so that he can use spinlocks instead.
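The pattern described, replacing a sleeping mutex with a spinlock around a very short critical section, can be sketched in user space like this. The real change is in the kernel CephFS client, in C; this is purely illustrative:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Minimal spinlock: busy-waits instead of putting the thread to sleep.
// For critical sections that are only a few loads/stores, this avoids
// the scheduler round-trip a contended mutex can incur.
class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag_.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag_.clear(std::memory_order_release); }
};

struct Counter {
    SpinLock lock;
    uint64_t value = 0;
    void bump() {            // short critical section: the spinlock sweet spot
        lock.lock();
        ++value;
        lock.unlock();
    }
};
```

The trade-off is the usual one: a spinlock wins only while the hold time stays shorter than a sleep/wake cycle, which is why it fits the "mutex we didn't really need" case described here.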
A: So at some point Ilya is going to review it, and I'm going to try to do some testing on it. That should hopefully provide a nice performance boost, and we'll see if it gets us up to like the eight gigabytes per second that we see with RBD and libcephfs. Not totally sure if it will or not, but that'll be the goal.
A: So there's that, and then the other thing I've got is that gdbpmp has been useful for a long time, but it's not quite as useful as it used to be. It's now causing, and has been for a little while, our classic OSDs to basically brick when using it periodically with high thread counts.
A
You
can
use
it
fine
if
you've
only
got
like
one
tpos
dtp
thread,
but
once
you
have
more,
which
we
do
by
default,
it
can
cause
the
classic
osd
to
break
pretty
quickly.
So
it's
it's
value
has
diminished
somewhat
still
works,
fine
for
crimson,
actually,
interestingly,
but
not
for
classic
osd,
so
adam
a
while
back
had
made
a
version
or
a
similar
wall
clock
profiler
that
used
live
unwind.
A: He did some really clever things with parasitic code injection that actually make it very, very fast, but the code is pretty complicated, and I don't know, Adam, if I remember correctly, at one point you were having some issues. I don't know if those have all been resolved, but it sounds like at some point it was still kind of a little flaky.
B: It was problematic at some point in time. Then I fixed the ptrace attach procedure and it seemed to work fine. From then on I didn't see any problem with it, except of course the architectural and philosophical problem that there is this very clever, in a bad sense, injection into the other process, yeah.
A: Yeah, yeah. Well, I will say, though, that your version is much faster than anything I've come up with so far, but I'm working on trying to have the best of both worlds. So I started working on basically a port of gdbpmp, replacing the gdb part with libunwind as well, and it's both faster and works better than the gdb version, but it's not as fast as Adam's.
A: So I started digging into libunwind, since of course one of the first things I did when I got it working was to profile the profiler, and it turns out that almost all the time is spent getting a procedure name. libunwind is rereading stuff from disk over and over again. That was partially mitigated by utilizing some caching that they have built in, but not fully.
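The mitigation being described, avoiding repeated disk reads by caching the expensive name lookup, amounts to a memo table keyed by instruction pointer. A minimal sketch, where `ProcNameCache` and the `slow_lookup` hook are illustrative names and not libunwind's API:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Memoize an expensive resolver (a stand-in for procedure-name lookup,
// which in the scenario above was rereading symbol data from disk) so
// that repeated samples of the same frame hit the in-memory cache.
struct ProcNameCache {
    std::unordered_map<uint64_t, std::string> cache;
    uint64_t misses = 0;

    template <class SlowLookup>
    const std::string& resolve(uint64_t ip, SlowLookup slow_lookup) {
        auto it = cache.find(ip);
        if (it != cache.end())
            return it->second;                       // cheap path: no I/O
        ++misses;                                    // expensive path
        return cache.emplace(ip, slow_lookup(ip)).first->second;
    }
};
```

Since a sampling profiler sees the same hot frames over and over, even a simple cache like this collapses almost all lookups onto the cheap path.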
A: So I started digging through the libunwind code, identified the code where it was happening, and even tried fixing it to some extent. I've got a branch of libunwind where I go in and try to cache parts of the code that weren't utilizing the cache previously, but it's not really helping that much, and as far as I can tell, libunwind is now a dead project. On the mailing list, no one got back to me.
A
The
project
maintainer
didn't
get
back
to
me
after
a
week.
No
one
is
submitting
code.
Really.
I
think
the
last
commit
was
maybe
in
the
spring.
So
it's
it's
pretty.
If
it's
not
dead,
it's
it's
definitely
back
burnered.
A: So we have people at Red Hat, actually, that are working on elfutils, and they have their own method for unwinding things, libdw. So I abstracted the backend for this thing, and now we'll be able to use libunwind and, hopefully soon, libdw as an alternative, which is both supposed to be faster and is better maintained, by people at Red Hat and also other people as well.
A: So that might be the right way to go, but the good news is that this is actually working now, and it's better than gdbpmp was, so it's certainly an improvement already. I think if we can get the sampling fast enough, we can start doing some really interesting things, like maybe looking at trying to match sample periods with things that are happening in the OSD.
A: Maybe if, say, RocksDB compaction kicks in, or scrub kicks in, we'll be able to look at sampling periods that correspond with those events, but that's out in the future. Right now I'm just trying to get this working as fast as I can, and that's really all I've got. This is maybe gonna be a fast meeting, but I do see other people showed up, so I'll open it up. Anyone have anything they want to talk about this week?
A: I might pick on you a little bit, since you were talking to me earlier about the BlueFS locking things. Were you able to confirm that it was kind of a false positive?
B: Yes, and I fixed that, really. That was always because our lockdep feature is actually a symbolic feature, meaning if you have two different objects that basically have mutexes with the same name, they are interpreted as the same mutex. So if you take two locks on different objects in the same thread, it will always complain, and that was exactly what was happening.
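A minimal model of that false positive: if lock ordering is tracked by mutex *name* rather than by instance, two distinct objects whose mutexes share a name look like one lock. This is a toy checker to illustrate the behavior described, not Ceph's actual lockdep implementation:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Toy by-name lock checker: remembers pairwise acquisition order between
// lock *names* and flags (a) re-acquiring a held name and (b) acquiring
// names in an order inverted relative to what was seen before.
struct NamedLockdep {
    std::set<std::pair<std::string, std::string>> observed_order;
    std::set<std::string> held;

    // Returns true if acquiring `name` would be reported as a violation.
    bool acquire(const std::string& name) {
        bool flagged = held.count(name) > 0;   // same *name* already held
        for (const auto& h : held) {
            observed_order.insert({h, name});
            if (observed_order.count({name, h}))
                flagged = true;                // order inversion by name
        }
        held.insert(name);
        return flagged;
    }
    void release(const std::string& name) { held.erase(name); }
};
```

With instance-based tracking, locking `objA.lock` then `objB.lock` is fine; here, because both collapse to one name, the second acquire is flagged even though no real deadlock is possible.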
B: It just meant switching the order in compaction, and in addition I had to exclude compaction from another lock, which just didn't make sense, but it was not a problem, so I was fine with that. So that's why I was saying that maybe in the future it will be difficult to actually get that working with different locking schemes.
A: And Gabi, I'm going to pick on you a little bit too. It looks like maybe there's been good progress on figuring out what was going on with the allocator changes.
C: Sorry, what was the question?

A: Oh, I was saying I was picking on you a little bit. It looks like maybe you've figured out how to fix the issue with the allocator changes.
C: Yeah, yeah, so there was a problem with the hybrid allocator, caused probably by my code. I found a problem which could explain this corruption. Unfortunately, it's now impossible to recreate it, because there seems to have been some change in the code somewhere else, so this issue doesn't appear anymore. The problem was that when I save the allocation information, if there was a corruption in the file, but you could still open the file without failure, then I would start loading the allocations into the allocator.
C: I fixed this issue: now I'm using a temporary allocator when I'm reading from the file, and only if and when the whole allocation is found to be in good shape do I load it into the real allocator. So that should solve the problem, but in the last few weeks this issue has disappeared; we cannot recreate it.
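The fix described, decode into a temporary allocator and only commit once the whole file reads cleanly, can be sketched like this. The types and names are illustrative stand-ins, not the actual BlueStore code:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

struct Extent { uint64_t offset; uint64_t length; };

// Stand-in for the on-disk record stream; an empty optional marks a
// decode failure (corruption detected partway through the file).
using Record = std::optional<Extent>;

// Parse every record into a *temporary* structure first; the live
// allocator is touched only if the entire file decoded successfully.
bool load_allocations(const std::vector<Record>& file,
                      std::vector<Extent>& live_allocator) {
    std::vector<Extent> tmp;               // temporary allocator
    for (const auto& rec : file) {
        if (!rec)
            return false;                  // corruption: live state untouched
        tmp.push_back(*rec);
    }
    live_allocator.swap(tmp);              // commit only on full success
    return true;
}
```

The point is atomicity of the load: before the fix, a half-decoded file could leave the real allocator populated with a partial, inconsistent view.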
C: I suspect it could be one of two things. Either the tests or the OSDs changed somehow, and they are not as aggressive as before, so this corruption or this failure is not happening; or the BlueStore code was modified and it's now able to better recognize corruption in a file. In theory, it should never happen: BlueStore should recognize if a file is corrupted and not give it to me.
C: But everything here is just speculation, you know.
B: So I'm interested in actually trying to replay that problem, because having caught an error in BlueFS file consistency would be great news for me. That would at least give some hope that errors we see in the field regarding reading from corrupted SST files could be attributed to this error.
C: A few other things. The error that my fix addresses is the file somehow getting internal corruption. It could mean that in the middle of a write, if I open the file to write, and the write is long enough, and partway through we kill the system, that would happen. But if you change the test, for example, and the allocation information were shorter, such that a single write would include the whole information, you would never see this problem.
C
You
need
to
have
big
enough
systems,
so
I
don't
know
if
anybody
changed
the
test,
for
example,
and
if
before
you
would
have,
I
don't
know
hundreds
of
thousands
of
location
information,
and
now
you
got
only
like
1
000
of
them,
which
could
fit
in
a
single
right.
C: I think I'm writing up to 4K of allocation information in a single write, so you get it all or nothing. But if you do more than 4K, then you could succeed at the first one and fail at the second one. It's like a race issue, but it could happen. So if the test was modified and the size of the system was shrunk, that could explain it.
C: Actually, sorry, I take it back, Adam. I send a sync request at the end of my write.
C: If the information fits in a single extent, in a single chunk, then I do a single write and sync immediately, and I get all or nothing. But if I do multiple writes, then if I'm crossing the sync point, the system might start flushing the data, and eventually, when I get to the end, I'm going to miss it. Yeah, it requires inspection. Again, everything is just speculation.
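The torn-write scenario being speculated about can be modelled simply: writes up to one chunk land atomically, but a record spanning several writes can be cut off by a crash between them. A toy model under those assumptions, not BlueFS's actual write path:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each write of up to `chunk` bytes is atomic; a crash between the
// chunk writes of a larger record leaves a torn (partial) record.
struct TornWriteModel {
    size_t chunk;
    std::vector<size_t> durable;   // sizes of writes that reached disk

    // Write `total` bytes, simulating a kill -9 after `crash_after`
    // chunk writes (SIZE_MAX = no crash). Returns true if the whole
    // record became durable.
    bool write_record(size_t total, size_t crash_after) {
        size_t written = 0, n = 0;
        while (written < total) {
            if (n == crash_after)
                return false;              // crash mid-record: torn write
            size_t w = std::min(chunk, total - written);
            durable.push_back(w);
            written += w;
            ++n;
        }
        return true;
    }
};
```

This matches the reasoning above: shrink the allocation information so it fits in one chunk and the failure mode vanishes, which is one way a test change could make the corruption stop reproducing.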
C: We could recreate this issue once. Did you commit your change, the one about the sanity check that you had enabled in debug mode? Is it committed?
B: Oh yes, that's fixed. There's a PR; it's not committed yet. It's still on that lockdep issue from the BlueFS fine-grained locking.
C: Okay, so I can write some internal code on my box. I've got this huge allocation information, with like 500 million extents, so I could insert some assert, or some forced abort, or send myself signal 9.
C
After
writing,
say
hundred
million
extends,
and
then
we
could
see
if
bluefs
is
going
to
reject
the
file
on
startup.
A: Cool, all right. Anyone have anything else they want to talk about this week?
A
If
there's
nothing
else,
then
we'll
end
a
little
early
and
can
get
ready
for
their
their
long
weekend.
Here
at
least
people
at
red
hat.