From YouTube: Ceph Performance Meeting 2021-07-15
A: One of our guys on our side has been playing with it, and he wanted me to help him a little bit with the performance debugging since it was going kind of slow. So we ran gdbpmp on it, and it looks like there's a worker thread pool associated with it that looked really busy to me, given the throughput rate that he was seeing.

A: I don't know very much about it, so I wasn't sure exactly what the expectation was.
A: Okay, well, I had to send it off to the developers, and to Ilya on our side. So hopefully smarter brains will take a look.
A: He ran both fio tests with librbd, the rbd engine, and then also ran rbd bench and some Python. He wrote a little Python program that just used rbd directly for, you know, doing stuff, and all of them kind of seemed to show similar issues: it was actually a little slower than just running rbd natively.
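For reference, an fio job using the librbd-backed rbd ioengine, along the lines of the testing described, might look something like this (the pool and image names are placeholders):

```ini
; Sketch of an fio job for the rbd ioengine; pool/image names are hypothetical.
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=testimage
direct=1
time_based=1
runtime=60

[rand-write-4k]
rw=randwrite
bs=4k
iodepth=32
```

A roughly comparable rbd bench invocation would be `rbd bench --io-type write --io-size 4K --io-pattern rand rbd/testimage`.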
A: So maybe it's a configuration issue, or something else strange was going on, but that seemed to be the prevailing theme, at least on the test setup that he had seen.
B: I've been focusing on both rbd workloads and a little bit of CephFS, actually kind of getting my hands dirty with IO500 testing and working with the former MUSIC guy, Dennis Nujab. We've been kind of looking at ways to tune CephFS.
B: You know, in the entire system, kind of improve the performance, as well as kind of figuring out where Optane fits in all of this, either the SSD form factor or the persistent memory version. So it's kind of been a lot of experimenting and a little bit of hacking, but yeah, that's kind of what I've been working on lately.
A: Yeah, if you see anything interesting, let me know. I was the one that submitted our IO500 results for Red Hat.
B: Actually, I was wondering about that: was that with replication, or...?
B: Yeah, yeah, that's what we figured, and we were thinking about that too. It's interesting.
A: But yeah, there's lots of... I had to do a lot of crazy stuff to get those results. Like, honestly, our numbers were real, but that was probably taking the best of like ten runs, and there's a lot of variability with the way this FS works. You can end up in a situation where, early on, if you get good distribution through the dynamic subtree partitioning, you avoid having it, like...
A: So if you can't acquire locks, with the way that it works early on, you might end up in a situation where it can, like, never acquire locks as it's trying to distribute subtrees across all the different MDSes, and if that happens then your score can just tank. But if you happen to luckily grab them early on, then you can actually get, like, a score.
A: So honestly, if you can fix that... I tried, and I sort of made it better, and some of the work that was being done by u-kernel looks like it could really be relevant here. If you can fix that, you'll get, like, way better scores than we have already, even.
A: All right, I don't think that call actually ended; they must be chatting about stuff over there, so maybe we'll just get started. I imagine they'll get in here sooner or later. All right. I was surprised: we have, like, a ton of PRs that came in from a guy at IBM in the research group, and this is like out of the blue, but it's like catnip for me, because he's looking at memory allocation optimization and, like, avoiding memory copies and all kinds of other random stuff all over the place.
A: Most of it is focused actually in librbd, but some of it is relevant for other stuff, so I won't get into each one of these individually, but go take a look. I'm, like, impressed and very interested in what he's doing. Ori, I think his name is; I hope I got it right. Oh, and then I've got my stupid little PR that increases the osd client message cap, which is, you know, like a one-liner that now needs to be rebased.
A: But that's, you know, based on the stuff that we talked about last week. So that's... yeah, fine, that must be closed. Kefu's AVL allocator merged; I think it got some reviews, good. So that's neat.
A: Oh, sorry, that was not the AVL allocator; I think that was just changing some options here. Sorry, he had something else, a btree thing maybe, that already merged; that's what I was thinking of. Okay, updates to the mgr TTL cache: there were results that were posted a little while ago, then there wasn't a whole lot going on with it, and then it just got some updates recently. I don't know what those are, but that seems to be just kind of moving along, and hopefully it'll make the manager faster.
A: So that's great. Igor's caching removal optimization PR: it looks like that may have passed Kefu's testing; the failures he saw, I think, were unrelated. So that's really good, and it got approved by someone I don't know, but that's good. So hopefully that merges soon. Yeah, that's excellent!
A: No movement on Adam's BlueFS fine-grained locking; I don't think he's worked on that much. Adam, is that still "do not merge"?
A: My sharded object cache in RGW still hasn't gotten a review. I need to go bug Mark and the other RGW guys over there and see if I can get somebody to look at it. It's not great, but I mean, I think it's probably better than what we've got. I know that Matt really wanted to completely rewrite the cache, but this, you know... it's just a stupid wrapper around the existing one, so it's pretty simple.
C: Let's see, a lot of this other stuff is old. My age binning PR still needs to be updated.
A: And merged. I need to work on that, but it's still there. It was updated a couple months ago or something, when Adam and I were working on his stuff, trying to see if we could get it running on top of his stuff.
A: Okay, yeah, I mean, there's other stuff here that we need to look at again. Like, Adam and... Igor's not here, but we still need to make a decision on what to do about the pinning in the onode cache.
A: But we'll wait for you to be around, I guess. So, yes, I mean, that's really about it, I think. Okay, anything I missed, guys? It's possible I missed some closed stuff, because I don't know that I actually got time to look over it.
G: I have a quick general question: there's that big BlueStore change that basically drops the continuous tracking of allocations and rebuilds it on restart. What's the...
A: Okay, hey John. It looks like we kind of oscillate, depending on what we've been doing and what we've been optimizing, on whether or not the kv sync thread is a bottleneck for writes. So, like, we keep oscillating back to it being a bottleneck again, and this looks to me like it's really the real advantage here, right? We're just doing much less work there.
G: Yeah, yeah. I was just thinking whether it makes sense to, like, point people at that, but I'm guessing it won't matter that much, because the people who wouldn't benefit from it would be those who are writing large objects, and they won't be penalized by it either, because the restart rebuild will be quick.
H: I think we capped it so that, like, the max possible system would take less than one hour to recover, like a four-terabyte OSD.
H: It's correlated to the number of extents.
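The idea being discussed, rebuilding the allocator's free-space state at mount time by walking every object's allocated extents instead of persisting a free list continuously, can be sketched roughly like this (a simplified illustration, not the actual BlueStore code; note the rebuild cost scales with the number of extents, as mentioned above):

```python
# Simplified sketch (not BlueStore's implementation): reconstruct the free
# (offset, length) runs of a device from the set of allocated extents,
# the way a mount-time rebuild would, instead of tracking frees continuously.
def rebuild_free_map(device_size, extents):
    """Return sorted free (offset, length) runs given allocated extents."""
    free = []
    pos = 0
    for off, length in sorted(extents):   # O(n log n) in the extent count
        if off > pos:
            free.append((pos, off - pos))  # gap before this extent is free
        pos = max(pos, off + length)
    if pos < device_size:
        free.append((pos, device_size - pos))  # tail of the device is free
    return free
```

Large-object workloads produce few, large extents, which is why (as noted above) they are barely penalized: the rebuild walk is short.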
G: Yeah, okay. Yeah, it'll be interesting to get some real-world numbers on this, I guess.
D: ...when I get this first part merged, and then I also make the fast shutdown behavior work.
A: All right, cool. Any other PRs, guys?
A: Okay then, let's move on. Let's see, so this week the only thing I've got is that I've been spending some time looking at crimson again, just because we're trying to keep track every quarter of how we're doing with performance, and just generally getting it going, compared to last quarter.
A: Had some problems with crimson segfaulting, seemingly due to the most recent Seastar update, but got it working eventually by going back and then cherry-picking a couple of fixes on top of it. We're a little slower than we were before; that's not super unexpected, but a little irritating if we're trying to report how we're doing on it. I went back and did a bunch of wall clock profiling again; we're still completely bound by the reactor thread. Oh, and this is alienstore...
A: BlueStore, not seastore, so yeah, keep that in mind: some of this is alienstore related. What I saw is that we're spending a fair amount of time in the sharded work queue. This is both before and after the lockless queue implementation by Kefu: before, we were spending a lot of time in notify_one; after, we're spending a lot of time dealing with the semaphore being used by the lockless implementation.
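The wakeup pattern being profiled can be illustrated with a toy worker pool (this is just an illustration, not Ceph's ShardedThreadPool): producers hand work to blocked workers, and the wakeup mechanism, a condition variable's notify in one design, semaphore posts in the other, is exactly where wall clock time accumulates when the queue itself is cheap:

```python
# Toy work queue (illustration only, not Ceph's implementation).
# queue.Queue wakes blocked consumers via a condition variable on put();
# a lockless design would replace that with semaphore post/wait, but the
# wakeup cost still shows up in wall clock profiles either way.
import queue
import threading

def run_workers(tasks, num_workers=4):
    """Drain `tasks` with a pool of workers; return sorted results."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = q.get()      # blocks in the condition variable wait
            if item is None:    # sentinel: shut this worker down
                return
            with lock:
                results.append(item * 2)  # stand-in for real work

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in tasks:
        q.put(item)             # each put may notify a waiting worker
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
    return sorted(results)
```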
A: There seems to be very little performance advantage; well, maybe a little bit, but not much. So it's about the same as it was before.
A: There's a fair amount of time being spent in malloc and free in the reactor; a fairly big contributor to that is just dealing with buffers and bufferlists, but there's other stuff too, sometimes within Seastar itself, especially in the networking code. Then there's the mgr updates: the reactor is spending a lot of time, like maybe up to 11 or 12 percent, just dealing with updating the manager.
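The copy-avoidance theme (both in the reactor's malloc/free time and in the librbd PRs mentioned earlier) comes down to referencing buffer memory instead of duplicating it. In Python terms, purely as an illustration (Ceph's bufferlists are C++), the difference looks like:

```python
# Illustration only: copying vs. referencing buffer memory.
# Slicing a bytearray into bytes allocates and copies the payload;
# a memoryview slice references the same underlying memory, zero-copy.
data = bytearray(b"x" * 1_000_000)

copy_slice = bytes(data[:4096])       # allocates + copies 4 KiB
view_slice = memoryview(data)[:4096]  # no payload allocation, no copy

# The view reflects later mutation of the underlying buffer; the copy doesn't.
data[0] = ord(b"y")
```

The same distinction in C++ is a deep buffer copy versus a reference-counted `ptr` into shared memory, which is why buffer handling dominates the malloc/free profile.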
A: So that's unfortunate, and maybe something that we need to talk about a little bit more: how that happens and how much time we really want each reactor spending doing that. There's also lots of time spent dealing with eventfd; we were talking about that this morning. A little bit of Seastar uses eventfd for communication.
A: Okay, one thing that we were talking about this morning: he was kind of bringing up the point that BlueStore, maybe, in alienstore, is not necessarily the best thing to be using to measure efficiency gains with crimson, just because of the amount of time spent in BlueStore itself. But as I was thinking about it: in fact, like a hundred percent of the CPU time is being spent in the reactor, and only about 50 to 100 percent is being spent in BlueStore.
A: I think actually memstore's write path may be less efficient than BlueStore's at this point. So that's something that needs to be looked at more, and I think I need to go back and finally...
A: ...update the changes I was making in kstore to just apply to memstore instead, and pull the best of that stuff out and get it in, since kstore is basically dead, and has been for years. So it's not that much, but the biggest benefit will be, like, a vector-based object implementation in memstore that really seemed to be quite helpful. So anyway, that's what I've been looking at largely this week. Any questions or comments on that stuff?
A: I'll put out, like, a Q2 deck, like we did for Q1, hopefully in the next week or two here, so people can see some of the results of this testing and just kind of a general update.
H: And my code added a check against this; fsck, I don't think, is checking for this, and you only saw this complaint, this assert, coming from my code. So I just edited this check to be executed on mount without my code, and I saw the same thing happening. It just took me time to realize that the tests which didn't finish, it was because of this failure.
F: Wow, good job. Wait, where do you catch stuff?
A: What's a little interesting is that you're going to push the classic OSD's performance higher than crimson's with this. Like, the classic OSD is going to be faster yet than crimson with your change, because your change is going to make the classic OSD faster for small random writes, and crimson won't change, because we're bottlenecked by the reactor thread.
A: It'll just make it even a little more tough to make crimson... we have more work to do.
A: There is one, actually, yes, because we're wrapping BlueStore in alienstore, basically. So we still have the kv sync thread, we still have worker threads, but a lot of work that is being done in other threads in the classic OSD is being done in the reactor thread in crimson. So until we get multiple reactors, crimson is not going to change much when we do BlueStore improvements.
A: Which, yeah, without Radek and Kefu here, I guess it's not worth talking about it, but the more I look at this, the more I think we need to get the multi-reactor work going, at least if we want to be able to show anything that looks, you know, remotely like performance parity in the tech preview.