From YouTube: Ceph Performance Meeting 2021-07-01
A
New pull requests this week: I only saw one new one that showed up in my list, and that is Adam on the core team has a new PR, marked do-not-merge, implementing fine-grained locking in BlueFS, and I suspect that is maybe based on some work that I think Majian Peng did earlier.
B
Basically, it is working: it passed tests and it passes various testing, but still, I'm not really satisfied about the possibility of deadlocks. I have to find some systematic way of demonstrating that this code will not cause any deadlock, because basically one global lock has been switched into four locks for different parts of BlueStore, and it's not really a progression from least locking to most locking like in kernel layers.
B
Different domains of action require different locks. So that's why deadlocks are really possible.
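For illustration of the lock-ordering problem being discussed: when one global lock is split into several, a standard way to rule out deadlocks is to acquire whatever subset of locks an operation needs atomically, for example with C++17's std::scoped_lock, rather than nesting them in varying orders. A minimal sketch, with hypothetical lock names rather than the PR's actual members:

    #include <mutex>

    // Hypothetical stand-ins for finer-grained locks carved out of one global lock.
    std::mutex dirs_lock;   // directory metadata
    std::mutex log_lock;    // journal/log state
    std::mutex nodes_lock;  // file node state

    void update_dir_and_log() {
        // std::scoped_lock uses a deadlock-avoidance algorithm when locking
        // multiple mutexes, so two threads taking overlapping subsets in
        // different textual order cannot deadlock on these locks.
        std::scoped_lock guard(dirs_lock, log_lock);
        // ... mutate state protected by both locks ...
    }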
So I would like to review it more before actually pushing it on anyone. I mean, it's free to look at; mostly I put it there because Gabi wanted to review it. So it's free to look at, and please comment if you see anything, but it's not ready for rigorous review.

Out of curiosity, have you tried running it through Teuthology yet?

No, I did not run it through Teuthology.
A
It tends to hit stuff like that pretty well, but not always. Well, excellent. That was the only new PR that I noticed this week for performance. For closed ones, let's see: there was a little PR for changing a long double in common to double. I forget even what code that was in, but yeah, it's fine, makes sense, maybe a little bit of a win.
A
Patrick merged something in the MDS to flush the MDLog when requesting the read lock. I do remember looking at this a couple weeks ago; I don't remember very much about it, but it must have been good because Patrick merged it, so that's excellent. And then the btree allocator that Kefu wrote: Adam, it looks like you reviewed it and liked it, and Kefu was satisfied and he merged it. So that's excellent.
B
Yeah, we are now waiting for a continuation, which will be merging the btree allocator with the bitmap allocator, forming a hybrid. But for that I'm owing Kefu rigorous testing of the possible scenarios and the threshold at which we should switch from btree to hybrid mode.
A
Ah, that's actually sounding quite a bit like the path that some of the file systems ended up going down. So I think we're on the cutting edge, or close to it, now. That's good.
A
All right, let's see. The manager time-to-live cache implementation: that PR has updated testing results, which is nice to see. They included some graphs, and it looks like the cache is helping, so that's nice to see. I think there were more discussions on the AVL allocator PR, which has not merged yet. And then...
A
Of course, the PG removal optimization PR: it looks like that's maybe ready for testing, or no, it is in testing; Kefu has it in testing now again. So that's good. Maybe we'll see that merge soon.
A
And that was all I saw for new and updated PRs here. Anything I missed from anybody?
A
All right, well then, let's see. The only real topic I have for today is that for a while Josh and Neha were waiting on me to run some tests looking at the OSD client... what's it called... the osd_client_message_cap parameter.
A
We previously had this set to 100 for many years, and then, for kind of an unknown reason, I guess we disabled it completely and got rid of it, so there was no cap anymore. But I guess that was causing problems in some cases in Teuthology tests, and, I don't know, maybe actually in running clusters. Neha, do you or Josh know: was that actually a customer?
C
No, at least one case that I encountered was one of our RocksDB-related tests in Teuthology that started failing, and it was just that the OSDs were hitting a suicide timeout, and setting this option actually helped in that case, which is what made me think that it does work and it does a good job of, you know, throttling things at the messenger layer. That was one recent example that I remember.
A
That does not surprise me, given how things are architected right now. So, okay, I'm glad that we weren't specifically seeing customer issues with it. Although I wonder if maybe... oh.
C
We did see customer issues; the customer issues were there, but the code wasn't there. That was the problem, so we couldn't really ask them to, you know, even set this and see if that helped. But we've now finally gotten this code available in the releases that people are actually using. Since then I don't think we've hit that, so it's not been a long runway, but at least it's there now for us to exercise if we run into such issues.
A
Yeah, I was going to say that the suicide timeout does seem like something that people have been seeing in the last couple of years, and maybe it's been more predominant since we disabled that. But yeah, well, in any event: I've owed you guys this for like a month or two, I think, and over the weekend I finally just went through, went bonkers, ran a bunch of tests on it, and pasted the results in the chat window.
A
So, for anyone that wants to look: I personally don't find it super interesting. It's kind of what you'd expect to see; at really small message caps it's slow. Not... not...
A
Actually, maybe one surprising thing in this is that it's faster in a number of these tests, even with a really small message cap, than I expected, mostly on sequential workloads, which makes sense, right, because you might have, like, merging of the bios happening or whatever. So that, you know, is maybe less surprising. But in general, what we had set before, 100, was not super unreasonable.
A
We do a little bit better if it's set a little higher, so, like, in the core standup we were talking about maybe 256 just being the new default. But, you know, really, some reasonable cap... yes, this is kind of just straightforward; there's no reason not to, in my mind at least, based on these results.
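For reference, the option under discussion is osd_client_message_cap, and it can be adjusted at runtime; a minimal example using the value floated above (256 is the candidate default being discussed here, not a settled recommendation):

    # Cap the number of in-flight client messages each OSD will accept.
    ceph config set osd osd_client_message_cap 256

    # Confirm what a given OSD is actually using.
    ceph config get osd.0 osd_client_message_cap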
A
Yeah, both RBD and RGW kind of follow similar patterns. Nothing real super exciting.
A
What I will say, though, actually, is that we're seeing this really weird behavior with 128K random reads, where it's oscillating; it's almost like a bimodal distribution, where half the time it's fast and half the time it's slow, and there doesn't seem to be a whole lot of rhyme or reason to it.
A
So that's something that needs to be figured out, and I don't know if that's related to throttling or not, but we've always had really kind of strange issues with that I/O size, around there, like these middle I/O sizes, and it's never been totally clear to me why. But that is what it is; more work needs to be done there.
A
The other thing that came through in these results is that for large sequential and random writes, like four-megabyte ones, we're getting between 13 to 15 gigabytes per second, and on Pacific I was seeing closer to 20 to 25. So we may have a sequential regression in master that I need to go look at and track down, or at least figure out why...
A
...why it's happening. So there's that too. But what really came out of all of this, in my mind, is a couple things to look at and maybe fix, and, you know, some message cap value of, like, you know, between 100 and 500 is probably fine.
D
I agree. I think we have something very coarse-grained in the existing OSD backoff code, which only kicks in when there are cases where we know that we can't service operations for a long time, like an object needs to be recovered, or the PG is peering or otherwise inactive.
D
And then we ask the clients to back off for a while until we tell them it's ready to go again. But yeah, I agree that better flow control would be quite helpful. The message cap is kind of a very simple throttling approach, but something that, you know, doesn't cover all cases, and it doesn't really prioritize things or provide any level of, like, quality of service that you might get with a better or more complex queueing algorithm.
A
Sure, but, like, they could... they could sleep, right, or do something. Like, you could just tell it, "I don't want you to contact me for, like, two seconds or something," and then it could, you know... sure, it wouldn't be hanging around; it would just be like, "Okay, fine, whenever. I'll give you other work, or I'll just sleep here."
D
It would be good to learn more about different kinds of flow control schemes; I'm not actually familiar with many others myself.
A
Yeah, I actually just this morning was looking at, like, TCP, and, you know, the original implementation was just like: okay, you send a frame back and say, "okay, I want you to wait for some period of time," right? But then at some point they did a much more interesting, like, QoS-type scheme where they could have different kinds of... well, I don't actually know that much about it, other than what it looked like it was.
D
It might be more interesting to think of this in different contexts for the future interfaces we have, for stuff like any protocol changes we want to make for Crimson, or, with NVMe-oF, any more direct connections.
D
Yeah, I guess I don't even know the different things that more complex flow control could help with, since I'm not very familiar with the algorithms.
C
Something along these lines: George, remember there was some discussion about doing some flow control at the RGW, just for the RGW workloads? There was some PR from somebody who was trying to implement something of that sort.
G
dmclock is used in the Beast frontend; they were just discussing it in an email today, but apart from that, it's still under discussion.
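For context, the dmclock-based scheduler in the RGW Beast frontend is selected with the rgw_scheduler_type option; something along these lines (illustrative values, and the dmclock scheduler was still experimental at the time):

    # ceph.conf snippet: switch the Beast frontend from the default
    # "throttler" scheduler to the experimental dmclock scheduler.
    [client.rgw]
    rgw_scheduler_type = dmclock
    rgw_max_concurrent_requests = 1024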
F
I think that... my recollection of the Arm talk was just that they increased that and it increased performance in their tests, and it wasn't clear to me that there was really a downside of just increasing it in general. But I think there's also...
F
Yeah, just keep it high enough that it won't affect most people. I think the only deal, my recollection is, is that the message cap and the memory cap are both global, which means that if you have one client that comes in first, it can fill up the queue effectively, or fill up that cap, and then once other clients come in, they'll have a harder time sending messages.
F
So it's a little bit less fair in that case, even though they're, like, dequeueing underneath, so... but that should even itself out over time; it's like a first-mover advantage or something.
C
Yeah, did you see: the Arm group, are they already using it? What value are they using? Do you know?
A
It surprises me a little bit, though, right? Like, if they changed it from, like, a hundred to a thousand... I mean, a thousand, just think about that, right? That's a huge cap, and it's, like, crazy. Once you get past, like, a hundred... even, like, 200 to 300, okay, maybe, but that's...
F
Well, the other part of that talk: there's a whole list of stuff, but most of the good news is that most of the stuff they mentioned has already been fixed. There is some, like, page-size stuff with 4K pages, but it sounds like basically they've addressed the big problems, and basically pumping up the page size on Arm got, like, a 10% improvement or something, which is pretty nice.
F
Even the write-amp thing: I think the recent RocksDB direct I/O write re-caching behavior thing does the right thing. But there's one other thing that we haven't looked at, and that's CPU partitioning, and this was something that I just, like... I don't even think I realized that it was a thing, or at least I hadn't ever thought about it in much detail. They basically set it up so that the threads that are processing, like, the op queue versus the, like, BlueStore dispatch or KV thread or whatever, are on different cores but on the same socket, and they pinned them there, so that the, like, division of labor would work better. They managed to get a decent bump from that.
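To make the experiment concrete: the kind of pinning described here can be reproduced from the shell with taskset, given the OSD's thread IDs. A rough sketch, assuming OSD 0 and purely illustrative core numbers (tp_osd_tp and bstore_kv_sync are the OSD worker and kv-sync thread names):

    # List the OSD's threads with their thread IDs.
    ps -T -p "$(pgrep -f 'ceph-osd .*--id 0')" -o spid,comm

    # Pin the op-queue workers and the kv-sync thread to different
    # cores on the same socket (cores 2 and 3 here, illustrative).
    taskset -cp 2 <tid-of-tp_osd_tp>
    taskset -cp 3 <tid-of-bstore_kv_sync>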
F
It might have been, yeah, but without pinning it might be running them on the same cores, I guess, so, like, it'd do one queue and then do the other queue or whatever. And they basically forced them onto separate cores, which made me wonder if that's something that we could teach the OSD how to do automatically.
F
I don't know how... I mean, we've... so, they also talked about NUMA pinning. Their machine had, like, the network and the PCI devices, or the NVMes, directly attached to different sockets, and so all the automatic NUMA stuff just worked, which was, like, great news, because I don't know if I've ever actually managed to test that on a system that had, like, a balanced NUMA architecture.
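For reference, the automatic behavior being described is governed by the OSD NUMA options, and there is a command to inspect the result; a quick sketch (option and command names as in recent Ceph releases):

    # Let each OSD pin itself to the NUMA node shared by its network
    # and storage devices, when the topology allows it.
    ceph config set osd osd_numa_auto_affinity true

    # Or force a specific node, then check what the OSDs ended up with.
    ceph config set osd osd_numa_node 0
    ceph osd numa-status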
F
If we see that the socket, the NUMA node that they're on, has multiple cores, we divvy up the OSD across those cores.
A
I mean, yeah, you'd hope that in, like, write-heavy scenarios, where the kv_sync thread is basically just, like, pegging the CPU, like taking one core, nothing else would get scheduled on the same core, because it's just, like, consuming the entire thing. But that means we trust that, you know, the Linux scheduler is actually, like, sane, and, you know, I don't know if that's a reasonable assumption, I guess.
F
And maybe the thing to do is, like, replicate the experiment and then try to understand what's going on. Like, what's the behavior when you pin it, what's the behavior when you don't pin it; like, how much, what threads are getting scheduled where. And maybe there are some kernel tunables that control the, like, heuristics in the kernel's scheduler that might...
A
I mean, on Arm... I think, and I might be totally making this up, but I thought Arm tended, compared to x86, to have higher context-switching overhead.
A
And also not restricting threads to any particular NUMA node or anything; it was just, like, random.
A
It'd be nice if they had any data on what was going on in those tests, like whether or not they were seeing a lot of processes moving, and where they were moving between. We could do it in-house.
F
It's no different than the non-containerized case, though; the OSD, sort of, in the NUMA case, it's doing it itself: it's setting its own affinity, or whatever, to confine itself to a node based on the devices it's using.
F
Yeah, I mean, in cephadm, the container, and I think even in Rook, the container has all these system privileges, so it can look in sysfs.
F
I was going to say: probably at the point where we start trying to have the container orchestrator direct it to particular CPUs, or set limits or whatever, then we'll have to make sure that these two things play well, play nice together. But so far, you know, orchestrators haven't done any of that.
D
It's more complicated because they have kind of a different concept of CPU sets; that's, like, what the Kubernetes scheduler controls.
D
So
it's
unclear
if
I
guess
we
might
need
to
play
nice
with
you
with
that
versus
having
the
osd.
Do
it
itself.
F
Okay, but at the point where Rook, like, sets some scheduling properties that tell it to pick a CPU, or use this many CPUs or whatever, then Kubernetes will go and allocate something. The approach we took with the memory restrictions is basically that cephadm just tries to mimic the same container inputs that Kubernetes does, and so from the daemon's perspective it looks the same.
A
I
suppose
one
downside
of
this
right
is
that
if
you're
pinning
the
tpos
ttp
threads
and
then
the
the
also
the
the
qb
sync
thread
to
specific
cores,
it
does
kind
of
take
away
the
ability
to
kind
of
dynamically
adjust
things
right
like
your
rock
cb
compaction.
Threads
come
in
that
all
of
a
sudden
are
using
a
bunch
of
cpu
or
other
stuff
coming
in
using
cpu
scheduler
can't
be
like
magically
rearranging
it.
A
There's been some desire to do Graviton testing on AWS; I wonder if it'd be worthwhile to try doing this there, just to have another Arm setup.