From YouTube: Ceph Performance Meeting 2020-08-13
A: So, all right, not a whole lot of new stuff here. There were two new PRs that I saw this week. One was from me; it was a result of the collaborative work that Radek and I did looking at trying to make bufferlist appends faster. We ended up with two complementary changes that I think are both worth doing, but this is only the first one, which is the smaller change.
This basically changes it so that in refill_append_space we dynamically choose the length of the append buffer we're adding, based on how big the old one was. So it's kind of like how vectors grow in C++: they double each time. That's more or less the same thing we're doing here.
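To illustrate the growth policy being described, here is a minimal sketch; the names and bounds are hypothetical, and this is not the PR's code (the actual change lives in Ceph's buffer code, in refill_append_space):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>

// Hypothetical bounds for the sketch, not the values used in the PR.
constexpr std::size_t kMinAppendSize = 4096;      // start at one page
constexpr std::size_t kMaxAppendSize = 4u << 20;  // cap growth at 4 MiB

// Pick the size of the next append buffer from the size of the one we
// just exhausted: double it, but clamp it so a long run of appends
// cannot allocate ever-larger chunks without bound.
std::size_t next_append_size(std::size_t prev_size, std::size_t need) {
  std::size_t target = std::max(prev_size * 2, kMinAppendSize);
  target = std::max(target, need);  // never less than the caller needs
  return std::min(target, std::max(need, kMaxAppendSize));
}

int main() {
  std::size_t sz = 0;
  // Simulate a stream of small appends: allocations go 4 KiB, 8 KiB, ...
  // doubling until the cap, so reallocation happens less and less often.
  for (int i = 0; i < 12; ++i) {
    sz = next_append_size(sz, 64);
    std::printf("allocating %zu-byte append buffer\n", sz);
  }
  return 0;
}
```

The point of the doubling is that the number of reallocations for n small appends drops from O(n) to O(log n), at the cost of some over-allocation, which is why a cap matters.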
A: We do bound it, but the end result is that hopefully we're doing that reallocation much less often when we have really big buffers with a lot of appends happening, and it seems to help fairly significantly. The only interesting thing is that tcmalloc actually ends up slowing down at some point when you have larger append sizes, and I don't know exactly why. Where that happens changes depending on how we tweak different things, but we consistently seem to hit it sooner or later. Overall it still seems to be far better than what we have in master, though, especially for appends, and with the ring buffer it's even better yet, especially when we stay within the ring buffer and don't go outside of it. If we have too many bufferlists at once we can build up memory, but overall the combination of the two, at least according to our microbenchmarks, looks really good in practice.
I see a lot less tcmalloc overhead, but it's not really speeding things up yet, as far as I can tell, at least in our test case with the MDS. So anyway, that's kind of where that is at. I guess, sorry, I just went over the whole thing, but that's that. Any questions or thoughts on that PR?
So this may actually help if you guys have any cases where you're doing, you know, lots of little appends. I don't know if you do, but that was what we were noticing in the MDS, that it was doing a lot of...
A: All right, the next PR that I've got under new PRs is one from Yan, and this is shoring up some of the ephemeral pinning work that just went in a little while ago. The idea here is to distribute dirfrags, so that's kind of like what I was doing with the pre-sharding and pre-distribution of dirfrags, but they're doing it for ephemeral pinning. And theoretically, I guess the idea here is that we're going to reduce the number of subtrees because of it. So anyway, we need to see how that works, see if it works well or not, and that was it for new PRs.
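For context, distributed ephemeral pinning is driven by a CephFS directory vxattr. A minimal sketch of turning it on, assuming a CephFS mount at a hypothetical /mnt/cephfs and the ceph.dir.pin.distributed vxattr from the ephemeral pinning work (this is just the standard setxattr call, equivalent to setfattr on the command line):

```cpp
#include <sys/xattr.h>
#include <cstdio>

int main() {
  // Hypothetical mount point; adjust for your cluster.
  const char *dir = "/mnt/cephfs/home";
  // Ask the MDS to spread this directory's fragments (dirfrags)
  // across ranks, rather than pinning the whole tree to one rank.
  if (setxattr(dir, "ceph.dir.pin.distributed", "1", 1, 0) != 0) {
    std::perror("setxattr");
    return 1;
  }
  std::printf("distributed ephemeral pin set on %s\n", dir);
  return 0;
}
```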
A: For updated PRs, I think Adam and I both agreed that we should use Jianpeng Ma's PR to avoid flushing too much data at once in BlueRocksEnv. That means QA; hopefully Kefu can add it to his QA suite. Then this D3N caching thing got reviewed by a couple of people and has had updates based on that review; that's still in review, and it will be in testing assuming everything looks good. And then Radek is still, I think, looking at Jianpeng Ma's reduced-bufferlist-rebuild PR in BlueFS, trying to do that more generically than just in the BlueFS code. So he's working on that, I believe, and that was it for updated PRs.
C: All right, so let me just quickly first link what I'm talking about in the chat, so people can take a look. This week I came across a paper published at the FAST conference, like, five months ago, I think back in March, and this paper talks about sort of an extension to the CRUSH algorithm.
So first they discuss some weak points of CRUSH in the case where there's a cluster expansion and then a lot of data migration happening; that was basically their motivation for making that extension to CRUSH. And the idea is adding sort of a centralized authority that is additionally responsible for the data migration during cluster expansions.
So that was the main idea, and they introduced a number of ways to handle the load balancing. At the end they talk about some of the results they got, and it seems like the algorithm they introduced achieves much better results than CRUSH. They were performing the tests on CephFS and RBD.
I saw it, like, two days ago, so I still need to go over it more, but I kind of wanted to use this meeting to show it to you guys and give you time to go over it, if you want to, and tell me what you think about it.
D: I hadn't seen this before myself, and it looks very interesting. I'll have to read the paper to kind of understand what's going on here. If you've already looked at it a bit, is the main idea that they've reduced the amount of data being migrated, or that they're spreading the migration over a longer period of time, or...?
C: Yes, so the idea is they are reducing the amount of data migrated, because they are adding another virtual sort of layer under the root. You can see it there; there are multiple figures in the paper. So they're adding a virtual layer, and they're looking at the different shelves and cabinets as layers.
So new data only goes to the new layer, and then they discuss how they handle the load balancing, because there's sort of a trade-off there, and they present three methods, namely PG remapping, cluster shrinking, and layer merging. So yeah.
D: So new data goes there rather than old data, to avoid moving too much around.
C: Yeah, exactly. So they are using, I have it here, one second... basically timing: they're adding, like, timestamps. They're calculating the time that the data is coming in, so that they can distinguish between old data and new data when you're doing the expansion.
A: So new data is always landing on, like, OSDs that they won't have to go back to and do future migration for? Or how is it... I guess I'm still trying to understand what the bounds on data migration are, like, for new data and old data.
C: Yeah, so... I didn't go into the specifics of their exact calculation of what they determine to be old data versus new data. I kind of went over the main idea and the results.
So the main idea was, like I said, doing those timestamps and adding a virtual layer under the root, kind of grouping all the cabinets and shelves into a virtual layer, so that when you expand, for example you add another shelf, it only affects that virtual layer and doesn't affect the weights of the other layers. So then CRUSH doesn't redo the load balancing or the remapping.
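To make the mechanism being described concrete, here is a rough sketch of the idea; this is an illustration only, not the paper's actual MAPX algorithm, and all names and numbers are made up:

```cpp
#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

// Each expansion contributes a "layer" with a creation timestamp.
struct Layer {
  uint64_t created_at;  // when this layer (expansion) was added
  int      num_osds;    // OSDs contributed by this expansion
};

// Route an object to the newest layer that already existed when the
// object was created, so existing data never moves on expansion.
int pick_layer(const std::vector<Layer>& layers, uint64_t obj_ctime) {
  int chosen = 0;
  for (int i = 0; i < (int)layers.size(); ++i)
    if (layers[i].created_at <= obj_ctime) chosen = i;
  return chosen;
}

// Stand-in for CRUSH within one layer: a stable hash of (object, layer).
int pick_osd(const Layer& l, uint64_t obj_id, int layer_idx) {
  uint64_t key = obj_id ^ ((uint64_t)layer_idx << 32);
  return (int)(std::hash<uint64_t>{}(key) % (uint64_t)l.num_osds);
}

int main() {
  // Initial cluster plus one expansion (timestamps in arbitrary units).
  std::vector<Layer> layers = {{100, 80}, {200, 80}};
  for (uint64_t obj = 0; obj < 4; ++obj) {
    uint64_t ctime = 150 + obj * 50;  // created before/after the expansion
    int layer = pick_layer(layers, ctime);
    std::printf("obj %llu (ctime %llu) -> layer %d, osd %d\n",
                (unsigned long long)obj, (unsigned long long)ctime,
                layer, pick_osd(layers[layer], obj, layer));
  }
  return 0;
}
```

The trade-off this sketch makes obvious is the one the paper's three methods (PG remapping, cluster shrinking, layer merging) exist to address: if writes always land in the newest layer, the layers fill unevenly and need occasional rebalancing.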
C: Yeah, yeah, there's definitely a lot to read. Like I said, I myself came across it just two days ago, so I thought this would be a good forum to bring it up and, if everybody's interested, give them time to go over it. I'll definitely look further into it as well before next week's meeting, and then next week we can all have a better idea of the entire paper and we can discuss it further.
D: When we describe things, I mean, it's certainly controlled in some sense, like there's a bounded amount of data being moved, and there is control over how fast it moves, outside of CRUSH itself.
It sounds like they're kind of describing it as uncontrolled in the sense that it's not directly controlled just by CRUSH.
C: Yeah, they're saying that CRUSH is basically using a decentralized placement method, and they are adding, like, another centralized authority once cluster expansions are happening. And they are showing that once you add that centralized authority, you essentially don't hurt the load balancing, so that's good, but at the same time you're also getting better performance, because data is not migrating between the different layers.
D: Yeah, I think I'll have to read more to understand the trade-offs involved, like whether it has potential implications for the failure domains, or kind of the balancing of IO load as well.

C: Yeah, definitely; that's the reason why I thought I'd bring it up here.
A: And when you were reading through this, did it sound like... what happens to new objects while you're in a degraded or recovery scenario? What is the effect on new objects when they come in, with this scheme versus traditional CRUSH?
C: Okay, but I think that in section 3.2, where they are talking about migration control and introducing the three methods they use to address the potential load imbalance between the layers, the answer to your question might reside there, I'm pretty sure. I just didn't have enough time to really go into detail.
But yeah, I thought the results were pretty surprising. That's why I thought it would be pretty interesting to have a look at it, because it seems like the results are significantly better in terms of the IOPS they are getting, and the latency.
Yeah, yeah, in section four you have the evaluation, and they describe both the machines they are using and the cluster, the storage devices, stuff like that, and they give the Ceph version, which is Luminous.
C: Yeah, the thing is that, the way they described it, it seems like implementing their algorithm was pretty easy. So in the case that, you know, they didn't run it on SSDs, maybe we can test it ourselves and see how it performs.
D: The thing is just kind of the impact they're trying to reduce on in-flight client IO; it overlaps somewhat with the quality-of-service work, where we're balancing background IO versus client IO as well, and we're wondering how much of the kind of impact of this would already be captured by better quality of service.
C: Yeah, I get what you're saying. That's why I think it's worth going into more detail: read through it again, and maybe even try to recreate their tests on different hardware and kind of see the results. I don't think it should be that complicated, because they didn't completely change CRUSH; they sort of just extended it.
Yeah, me personally, no, not really.
C: Okay, so I guess we should discuss this further, like, next week, or, I don't know, in two weeks, once everybody in the call has had some time to go over it if they want to. Then we will all have a better sense of exactly what is going on in the paper, and we can see if there are any potential issues there.
D: Yeah, sounds great, thank you. I certainly will take a look and try to get a better understanding of the trade-offs involved with data placement like this; there's always trade-offs.
A: Sounds good. So I guess one thing here, right, would be that they talk about performance, but really what we want to know, at least what I want to know, is: with this migration-free expansion, and avoiding migration in general, what's the reduced amount of data movement? Because performance I don't really care about; I mean, that's kind of arbitrary. But what's the real effect on data movement in the cluster? That's what I really want.
C: I see here a figure talking about the number of affected PGs, yeah, the number of PGs in layer merging, after four expansions of the cluster.
I don't see anything here specific to the percentage of data migrated, but they might have just, you know, written that down and not added, like, a graph or something; it might be discussed in one of the sections.
A: It just really surprises me that they don't have a figure showing the reduced data migration, given that that's, as far as I understand it, the whole fundamental benefit of this. I don't care about latency that much, in the, you know, grand scheme of things, for this specific thing.
C: My impression from the beginning of the paper was that they are more interested in the effect that reduced data migration would have on, you know, stuff like IOPS and latency. So I'm assuming that is why the figures they supplied were mainly discussing those sorts of things: IOPS, latency, and all the rest of the things we have here.
A: Yeah, I mean, that's all kind of, you know, nice, but that, you know, comes as a result, hopefully, of the reduced data migration, and then...
A: Yeah, the big reason why I like that, potentially even more than IOPS and throughput and all these other benefits like latency, is that especially on NVMe drives you have a limited amount of write endurance, and so if, during healing events, you can reduce that, you can make your NVMe drives live longer.
C: Yeah, exactly. I guess... it's hard for me to believe that they didn't supply even, like, some sort of number for the reduced migration. You know, maybe they didn't dedicate a full figure to it, but I do believe that they mentioned the reduced migration somewhere; but that is just me hoping. I'll have to search for that.
C: Yeah, if you take a look at figure number two, before they even discuss MAPX, I think it shows the data migration in CRUSH. Yeah, figure number two.
C: Yeah, they discuss it in the paper. I think I know what you're talking about; there's a flag, or there's a variable called... oh, it's the max... osd_max_backfills, I think. I'm trying to grab that; I'm pretty sure they...
They mentioned that Ceph has some sort of variable limiting the migration happening, and they do address that in their experiments, actually, and they show that even with that sort of optimization, their algorithm still outperforms, once again in terms of IOPS and latency.
D: Yeah, that max backfill tells us about the rate of recovery, not about the amount of data moved. I think, Mark, what you're talking about may be more related to kind of the modular stuff in CRUSH, to try to limit migration to, like, half a subtree.
A: Yeah, yeah, exactly. It's been a while, a long time, since I really looked at this, but when I was trying to make, like, a Halton-sequence-based distribution algorithm, which ultimately just didn't work, one of the big problems with it was the amount of data migration.
D: It would have more of an impact on a smaller cluster, for sure, just because you are usually changing a larger percentage of the cluster at a time.
C: I'm pretty sure; let me take a look. I think it's at the end of the paper. Yeah, yeah, there is figure number eight, on page seven. It talks about the different experiments and then...
A: Good. Well, what I really want is that result from figure two, adding one rack, 80 OSDs to 240 OSDs: I want to see the CRUSH result and then I want to see the MAPX result. That's basically what I want. Yeah, okay, so I'm trying to figure out what exactly that means in figure eight now.
C: You're going to have to read through where they discuss layer merging, in section three, 3.2 to be exact, to really understand, I guess, what they're comparing there.
A: I'm open; what do you think? Would you be interested in that? Would it be useful for you?
C: Yeah, it would be really useful, because the thing is, they were actually supposed to present their work at the conference too, but they couldn't make it because of coronavirus. So it was just somebody else reading their slides, and you couldn't really understand anything, because he was just reading the slides. So I think they would also be interested in presenting it; I know I definitely would. So I can contact them, or Josh can.
A: Yeah, it doesn't matter to me either, whichever you guys want to do. Just let me know and we'll set up a time slot for it on one of the performance meetings.
C: Okay, sounds good, and then between us we can talk about it more next week, I think.
A: All right, we've got a lot of latecomers from the core standup here; any thoughts or questions, folks?
C: The link I just attached has a link to the paper itself and also to the presentation video from the conference, if anyone is interested. But once again, like I said, the person presenting is just reading the slides, because he's not one of the people who worked on the paper, since they couldn't make it to the conference. But if you want, the slides are there.
A: Anything else, guys? If not, then have a good week, and I'm looking forward to discussing this more next week. Have a good one; have a good week.