From YouTube: Ceph Performance Meeting 2022-09-22
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A: Ronan, the deep scrub thing that you mentioned in standup: is that discussion happening on the mailing list, or was that private emails?
A: I have to apologize this morning. I was supposed to be in a two-hour meeting that got canceled at the last minute, but I've been busy with some actually exciting things that we'll talk about today with Adam, so I didn't go through PRs at all and I don't have any updates on that. For the last couple of months I've been kind of slacking on going through PRs, so I apologize for that.
A: No, I haven't updated the list, so I'll try to be better about that next week. Hopefully the trade-off is worth it: we've got some exciting things to talk about today, I think. Anyway, it looks like we've got people from Cornell, so this is good. All right, since I don't have any PRs to talk about, I'm going to dive right in.
A: We had a really good meeting last week where we talked about all of these issues surrounding snapshots: ideas to try to make them faster in the OSD, and, specifically for RBD mirror, ideas to maybe make RBD mirror faster without relying on snapshots. As a quick recap of last week, all of us had different ideas about things we could try to make this better.
A: And this week we started working on them.
A: I ended up working on defragmenting objects prior to snapshot, with the hope that I could periodically reduce the number of extents and the number of shared blobs by doing a copy, and that ended up being kind of interesting. I'm going to talk about it before we get to Adam's stuff, because Adam's approach ends up being so amazingly good that if I presented mine afterwards it would just seem really unimpressive. So I'll start out, and then Adam, I'll hand it over to you after that.
A: Okay, I'm going to share my screen here. If you can see this, the gist of it is what happened when I started implementing the defragmentation. This is a really stupid and simple implementation: if we see that the number of extents has exceeded some threshold, then we read the object and rewrite it out, basically, in a defragmented way.
A: That's all it's doing. The DT value that you see in this graph is the threshold, and DJ is jitter. Basically, I added some random, Monte Carlo-like aspect to this, so that we don't get a stampede of object rewrites all at once; hopefully it spreads them out a little bit more. You can see in this first graph that it is effectively reducing the CPU usage.
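As a rough illustration of the threshold-plus-jitter decision described here (a sketch only: the names `threshold` and `jitter` mirror the DT and DJ labels on the graph, and the real BlueStore code differs in detail):

```python
import random

def should_defragment(num_extents, threshold, jitter, rng=random.random):
    """Return True if the object should be read back and rewritten
    contiguously (defragmented) before the next snapshot.

    `jitter` in [0, 1) randomizes the effective threshold, a Monte
    Carlo-style spread, so that many objects crossing the threshold at
    the same time do not all get rewritten at once.
    """
    if num_extents <= threshold:
        return False
    # Scale the threshold up by a random fraction of `jitter`, so only
    # some of the eligible objects are rewritten on any given pass.
    effective = threshold * (1.0 + jitter * rng())
    return num_extents > effective
```

With jitter at zero every object past the threshold is rewritten immediately; raising it trades a slightly higher extent count for a smoother rewrite load.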
A: The baseline in this test is maybe around one core, and we're dropping from about two and a half cores down to closer to one and a half cores, so it definitely helps the CPU. One of the big concerns with an approach like this is that when we copy the object and sever it from the snapshots, we use more space on the disk; with a very low threshold, it's quite a bit more.
A: In these tests we have a five-gigabyte volume and up to five snapshots; once we hit steady state it's always five. So if you were just copying whole objects, it'd be 30 gigabytes of disk usage.
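The arithmetic behind that worst case, assuming every snapshot severed and fully copied the volume:

```python
# Head image plus five snapshots at steady state, each a full copy
# of the five-gigabyte volume.
volume_gb = 5
snapshots = 5
worst_case_gb = volume_gb * (1 + snapshots)
print(worst_case_gb)  # 30
```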
A: There's a little bit of fluctuation as we go along here, but it tends to even out over time, and we're a little bit higher. It's not too bad, though, so that's not really a problem; maybe a little bit, but it's not a major space amplification. The bigger concern is write amplification: when we copy objects, we end up invoking a pretty big workload on the disks.
A: When we add jitter in, we can smooth that out somewhat. On this yellow line you can see that when we take a snapshot, at least temporarily, we're using about 100 megabytes per second of disk IO to copy objects. Not ideal, but if we were able to further smooth that out over time, we could probably drop it down by making it more of a constant background workload. There's some possibility there, so this does help.
A: If we look at the client-level IOPS, both read and write, you can see that the blue is our default case, and in the red, yellow, and green cases we don't see as many stalls. This is on NVMe; on HDD I suspect all of this looks a little bit different, but at least as far as the code goes, this does appear to be helping. There's some value to it.
A: I think if we want to continue pursuing something like this, it'd be better to transition it to some kind of background process where we very slowly go over and look at onodes that are heavily fragmented and see if we can make them better. That might still provide some advantage even with Adam's code, but it would need to be a very slow process, I think, much slower than even this 0.8 case I've got here.
A: We'd want it to have as minimal a disk impact as possible. So that's what I did over the last week.
B: The idea was that instead of having each shared blob track its own allocations together with its other users, which is what requires converting a standard blob into a shared blob, we would just try it: how well would it work if we had, for each object, one tracker that tracks the data for everything?
B: One tracker for all the allocations of that object and of all the clones and snapshots of that object. So in that sense, after we start making snapshots of an object, we just have a class of objects joined by a common tracker.
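A minimal sketch of that common-tracker idea, purely for illustration (the class and method names here are hypothetical; the actual change is C++ inside BlueStore and differs in detail):

```python
from collections import defaultdict

class FamilyTracker:
    """One tracker joining an object and all clones/snapshots made from
    it, replacing per-blob shared-blob reference counting."""

    def __init__(self):
        # extent key (e.g. disk offset, length) -> set of member names
        self.refs = defaultdict(set)

    def reference(self, member, extent):
        """Record that `member` (head or a clone) uses `extent`."""
        self.refs[extent].add(member)

    def clone(self, src, dst):
        """A new clone shares every extent of its source; no per-blob
        conversion to a 'shared blob' is needed."""
        for members in self.refs.values():
            if src in members:
                members.add(dst)

    def release(self, member, extent):
        """Drop a reference; returns True when the extent's last
        reference is gone and the space can be freed on disk."""
        members = self.refs.get(extent)
        if members is None:
            return False
        members.discard(member)
        if not members:
            del self.refs[extent]
            return True
        return False
```

The point of the design is that all bookkeeping for an object family lives in one place, so cloning and releasing become operations on a single structure instead of touching many shared blobs.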
B: The expectation was that it would be somewhat good and would require some further improvements, but the concept was that we could move the load from shared blobs to some CPU-intensive actions that would be contained in just one place, which we could then improve. But it turned out even better than expected.
B: In the meantime, my branch seems to work. The only problem now is that fsck is basically asserting; the logic is still okay, but the tests are getting wild because they do find unexpected results. So I cannot do full, proper, safe object tests, but from what I can ascertain, the snapshot logic, tracking properly and keeping different data in different snapshots, is working properly. So now there's only the matter of Mark presenting the comparable results with his setup.
A: So the CPU usage is really, really low, much lower than it was in my PR.
A: In fact, it's kind of amazing. I don't think either Adam or I could believe just how good these results are. It appears that basically all of the usage we're seeing in the default case and in main is primarily due to shared blob reads, which is pretty crazy. There's no space amplification whatsoever, and as expected for write amplification, we don't see any additional write load on the disk.
A: Interestingly, though, that doesn't appear to be correlated with the client write drops. Maybe a little bit: we kind of see something here, at roughly the 8000-second mark, that could be correlated with the block device write drops we saw. But we don't see it in the red case with Adam's shared tracker, where there was a drop in the block device write throughput yet no correlated client write drop.
A: So it's still a little bit of a mystery what's going on there, but the good news is that, similar to my case, Adam's case seems to be doing a very good job of reducing client write IO throughput drops. I will have to see what it looks like in the HDD case, but I don't think it's going to make anything worse.
A: I suspect it will only make things better, so it'll be very interesting to see what Paul's tests show. On the read side, Adam and I were puzzling over this this morning: for some reason, in the default case, we're exceeding the fio cap on read IOPS. Actually, we do that on the write side too. It should be capped at 500, and yet we see it bounce above that, and the same in the read case with Adam's change.
A: It still goes above the cap, but it's bounded at some lower limit, which neither of us quite understands. Maybe it's some artifact of how fio does its capping, but in any event, again we see that in the default case we drop below the cap more often than with Adam's changes. So it all looks very, very good.
A: I'm still amazed by how much of an improvement Adam's changes are making here, and they also make some of these data structures simpler by getting rid of shared blobs, which I'm just incredibly excited about. So yeah, that was my read of some of the data that we got. Adam, does that sound right to you as well?
B: Exactly the same. I can confirm that the quality of the results was so extremely good that for some time we really thought there might be some error. Hence you can see two sets of results: one is the original branch, and then there is a small fix, but that didn't actually change much. So for now I'm fairly positive that it might be a real output, and that we can have such performance on a long-running cluster.
A: All right, that's all I've got, I think. Adam, anything else?
B: No, I can just promise that I will continue. The first step is to bring fsck in sync with this code so I can properly run all the tests, and then to have some roadmap to turn it into a compatible extension of the current format, because in today's state it's a bit of a hacky implementation; with a little effort, though, it can just extend how we organize our objects.
B: I think the easiest solution will be to keep both ways of handling shared blobs for some time, possibly still using the old, complicated approach when we have compressed objects. I'm not even sure now that that will ultimately be the case, but for now I would love to preserve that compatibility.
A: Cool. So, any questions?
A: All right, well then, I'll open it up.
C: Hey Mark, not so much a discussion, but I just wanted to let you know that there's a PR for rgw's HTTP/3 frontend now. It's not working yet, but the prototype is in progress.
A: What kind of difference do you think it will make?
A: How are we doing now in terms of CPU usage in rgw generally?
C
Cpu
usage
generally
I,
don't
know
that
I
have
a
handle,
but
I
mean
we've.
We
did
a
lot
of
work
on
the
the
Beast
front
end
and
it's
CPU
usage.
A
I,
don't
know
if
you
can
still
hear
me
or
close
the
other
browser
window.
Oh
good,
okay
and
I.
Don't
have
an
echo
anymore,
which
is
fantastic,
I
know
a
while
back.
A
You
know
we
had
those
really
big
CPU
usage
numbers.
That
I
think
you
guys
improved
quite
a
bit,
but
that
seems
like
it
might
be
the
yeah
every
every
time
I
hear
feedback
from
kind
of
people
trying
to
do
cloud
deployments
and
things
they.
A
They
always
want
the
CPU
Stitch
of
our
demons
to
be
lower
and
usually
the
simplest
decide,
I
hear
about
it,
a
lot
but
I
suspect,
especially
if,
if
they
want
to
scale
out
rgw,
you
know
anything
we
can
do
there
to
to
reduce
overhead
CPU
overhead
is
probably
a
win.
A
But
anyway
sounds
sounds
like
it.
It
has
potential
at
some
point,
I'd
like
to
get
back
to
actually
doing
a
profile
on
rgw
and
just
see
kind
of
where
we're
at
these
days
under,
like
a
really
heavy
ball
object.
Workload.
A
All
right,
I,
don't
hear
anything
anymore,
I,
don't
know
if
that
means
there's
no
talking
or
if
it's
just
I
can't
hear
anything,
but.
A: Okay, well, if no one has anything else, then I suppose we can wrap up. Adam, I just want to say again how impressed I am with the work that you did. It's really, really impressive how much this helps. So, congratulations. I don't know if you hit a home run or won the lottery or...
A: I was going to say, I don't feel like the lottery captures it, though, because it was really a result of your hard work, not just random chance. So, good job. It looks like your instinct was right.
A
I'm
I'm
very
excited
to
to
get
your
change
to
to
Paul,
to
try
out,
because
I
I
suspect
it's
going
to
be
a
big
win.
Hopefully,
hopefully,
I
I'm
still
a
little
nervous.
What
the
just
the
general
process
of
snapshotting
and
fragmentation
might
look
like
on
hard
drives
with
this
workload,
but,
to
some
extent
we're
going
to
hit
you
know
the
limits
of
what
you
can.
You
can
do
right,
but
what
we
can
solve
ourselves
here
with
this
approach.
A: Well, thank you all for coming. We'll talk again next week. Have a great week, everybody.