From YouTube: Ceph Performance Meeting 2023-08-17
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute
What is Ceph: https://ceph.io/en/discover/
A: Okay, so let's see. The first updated PR that I saw is "promote object when reading newly written object at read-only cache." I have not looked closely at this one, and I don't think it's actually gotten a review yet from anyone; it's just assigned to core right now. This is really small; it's just in the PrimaryLogPG code.
A: Well, this is a small change, so one of us should review it at some point. There's no real description other than this little one. Anyway, there's that PR, and then there's also an updated PR, though not much in terms of updates: this is Igor's PR from a while back about not resetting the prefetched buffer.
A: This is the BlueFS prefetching thing. It was marked stale; Igor marked it unstale, and he's hoping that Adam can do a review on this one. I believe that if this works, it might improve our whole situation around needing to use buffered I/O for BlueFS. I don't know if Igor ever tested that or not, but one of our big issues right now is with how prefetching works for reads that we would expect to be served from the RocksDB block cache but are not. So anyway, if we can get that one under a microscope a little bit more and see how much it is helping, that would be good.
A: All right. For this week I don't have a whole lot to talk about; the PR that I'm working on to improve erasure coding has not seen any updates.
A: I've been kind of holding down the fort at Clyso while Dan has been on vacation, so I've been involved in a lot of other random stuff, but I'm hoping to get back to it; I wanted to this week, maybe next week. One of the things I do want to do, once we're satisfied with that PR, is go back and revisit the idea of a data cache for EC shards on the primary. That would allow us to avoid doing a read-modify-write over the network.
A: Something we should also be looking at closely, and probably more important than having a local buffer cache inside the OSD for erasure coding, is actually caching the remote shards. We'll see, but I think that's also something we should be looking at closely. Other than that, I don't really have a whole lot, so I'll open it up. Gabby, I know you've got some interesting stuff you've been working on, if you have any interest in talking about that.
B: Yeah, so I've got this thing about snaptrim, my trimming PR. The fix itself is very small, but the testing for it is taking forever to complete. It's supposed to be working fine, and so...
B: Let's hope for better sound. So first, the problem I still don't know how to explain: my tests generate 100,000 objects, create a snap, and then overwrite those hundred thousand objects, which creates a hundred thousand clones and then two hundred thousand snap objects. So all together three hundred thousand, and when we trim we end up with three hundred thousand tombstones. The first trim takes some time, and then it seems to keep growing more and more; my explanation for this is the accumulation of tombstones. Then eventually things go back to normal. So we start with, I think, something like 40 minutes of trimming, and after a few iterations we go up to 37 minutes for the same amount of work, but probably with a lot of tombstones created.
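A quick tally of the workload as Gabby describes it (this is just an illustration of the numbers quoted above, not the actual test code):

```python
# Accounting for the snaptrim test workload described above.
objects = 100_000

clones = objects            # overwriting each snapshotted object creates a clone
snap_objects = 2 * objects  # two snap entries per object, per the discussion

to_trim = clones + snap_objects
# Trimming all of these leaves 300,000 tombstones behind.
assert to_trim == 300_000
```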
B: What I didn't keep a record of and didn't pay attention to, and maybe should next time, is that some of the tests were executed one after another, meaning I finish one test, record the numbers, and immediately start the next test, while some tests had a very long gap between them; like, I ran some tests in the night and then continued in the morning. So I'm speculating that...
B: Maybe there is auto-compaction happening, but it only happens when we have enough idle cycles, and that happens when I take breaks between tests. When the tests are executed one after another, we are always very busy, because when we trim the system is about 200 percent busy, maybe 300, so maybe compaction is not very effective at that point.
B: So that's that. With my code, the times are much better; it mostly takes something like five to eight minutes, and even for the spikes, I think the worst spike was like 15 minutes. So the worst spikes were only a little bit more than the best case in the base code, and I had far fewer spikes.
B: I think that could be attributed to the fact that in the previous code we spend a disproportionate amount of work on the last entries: for the first twelve and a half percent of the objects we pay one call for every trim, for the next twelve and a half percent we pay two calls, and so on, until in the end we pay eight calls for the last eighth, the last 12.5 percent, of the system.
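That cost pattern (the i-th eighth of the objects paying i calls each) implies the tail dominates the runtime; a quick model of the arithmetic, assuming the 100,000-object workload from the tests:

```python
# Model of the trim-cost pattern described above: the i-th eighth of
# the objects pays i calls per trim (1 call for the first 12.5%,
# 2 for the next 12.5%, ..., 8 for the last 12.5%).
n_objects = 100_000
eighth = n_objects // 8

calls_per_eighth = [i * eighth for i in range(1, 9)]
total_calls = sum(calls_per_eighth)
assert total_calls == int(4.5 * n_objects)  # 4.5 calls per object on average

# The last 12.5% of the objects accounts for ~22% of all the calls,
# which is why the end of each trim pass is so much slower.
last_eighth_share = calls_per_eighth[-1] / total_calls
assert round(last_eighth_share, 3) == 0.222
```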
B: And another thing... no, actually, sorry, that was with my code. With my changes, the tests were executed one after another in quick succession, so it seems they are not affected by taking breaks or not taking breaks. So that's it.
A: So, Gabby, I'm still super curious whether or not tweaking the compaction-on-deletion settings to be more aggressive and trigger more compactions sooner would help or not.
A: All it does, basically, is that when you're iterating through keys, by default right now, I think, if you see 8,000 tombstones over a 16,000-key window, then it will trigger a compaction.
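The trigger described here resembles RocksDB's compaction-on-deletion table-property collector. A minimal sketch of the sliding-window behavior, in Python rather than Ceph's C++; the 8,000/16,000 numbers are the defaults quoted in the discussion, and `needs_compaction` is a hypothetical model of the mechanism, not a real API:

```python
from collections import deque

def needs_compaction(entries, window_size=16_000, deletion_trigger=8_000):
    """Sliding-window deletion counter modeled on the behavior described
    above: if `deletion_trigger` tombstones appear within any window of
    `window_size` consecutive keys, a compaction is triggered.
    `entries` is an iterable of booleans: True = tombstone, False = live key."""
    window = deque(maxlen=window_size)
    deletions = 0
    for is_delete in entries:
        if len(window) == window_size and window[0]:
            deletions -= 1  # oldest entry is about to slide out of the window
        window.append(is_delete)
        if is_delete:
            deletions += 1
        if deletions >= deletion_trigger:
            return True
    return False

# 300,000 back-to-back tombstones trip the trigger almost immediately,
# while the same tombstones diluted one-in-three among live keys never
# reach 8,000 within any 16,000-key window.
assert needs_compaction([True] * 300_000)
assert not needs_compaction([i % 3 == 0 for i in range(300_000)])
```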
A: Do you know, when you're doing your trimming and creating tombstones, is there iteration happening concurrently, or does it all happen in like one go?
B: What do you mean? I didn't get the question.
A: You said that you're creating like 300,000 tombstones, right?
B: Yes.
A: Is that all happening in like one loop, or is there other stuff happening in between?
A
Do
you
iterate
to
find
those
the
next
two
objects
in
between
yeah
exactly
exactly
so.
What
should
happen,
then
right
is
that
you,
you
do
all
this
iteration
as
you're
creating
tombstones,
and
it's
only
until
you
get
to
the
point
where
you
have
like
16
000
tombstones
with
the
current
settings
where
you,
then
you
trigger
the
compaction,
but
you've
done
a
lot
of
iteration
up
until
that
point
over
tombstones
to
get
there.
So
it
might
be
that
by
shrinking
those
those
settings
down
to
trigger
compactions
faster,
maybe
it's
better.
B: I don't think that's the case here, because what happens is, as I mentioned, it goes a bit like... I don't know the word for this in English... the first eighth of the system is paying one access, then the second eighth is paying two accesses, and so on.
B: If we were compacting every 16,000, that should be happening a long time before.
B: No, no, no, you can really see when it's taking a long time. It's usually in the last twelve and a half percent; that's just where we spend a disproportionate amount of time, because you'll see at the beginning that it takes very few minutes to finish half of the system, and then we just slow down.
A: You could try changing it the other way: make it so it takes longer to trigger compaction with those settings. I was kind of wondering, though, if it's possible... well, you can find out. You can look at the logs; there's a tool I've got that you can use, and it'll tell you about the compaction events that are happening. But I was kind of wondering if it's possible that, because this is an asynchronous compaction, there's something preventing RocksDB from doing it right away.
A: Yes, well, good. I think it will actually be a really good improvement. We definitely see that this class of problems is one that hits users a lot, I mean even to the point of heartbeat timeouts.
B: Even with the fix before your fix... I remember testing this before, I don't remember exactly what it was, but it was sometime in Quincy, and we just saw the times keep growing; after a few iterations we got to 100... sorry, to one hour, and then it just kept on growing.
B: So what customers have in the field now: they would see this thing just growing and growing and getting crazy, and all the time it's at 200 percent CPU, and if they have more snaps being trimmed, then this just keeps accumulating, which means the system never has any time to do anything else. So I think your change is probably the most critical, because it gives the system some break.
B: You need to get to like five minutes. I think with my code it's now on average seven minutes for 100,000 objects, which, again, is a very large amount of objects; I don't think it's normal to see that amount every 15 minutes. But at least it means that even if they generate 100,000 objects every 15 minutes, we're still going to be able to complete them in half the time, and then the system could do something else.
A: Well, thank you for coming. Have a great week, everybody, and see you next week. Thanks, bye.