From YouTube: Ceph Performance Meeting 2022-10-20
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A
I only have two new PRs this week that I saw. The first one is actually kind of exciting, because I think Adam was trying to start looking into this himself, but we've got a contributor from Canonical that found a race condition in the onode cache, Igor. It's related to all the same stuff that keeps harassing us over and over the years.
A
Well, maybe deceptively simple, but not a lot of code. So yeah, I added you as a reviewer. And yeah, if you have another one — is it a PR, or is it just a branch right now, your version?
A
Okay, now I've got to get back to my — hey, the YouTube chat window — okay. So there's that PR, and then we've got a new PR from Adam about improving the deferred write decision making, and Igor, I was hoping I might get you to just quickly look at that too. I reviewed it, and, Adam, those changes make sense to me — I like what he did.
A
The only thing I'm worried about is that you had added some code, maybe a year ago, regarding when you've kind of got multiple chunks that you're making decisions on, and there's this has-chunk-to-defer logic which, I confess, I don't totally understand what your current code does. So I was a little afraid of removing it with what Adam's doing now.
A
If you have time, maybe take a glance over it and see if you agree with Adam's solution here.
A
Sure. Sorry — I know I'm reaching out a lot. What it's doing, I think, makes sense to me; at least it seemed reasonable. I don't think Adam's here — yeah — he's making the kind of conscious decision to just treat everything that the allocator gives us as a contiguous region, which, you know, when the disk is full or you get lots of little fragments, maybe isn't a good assumption.
A
I don't know, but in the general case what he's doing seemed to be reasonable.
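The decision being discussed — treating whatever the allocator returns as one contiguous region and making a single defer-or-not choice — can be sketched roughly like this. This is a hypothetical simplification, not BlueStore's actual code; `prefer_deferred_size` stands in for the real per-chunk decision logic:

```python
def should_defer(extents, prefer_deferred_size):
    """Hypothetical sketch: instead of deciding chunk-by-chunk, treat
    everything the allocator handed back as one contiguous region and
    make a single deferred-write decision on the total length."""
    total = sum(length for _, length in extents)
    return total < prefer_deferred_size

# When the disk is full or fragmented, the "region" may really be many
# small pieces, which is where this assumption could break down.
small = should_defer([(0, 4096)], 65536)                      # defer
large = should_defer([(0, 65536), (131072, 65536)], 65536)    # write directly
```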
A
So yeah, if you get a chance — I think Yuri actually put it in testing already, so I'm not sure if the plan is to actually merge that right away.
C
Or not — that's just to get a head start on the testing, Mark. We still want an okay; we're talking about a plus-one for it if there's nothing obviously big. The reason being that I think the next Pacific point release relies on this, so we won't be doing the release until we have a fix for this, and there's a lot of interest.
A
Okay, well, anyway — there's that one new as well, and that was all I saw. I didn't see anything performance-related really closed or updated this week other than those two new ones. Was there anything I missed from anybody?
A
All right. I did not quite make it through the list of old PRs this morning — I got most of the way through, though. So there may be a couple things that got closed out here, but I'm guessing it's probably about the same as it was last week for discussion topics.
A
I didn't have anything listed here, but there is one thing I wanted to mention: we've got a user that is trying to use secure mode with cephx, and they saw significantly lower client performance than I saw doing the same RBD workload — small random reads of like 16K — without using secure mode.
A
I think what we're probably seeing is that that's a fairly significant source of overhead, especially now that Adam Emerson's boost::asio stuff merged — that gave us a really significant client-side performance improvement a couple of years ago — and it may be that cephx, and especially secure mode, is now a bigger bottleneck than it used to be. So, not 100% sure on this right now; it's just a hypothesis, but it looks like it could be.
B
So I saw your comments in the PR which uses recovery.
A
It was brought to my attention because he was expressing some concerns about it, and I guess the thing that I'm concerned about is — so the idea here, right, is that you've basically taken a snapshot of RocksDB, and if somehow your database gets corrupted or whatever, you basically just replace what's there with the old version and then use kind of our recovery mechanism to try to get things back into a stable state, right?
B
Instead of needing to perform a full clone of the broken OSD, you can recover from this database and then use deep scrubbing to fix just the broken onodes which have no matching checksums — not to mention that it might help in case you have multiple broken OSDs and you just don't have enough replicas to cover the PGs.
B
On top of this, I put a link in the original PR to one which actually reverts BlueFS to its initial state before doing the recovery, and while I've only played a bit with that, it looks like it's more stable and more reliable at bringing the OSD up.
B
But anyway, the approach doesn't guarantee that all user data are valid after the recovery. But it allows us to recover the OSD.
A
Yeah, I guess the thing that I'm most concerned about is just kind of the unknown, right? My first thought was: oh my gosh, how many corner cases could we hit where something's going to end up corrupted trying to do this approach? Maybe that's unfounded, I don't know — fear of the unknown, right — but that's honestly what I'm most concerned about. I don't have any specific examples.
A
Just, you know, a concern that you could end up with a really inconsistent state.
B
Sometimes — well, it's slower, and you might simply not have enough replicas if multiple OSDs are down.
A
So yeah, I guess — I don't know — Josh, Neha, do you have any feelings about this? If we were to approve the PRs and kind of make this available in some of the tooling that we have, how should we guide users on this? This still scares me, but maybe as a last-ditch thing it's worth having. I don't know.
C
Yeah, in general I think these kinds of tools can be pretty helpful, but they do need to have some big warnings around using them, because it's similar to the kvstore tool — like DB repair — where it has the potential to give you a kind of incomplete state that may not be so obvious from just the name of the command.
C
Yeah, my two cents: it's good to have such things in your back pocket when you have nothing else to rely on. We've been in situations where we were like: okay, if only we had something like this. But yeah, having that extra flag — you know, "there's no guarantee of full recovery" — is definitely something we should add.
B
…to perform recovery, but it's definitely not a 100% guarantee. And in that case, keeping a clone or a copy of the metadata on the main device — outside RocksDB, laid out differently — so maybe we should consider this as an option, maybe start working on some design for that.
A
Yeah, yeah — I wish we had done that from the get-go, just appended the metadata to, you know, a portion of the object or something. I'd push for that.
A
Yeah, I've always kind of thought of RocksDB — I mean, right now it's the authoritative source of metadata information, right, but you know, it's really only there for fast lookups.
A
Well, how nice would it be, then, for a variety of reasons? If you did something like that, people could replace their DB device easily with a new one; they could get rid of the DB and recreate it. That'd solve so many issues right now, beyond just people having a corrupt DB — say they decide they want to move it from a slow device to a faster device, or have fewer on one device, or something. It'd be nice.
B
Well, migration is available at the moment, so it's not a big deal. But recovery — that's the option that I'd highly appreciate.
A
I know there's a way you can do it, but I thought — at least it used to be — that moving the database was pretty... I don't know if "complicated" is the right word, but just a little bit of an intense process for users.
A
All right, well, cool. Yeah, Igor, if you're interested I'd be happy to talk to you more about that. I think it would be really, really nice if we had the ability to recreate the database from the block device — that'd be a killer feature.
A
But maybe for this meeting we can move on for now, I guess — unless people want to talk about that.
A
One thing actually not related to that topic that I did want to talk to you about, Igor: Adam and I have been talking a lot about deferred writes lately, and one of the things — maybe the heretical position I've been taking — is that we should just get rid of them entirely and move to a model where we write to a small portion of the flash device as a non-deferred write, and then migrate the data over to the slow device after the fact — not as part of the actual write process, but as kind of a later, deferred process.
A
I think Adam's not totally on board with it yet — he's still trying to think about ways to make deferred writes better — but I was hoping, since you're here, I could get your take on things. What do you think?
B
…as well. So at this point I dislike RocksDB quite significantly, so the less load we have there, the better, to me.
A
I started looking into the KernelDevice and the block device interface, and I'm wondering if there's some way I can basically create a new kind of abstracted block device implementation that takes two kernel devices and then makes decisions about where to write data. And then, from there, maybe behind it BlueStore could even kind of migrate data from one to the other — but you'd probably want some kind of hinting through that block device interface.
A
Regarding, you know, when you should defer — if it's something that's short-lived, these kinds of things.
B
I think — I'm afraid you might need some additional metadata to keep that stuff, which is not present at that level at the moment. I'm just speculating so far, but, well, abusing BlueFS for this purpose looks more attractive to me.
B
Well, at least this looks like maybe the primary option to try.
A
I was also thinking — well, one of the reasons I wanted to talk to you about it is that I was thinking about your write-ahead log work, and about how, once we've written data into the write-ahead log, I wonder if we could do that scheme we kind of talked about in the past, where maybe you can almost treat the part that you've written into the write-ahead log like an extent — you've already got the data there; it's already on the fast device.
A
Maybe you could actually treat that as kind of an intermediate copy of the data that you can read from, and then slowly, as you merge things over to the slow device, you can clean up the old logs behind you — otherwise, you just leave the existing logs in place until you're done with them.
A
It'd be really nice if, when you've already written to the fast device for the log and you've already got the data sitting there, you didn't have to rewrite it into the slow device — or into some kind of fast layer that then later gets moved into the slow device. It'd be nice if you could just leave it in place and treat that data as a new extent.
A
All right, well, that was all I wanted to bring up. There's just a lot of work, with different people looking at different things, but there's this kind of unifying desire, I think, to try to make all this simpler and make it make more sense. So, yeah.
B
Yeah, but as we discussed before, all these modifications look pretty dramatic. So at some point I'd like to fork the store implementation — keep the legacy one around and maybe go ahead with the new one — because making such changes to the existing code is dangerous.
A
It's probably a good point that at some point we have to decide what's reasonable to do in BlueStore and whether or not we should make, like, a BlueStore 2 or something.
A
At one point I was even thinking that in BlueStore it wouldn't be impossible to adopt a sharded write path, where we have multiple instances of RocksDB and specific shards handling specific PGs. It didn't look impossible to me, but that's almost recreating SeaStore and Crimson at that point — not entirely, but sort of. So I don't know.
A
Yeah, I had started working on a branch like that — sharding. I think there are a couple of little irritating areas where you'd have to change things to make it work, but it didn't actually seem as bad as I thought it would be at first. I don't know if that was your experience or not when you were looking at it.
A
I think beyond the KV sync thread, the way that the sharded op work queue works right now — and especially the way that we try to fill things in with the messenger threads and then let the worker threads go to sleep and wake back up based on the status of the queue — I've seen some evidence that this kind of model is not ideal.
A
That whole side of it too — if we can figure out a better way to make that whole pipeline work faster.
B
Well, maybe one more topic — I'm not sure if we have anything to discuss, but we might want to. Well, we definitely need some solution for the former.
A
There — I had a PR where I was starting to kind of talk about some of that stuff.
B
Well, it's often hard to say why RocksDB is unable to perform its regular routines, like automatic compaction. What I definitely know is that RocksDB might get into a state where it performs badly, and highly likely this is related to previous bulk removals and the manual compaction fixes it does — yeah, I saw that with backfilling.
A
There are two separate issues, right: one is fragmentation at the SST level, and the other is the tombstones in the memtables themselves.
A
The PR I linked in the chat window adds the ability to set a sliding window so that, when you're doing iteration, too many tombstones will trigger compaction.
A
So that helps in the case of the SST file tombstone behavior — if you have too many tombstones and you're iterating, it will force compaction — so that PR will help there. But it doesn't help with the memtable problem, if you have too many tombstones in the memtable.
B
Okay. Do we have any ideas how to at least indicate that a cluster is exposed to this issue — maybe something like how many tombstones we have, or whatever? Right now we even lack any diagnostic tools, so yeah — we can see that a cluster is in a bad shape, but in fact we are unable to…
B
…to even say if something is wrong with the cluster. I mean, we don't have any metrics.
A
Yeah, and I don't think RocksDB even gives you that for memtables — the memtable side and the SST side are really separate. I wanted to see if I could re-implement this capability that they have for the sliding window in RocksDB to do the same thing for memtable tombstones, and it's not the same — like, it would be…
A
You'd have to do it differently, I think. Maybe we can still do it there, or maybe there's some way we could expose more information from RocksDB — like whether or not you've created a tombstone in the memtable, so we can track it or something — but RocksDB doesn't seem to give you a lot for this kind of problem.
A
And if we've issued too many deletes before, say, a compaction has taken place — because I think we can ask RocksDB to do a callback when compaction happens — we could watch for compactions, and if we've done too many deletes before a compaction or a memtable flush (that's the other one, memtable flush), then we can say: oh, we might be in a bad state.
A
Do you have any tests that hit the problem right now? I'd love to know if this PR helps.
A
I could take the changes for the settings out, but some of those settings actually do seem to help a little bit with certain behaviors, so I'm tempted to leave them in. But I don't know — it's okay either way, really. We could split this into two separate PRs and try to test each individually.
A
So yeah, anyway — I'd be happy to get back into this again if we want to try to make it better.
B
But yeah, definitely I'd like to be able to reproduce this issue and be able to diagnose it.
A
I do suspect that this PR is good. I don't think it's bad — we're exposing functionality in RocksDB, and we don't need to change the defaults; we can remove that if we want to. I think it's helpful, but we don't need to do that — we could just expose these other options from RocksDB. I think it's a win; it's just that no one's approved it, because no one is able to test it right now.
A
If not, then thanks for coming, everyone, and have a great week.