From YouTube: Ceph RGW Refactoring Meeting 2023-01-04
Description
Join us every Wednesday for the Ceph RGW Refactoring meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contrib...
What is Ceph: https://ceph.io/en/discover/
A
I want to talk about the replication, our replication consistency guarantees. What I have summarized here is what I think is true, and Casey basically agrees with the most important point. We have this expectation that replication be reliable, and I think that essentially means transactional consistency, or at least apparent transactional consistency, which we currently seem to have.
A
We currently have a two-level system at the back of the bucket index: the bucket index operation and the ingest operation that triggers it are journaled, which is fully transactional, so that if for some reason we can't log the operation, you know, the operation to delete an object, we don't delete it. So that's fine, insofar as it goes.
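As a rough illustration of that ordering, here is a minimal C++ sketch (all names are hypothetical, not the actual RGW code path): the journal write has to succeed before the object operation is applied, so a failed journal write aborts the operation rather than losing it.

```cpp
// Minimal sketch (hypothetical names, not the actual RGW code path):
// the journal write must succeed before the object operation is applied,
// so a failed journal write aborts the operation instead of losing it.
#include <iostream>
#include <string>

struct BucketIndexJournal {
  // Stand-in for the journaled bucket index update; pretend it returns
  // false when the journal entry cannot be written.
  bool log_pending(const std::string& op, const std::string& obj) {
    std::cout << "journal: pending " << op << ' ' << obj << '\n';
    return true;
  }
  void complete(const std::string& op, const std::string& obj) {
    std::cout << "journal: complete " << op << ' ' << obj << '\n';
  }
};

// Journal first; if we can't log the delete, we don't perform it.
bool delete_object(BucketIndexJournal& j, const std::string& obj) {
  if (!j.log_pending("delete", obj))
    return false;  // journaling failed, so the delete is not applied
  // ... apply the actual object deletion here ...
  j.complete("delete", obj);
  return true;
}

int main() {
  BucketIndexJournal j;
  delete_object(j, "photos/cat.jpg");
}
```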
A
That's exactly what people expect from our replication consistency. However, full replication also relies on storing a data log entry at that point, which is best effort.
A
If that entry isn't updated, the change may or may not be replicated. There's a potential problem. It may or may not surface, but there is the potential at that point for a stall in replication, because even though the operation we're replicating has been reliably journaled, we rely on the data log entry to point to the shard that needs to be checked for sync.
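To make the contrast concrete, a small sketch (invented names, not the real code path) of what "best effort" means here: a failed data log append is only logged, the client operation still succeeds, and nothing flags the shard for sync.

```cpp
// Sketch of "best effort" (invented names): a failed data log append is
// only logged; the client operation still succeeds, and nothing tells
// peers that this shard changed, so its replication can stall.
#include <iostream>
#include <string>

// Stand-in for appending to the data log shard that replication polls.
bool append_datalog_entry(int shard, const std::string& bucket) {
  (void)shard;
  (void)bucket;
  return false;  // pretend the write failed (crash, cluster hiccup, bug)
}

void on_object_written(int shard, const std::string& bucket) {
  // The bucket index update before this point was journaled and reliable.
  if (!append_datalog_entry(shard, bucket)) {
    // Best effort: the failure is swallowed. Peers watching this shard
    // never learn it changed, so the bucket's replication can stall.
    std::cerr << "datalog append failed for shard " << shard << '\n';
  }
}

int main() { on_object_written(11, "my-bucket"); }
```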
A
And that, as things are currently set up, is basically irreducible.
A
In some situations it could be due to a bug, but it doesn't have to be; it could be due to some sort of cluster or environmental error, cluster partitioning, various things. Some bucket replication streams could stall, and that stall could be indefinite. Now, it is the case that they're fully recoverable, so someone can run bucket sync run, or further ingest on that shard will definitely continue the replication for it, and that's good, but we...
A
So we have a kind of an algorithmic problem. It seems like we would like to solve this in order to have replication be reliable the way ordinary people think of that. It seems like there are two paths you could go down to do that. The part of the data log operation that we care about is... I mean, never mind, let me say that differently. We can either make what we call the data log part something effectively transactional, which would presumably be more expensive than what we're doing now, or the alternative would seem to be to find some other way to put an upper bound on how long a bucket's replication could be stopped. For example, you could scrub all the buckets and sync in some periodic fashion, and resync anything that has stalled. I think that's sort of the story for me.
A
Related to that, there are other things besides writing into RADOS for the data log that you could undertake. You could do more; you could have a belt-and-suspenders sort of log concept, for example. That's fine, but it doesn't help by itself; it doesn't solve this sort of problem, but it narrows it a lot.
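A rough sketch of the "upper bound" path (all helper names invented, assuming some way to list buckets and detect a stalled sync): a periodic scrub restarts sync for anything stalled, so a lost data log entry can delay replication by at most one scrub interval.

```cpp
// Sketch of the "upper bound" path (all helpers invented): scrub every
// bucket on a schedule and restart sync for anything stalled, so a lost
// data log entry delays replication by at most one scrub interval.
#include <chrono>
#include <string>
#include <thread>
#include <vector>

std::vector<std::string> list_all_buckets() { return {"bkt-a", "bkt-b"}; }
bool bucket_sync_is_stalled(const std::string&) { return false; }
void restart_bucket_sync(const std::string&) {}  // like a manual sync run

void scrub_loop(std::chrono::minutes interval) {
  for (;;) {
    for (const auto& bucket : list_all_buckets()) {
      if (bucket_sync_is_stalled(bucket))
        restart_bucket_sync(bucket);
    }
    std::this_thread::sleep_for(interval);
  }
}

int main() { scrub_loop(std::chrono::minutes(30)); }
```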
D
I mean, I'm not certain if that's a correct description of the issue. Aren't we getting transactionality from RADOS itself? If we do a write, then either the write succeeds or the write fails. Okay, okay, hold on.
D
Right, but if the write fails, there are two things that can happen. Either it's a bug, and we should fix that, and I'd like to know about it, sure. Or... currently in RGW, writes don't time out; things should either return success or they should return an error, because there was an error, if the OSD...
A
There are cases... it doesn't matter exactly what it is; there are other things that can cause it, and it doesn't benefit us, but it certainly is infrequent. I mean, I'm not saying it shames us, but I think we need to bound this problem.
C
So the data log entry would be stored on the source zone, and I think the recovery would be happening on the destination zone, so I think the error repo is the right place to do that. The error repo is what the destination uses to track things that it needs to retry.
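For orientation, a tiny sketch of the role being described for the error repo (a hypothetical shape, not the actual implementation): the destination records failed sync attempts so a background pass can retry them later, independently of the source's data log.

```cpp
// Sketch of the error repo's role (hypothetical shape, not the actual
// implementation): the destination records failed sync attempts so a
// background pass can retry them, independent of the source's data log.
#include <string>
#include <utility>
#include <vector>

struct ErrorRepo {
  std::vector<std::pair<std::string, int>> failed;  // (bucket, shard)

  void record_failure(std::string bucket, int shard) {
    failed.emplace_back(std::move(bucket), shard);
  }

  // retry returns true on success; entries that still fail are kept.
  template <typename RetryFn>
  void retry_all(RetryFn retry) {
    std::vector<std::pair<std::string, int>> still_failing;
    for (auto& [bucket, shard] : failed) {
      if (!retry(bucket, shard))
        still_failing.emplace_back(bucket, shard);
    }
    failed = std::move(still_failing);
  }
};

int main() {
  ErrorRepo repo;
  repo.record_failure("my-bucket", 7);
  repo.retry_all([](const std::string&, int) { return true; });
}
```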
A
Why wouldn't... you know, all zones that can originate changes would presumably do this. I mean, well, it seems to me like the source zone would do this; why would it be the destination?
D
My thought was that we already do have a situation where, when we do resharding and whatnot, after the end of resharding, we just write one entry per bucket shard to every shard in the data log. And I'm wondering if we actually could do it on the source zone just by having a timer, for like every 10 minutes or every 30 minutes or whatever, where the source zone just goes through and does that one-entry-per-shard thing.
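A minimal sketch of that proposal (invented helpers, and the shard count is an assumption): on a timer, the source zone writes one marker entry to every data log shard, the same thing done after a reshard, so a consumer that missed a real entry is still woken up within one period.

```cpp
// Sketch of the timer idea (invented helpers; the shard count is an
// assumption): the source zone periodically writes one marker entry to
// every data log shard, as is done after a reshard, so a consumer that
// missed a real entry still gets woken up within one period.
#include <chrono>
#include <iostream>
#include <thread>

constexpr int kNumDatalogShards = 128;  // assumed; really configurable

// Stand-in for appending a no-op marker entry to one data log shard.
void touch_datalog_shard(int shard) {
  std::cout << "touched datalog shard " << shard << '\n';
}

void periodic_touch(std::chrono::minutes period) {
  for (;;) {
    for (int shard = 0; shard < kNumDatalogShards; ++shard)
      touch_datalog_shard(shard);
    std::this_thread::sleep_for(period);
  }
}

int main() { periodic_touch(std::chrono::minutes(30)); }
```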
A
That's what that means; I mean, all scrub strategies would look like that in some sense. But whatever we come up with has to... it could be done, but it could potentially be expensive. I mean, maybe; I hadn't assumed that it would be happening close to the limit of latency, yeah.
D
But it does seem like doing that thing where we write an entry for every shard, the thing that we do after the end of resharding, actually, you know, on a periodic basis, might be the lowest-effort and least complicated way to do it.
C
So I think I like the scrub strategy the best, where we would only trigger retries on things if we identify that they're behind.
B
So maybe, because anything that you want to write in the data log can fail, right, since the initial assumption here is that something wrong happened, on the destination zone you can say that if you didn't get a record for a shard for a given amount of time, then you'll do the sync for that shard.
B
But instead of doing that for everybody, or instead of the source just, you know, telling it to sync everything at some periodic interval... even if this time is long, there could be lots of shards. But if the destination knows that it didn't see anything for a specific shard for maybe a couple of hours, then...
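A sketch of that destination-side variant (hypothetical names and an assumed shard count): remember when each shard was last heard from, and defensively sync any shard that has been silent longer than the window.

```cpp
// Sketch of the destination-side variant (hypothetical names, assumed
// shard count): remember when each shard was last heard from, and sync
// any shard that has been silent longer than the window.
#include <array>
#include <chrono>

using Clock = std::chrono::steady_clock;
constexpr int kNumDatalogShards = 128;                // assumed
constexpr auto kStaleWindow = std::chrono::hours(2);  // "a couple hours"

std::array<Clock::time_point, kNumDatalogShards> last_seen{};

// Called whenever a data log record for a shard reaches the destination.
void on_datalog_record(int shard) { last_seen[shard] = Clock::now(); }

// Hypothetical hook that kicks off sync for one data log shard.
void sync_shard(int /*shard*/) {}

void check_stale_shards() {
  const auto now = Clock::now();
  for (int s = 0; s < kNumDatalogShards; ++s) {
    if (now - last_seen[s] > kStaleWindow)
      sync_shard(s);  // heard nothing for too long; sync defensively
  }
}

int main() {
  on_datalog_record(3);
  check_stale_shards();
}
```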
C
Yes. So, going back to Sonia's point about not having data log entries: the scrub thing wouldn't touch data log entries.
C
So yeah, I definitely agree that there's a gap, especially around crash consistency here, but I also agree with Adam that we should track down any other reasons that data log writes are failing.
E
We haven't actually looked closely at that, but Casey, at least in the bug which was filed upstream by Bloomberg, they did mention once, I think, that even after a retry it fails when there were some races. I think that got fixed now, but yeah.
D
Yeah, I mean, I don't think we get into the retries-exhausted case. Like, probabilistically maybe it can happen, but I'd be surprised; I suspect it's very, very improbable.
D
Yeah, I think the case that RGW can simply crash after writing an object is a good thing to want to recover from, and I think that actually suggests that we do need a scrub analog specifically.
D
I think I need to rebase... I think I need to not rebase, but I think I need to edit our current neorados pull request, since it does have, let me call it, an incompatibility with the new version of libfmt.
A
Also, review comments from Ilia.