From YouTube: Ceph RGW Refactoring Meeting 2023-03-15
Description
Join us every Wednesday for the Ceph RGW Refactoring meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute
What is Ceph: https://ceph.io/en/discover/
C
Yeah, I wonder... I don't know why, but whenever I hear that idea about scanning for orphans after everything, after every job, I sort of wonder if it might be more tractable to have a special, particular workflow that checks for references. Or... thank you.
A
Yeah, that's a good question. We have a specific job right now, I think it's under rgw tools, that uses the orphan-list and gap-list tools, and that does take quite a while.
A
Possibly, yeah, but it wouldn't give us as much coverage, I think. I think it would be valuable to catch issues from any of our testing. And the idea, at least with the s3-tests, is that they clean up all of the objects and buckets they test against, so as long as we run garbage collection before scanning for orphans, I would assume there would not be that much to scan.
D
Yeah, I mean: could we do it stochastically? So, like, 10 percent of the rgw verify runs, chosen at random: you know, just choose a random number in the script and run it. We'll catch things eventually, but it won't add much overhead.
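A minimal sketch of the stochastic gate described above, in Python. The 10 percent figure comes from the discussion; the `run_gc_and_orphan_scan` helper, the pool name, and the idea of shelling out to `radosgw-admin` and `rgw-orphan-list` are illustrative assumptions, not how the verify suite is wired up today.

```python
import random
import subprocess

# Probability that a given verify run also pays for the full orphan scan.
ORPHAN_SCAN_PROBABILITY = 0.10  # illustrative value from the discussion

def run_gc_and_orphan_scan(data_pool: str) -> None:
    """Hypothetical helper: flush garbage collection, then scan for orphans."""
    # Process all pending GC entries first so tail objects from ordinary
    # deletes are not misreported as orphans.
    subprocess.run(["radosgw-admin", "gc", "process", "--include-all"], check=True)
    # rgw-orphan-list is one existing tool for this; invoked here only as
    # an example of the expensive step being sampled.
    subprocess.run(["rgw-orphan-list", data_pool], check=True)

def maybe_scan_for_orphans() -> None:
    # Only a random ~10% of runs do the scan, so problems are still caught
    # eventually without slowing down every job.
    if random.random() < ORPHAN_SCAN_PROBABILITY:
        run_gc_and_orphan_scan("default.rgw.buckets.data")  # assumed pool name

if __name__ == "__main__":
    maybe_scan_for_orphans()
```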
D
Yeah, I believe the rgw orphan list test that's run currently exercises things quite a bit. I mean, it does S3, it does Swift. I'd have to look and remind myself what the scale is, how many objects are in it, but it does try to exercise every type of object possible.
D
Oh, I see what you're saying, maybe.
D
We might have to even go beyond the bi list. We might have to read in the manifest from the head object as well. Well, no, never mind, I'm getting into orphan-list territory, never mind.
A
Right, and like the bucket index for multipart uploads would have index entries for each part until the multipart complete happens, and then it should clean everything up, right? So maybe just scanning for the multipart entries could be enough for the script, uh-huh.
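One way the multipart-entry check could be scripted, as a rough sketch: it assumes the JSON shape of `radosgw-admin bi list` output and the conventional `_multipart_` prefix on part entries, and the bucket name is a placeholder. After multipart-complete and garbage collection, the returned list should be empty.

```python
import json
import subprocess

def leftover_multipart_entries(bucket: str) -> list:
    """List bucket-index entries that look like in-progress multipart parts."""
    out = subprocess.run(
        ["radosgw-admin", "bi", "list", "--bucket", bucket],
        check=True, capture_output=True, text=True,
    )
    entries = json.loads(out.stdout)
    # Part entries conventionally carry a '_multipart_' prefix in their index
    # key; the exact field layout here is an assumption for illustration.
    return [e for e in entries if "_multipart_" in e.get("idx", "")]

if __name__ == "__main__":
    leftovers = leftover_multipart_entries("test-bucket")  # placeholder bucket
    if leftovers:
        raise SystemExit(
            f"found {len(leftovers)} leftover multipart index entries")
```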
D
Got it, and that's definitely a case where pieces of an incomplete multipart upload appear in the bucket index, right.
D
But after the re-upload finally succeeds, we should not find any entries in the bucket index, and we should not find any objects in the data pool.
D
The re-upload achieves that... the re-upload process should achieve that, right?
D
Okay, so that's okay. I've dabbled in this Python stuff before, in these Python tests, so I should be good, yeah. If you could just send me your script, Matt, I'd appreciate it.
C
Will do. But one thing that I think should be mentioned: you should also test uploads to a versioned bucket; the same thing happens there. Okay.
B
Hi, yes, so the topic is about the archive zone request. This is a long-standing request where they are saying that the archive zone generates a new version even if only the metadata of the object has changed, and not really the contents of the object itself.
B
So a very naive approach would be, you know, to compare the new object version's checksum with the versions that already exist, and decide whether to discard or keep it. But we know that's going to be a very costly operation.
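As a rough illustration of that naive approach (not something RGW does today), here is a sketch using boto3 against an assumed archive zone endpoint, with the ETag standing in for a checksum. For multipart uploads the ETag is not a plain content hash, which is part of why a real comparison would be costly.

```python
import boto3
from botocore.exceptions import ClientError

# Hypothetical client pointed at the archive zone endpoint.
s3 = boto3.client("s3", endpoint_url="http://archive-zone.example:8000")

def is_duplicate_content(bucket: str, key: str, new_etag: str) -> bool:
    """Naively decide whether an incoming version duplicates the latest one."""
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError:
        return False  # no existing version to compare against
    # ETag comparison is only a stand-in for a real content checksum.
    return head["ETag"].strip('"') == new_etag.strip('"')
```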
B
Is this even something that we should consider solving at all, especially since we already have Matt's archive zone... I think it was the lifecycle policy, which sort of helps us expire the objects, so we can keep the number of objects on the archive zone under control.
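For reference, the kind of lifecycle policy being alluded to might look roughly like this on a versioned archive bucket; the bucket name, rule ID, and 30-day value are placeholders, and how an archive zone deployment applies it is configuration dependent.

```python
import boto3

s3 = boto3.client("s3", endpoint_url="http://archive-zone.example:8000")

# Expire noncurrent versions after 30 days so the archive zone does not
# accumulate object versions without bound.  Values are illustrative only.
s3.put_bucket_lifecycle_configuration(
    Bucket="archived-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            }
        ]
    },
)
```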
C
That's interesting. Well, I've been thinking about this for a while of late. I sort of think that, as a quick fix, it's not a great idea, yeah.
C
It occurs to me, and I've brought this up to a couple of people and it's been thought of before, that at some stage we could consider a content-addressable representation for objects, for the rados objects, perhaps as an option rather than a replacement for what we currently do. There are a couple of benefits that might have. One of them is that it might increase the information density of the object names in rados, which would be convenient.
C
But if you had some block-based, content-addressable blocks, I mean for our largest objects, across an entire bucket, say, or a group, or scoped some other way, then with our existing techniques for sharing objects and references it seems like you could get a fairly flexible dedup system.
C
If you used large block sizes, like the current stripe size, it would trivially handle the re-upload cases, and then it wouldn't be specific to one bucket, or to objects of the same name, etc.
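To make the idea concrete, here is a minimal sketch of block-based content addressing at stripe granularity. The 4 MiB stripe size, the pool name, and the `cas.<sha256>` naming scheme are assumptions for illustration, not an existing RGW layout.

```python
import hashlib

STRIPE_SIZE = 4 * 1024 * 1024  # assumed stripe/chunk size

def content_addressed_names(data: bytes, pool: str = "default.rgw.buckets.data") -> list:
    """Split an object into stripe-sized chunks, naming each chunk by its hash.

    Two versions that differ only in metadata (the archive zone case) yield
    identical chunk names, so the tail data is stored once and simply
    referenced by both versions.
    """
    names = []
    for off in range(0, len(data), STRIPE_SIZE):
        chunk = data[off:off + STRIPE_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        names.append(f"{pool}:cas.{digest}")  # hypothetical naming scheme
    return names

# Identical content yields identical chunk names, regardless of bucket or key:
assert content_addressed_names(b"x" * STRIPE_SIZE) == content_addressed_names(b"x" * STRIPE_SIZE)
```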
A
Right, so for this archive zone case, it would keep creating new versions for each change, but each of the versions would share the actual object data, since it's the same, yeah, yeah. My intuition is that, I mean, it seems like they do actually want to create new versions when the metadata changes; otherwise they just won't have a record of the current metadata.
C
Sort of tangentially, in our downstream discussions, like product-management-level discussions and above, there's substantial interest in relying more rather than less on our own kinds of data reduction. So I think that perspective justifies looking at long-term data reduction, rather than treating it as some kind of a hack or a point fix.
B
Okay, yeah. If we want to discuss this further, and since this is probably downstream related, I could add it as a topic for our Monday meeting discussion.
A
Yeah, on the agenda I added a link to a Trello card that's been on the upstream backlog for a long time. Yehuda was really keen on this, but nobody has really proposed a design for it.
C
But if we want to resurrect it... I'm excited to find out that it's been thought of before. In terms of, you know, something Or Friedmann brought to me: there was actually research by Gabi BenHanokh, and he seemed to show that if we had content-addressable rados object names, in a particular sense, we would not be relying to the current extent on these long...
B
Yeah, I just thought: that's right, yeah. I would actually like to make it more generic than just that one specific case, because I think we're already doing a lot of things that change the semantics, the semantics for the archive zone.
C
Yeah, that's exciting, so let's talk about it further. We raised a couple of interesting points. There are other dimensions to this, right: block-based versus fingerprinting, other techniques, but I think we should explore it. If objects were getting very large, we could perhaps apply some kind of scaling factor to the chunk sizes, or things like that; maybe that would be useful, but it would reduce dedup accuracy.
C
But from the point when I opened this up, when I was thinking along those lines, it was something with large chunks, like maybe stripe-sized chunks, and I think my presumption there is that in such a scheme what we're doing is eliminating identical blocks; we're not aggressively seeking out, you know, common blocks. But different strategies could probably coexist.
A
Yeah, I think if we're interested in pursuing this, maybe start a discussion on the upstream mailing list about design ideas to start with.