From YouTube: Ceph RGW Refactoring Meeting 2022-10-19
Description
Join us every Wednesday for the Ceph RGW Refactoring meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
B
Okay, yeah, so I want to present a small project that I did with some students. It was part of a university course, pretty small, but I think really nice.
B
So the project targets the IoT/industrial space, where MQTT is the main protocol. Some of this information is needed for the short-term management of data, but some of it needs to be stored for later: for data scientists, for machine-learning training, or for anything else that needs long-term access to the data. S3 sounds like a good option for those kinds of use cases.
B
The problem is that MQTT is really designed for very small messages. Messages in MQTT come from, you know, probes and sensors measuring voltage, temperature, all kinds of things like that, and they usually have a very small payload. It's a very efficient protocol with very low overhead, and if we just naively take those messages and put them as objects in S3, that becomes an extremely inefficient process.
B
So the idea was to write a converter that aggregates those MQTT messages into objects and, according to some configuration policies, stores them as S3 objects.
B
So this was the project, and I have a quick demo showing how it works. Let's first look at the configuration of this converter. In the configuration you have a couple of things. You need to define the MQTT broker, and here the Mosquitto project has a very nice service: whoever wants to try it out doesn't need to install a Mosquitto broker on their laptop or their machine.
B
You can just use the one that they open up in the cloud for access. Of course, don't send anything sensitive there, because everybody can see it. So this is one leg of the converter; the other leg is the S3 endpoint.
B
The students actually didn't want to compile Ceph, so they tested it with AWS S3, whereas I compile Ceph every now and then, so I just tested it with Ceph and the RADOS Gateway. Then there's the aggregation logic: it could be based on the number of messages, the size in bytes, or time.
B
I,
don't
know,
there's
a
couple
of
conditions
there
that
you
can,
you
can
put
together
to
to
create
how
those
entity
messages
are
being
aggregated
and-
and
the
last
piece
here
is
the
mapping
between
buckets
and
topics.
So
here
my
topic,
one
in
one,
my
topic-
2
both
go
to
my
bucket.
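The pieces of configuration described above (broker, S3 endpoint, aggregation conditions, topic-to-bucket mapping) can be sketched as a small structure. All field names here are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical sketch of the converter's configuration. Every key name here is
# an assumption for illustration; the real project may use a different layout.
config = {
    # Leg 1: the MQTT broker. test.mosquitto.org is Mosquitto's public test broker.
    "mqtt_broker": {"host": "test.mosquitto.org", "port": 1883},
    # Leg 2: the S3 endpoint (a local RADOS Gateway, for example).
    "s3_endpoint": {"url": "http://localhost:8000"},
    # Aggregation policy: flush an object when any condition is met.
    "aggregation": {"max_messages": 5, "max_bytes": 65536, "max_seconds": 60},
    # Mapping from MQTT topics to S3 buckets; several topics may share a bucket.
    "topic_to_bucket": {"my-topic-1": "my-bucket", "my-topic-2": "my-bucket"},
}

def bucket_for_topic(topic: str) -> str:
    """Resolve which bucket an incoming message should be aggregated into."""
    return config["topic_to_bucket"][topic]
```

With this shape, both demo topics resolve to the same bucket, matching the mapping described in the talk.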
B
These are the configurations that are needed. As a side note: because we want to later containerize everything and make sure it runs nicely in Kubernetes, the broker and the S3 endpoint are probably going to be removed from the config file; they're going to be extracted automatically from the Kubernetes configuration, i.e. from their operators. There's a Mosquitto operator and, of course, there's Rook for the Ceph operator, so that doesn't need to be configured in an OpenShift or Kubernetes environment.
B
Now I'm going to use the Mosquitto project's client that can do publishing. I'm going to publish to the test Mosquitto server on the web, to a topic called "my-topic-1", because this is one of the topics that my converter is subscribed to, and I'm going to send a "hello world" message. I also have to remember to create the bucket, because the bridge doesn't create the bucket, so I have to create a bucket called "my-bucket".
B
Again, this is what the configuration said. And then I'm going to send the messages. You can see that it received the message, because it's subscribed to this topic, but it doesn't do anything with it, because the configuration says that you need to get five messages before they're aggregated into one object. So, after five messages, it created the object and wrote the object into the bucket.
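The flush-after-five behavior in the demo can be sketched as a minimal buffering loop: payloads accumulate until the message-count condition is met, then one object is written. This is a sketch, not the project's code; the S3 write is stubbed out and all names are illustrative:

```python
# Minimal sketch of the demo's aggregation behavior: buffer incoming MQTT
# payloads and write a single S3 object once five messages have accumulated.
class Aggregator:
    def __init__(self, max_messages=5, write_object=None):
        self.max_messages = max_messages
        # Stand-in for the S3 put; a real bridge would upload to the bucket here.
        self.write_object = write_object or (lambda data: None)
        self.buffer = []
        self.objects_written = 0

    def on_message(self, payload: bytes) -> None:
        self.buffer.append(payload)
        if len(self.buffer) >= self.max_messages:
            # Concatenate the buffered messages into one object and flush.
            self.write_object(b"".join(self.buffer))
            self.buffer.clear()
            self.objects_written += 1
```

Feeding it four "hello world" messages writes nothing; the fifth triggers exactly one object, matching what the demo shows.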
B
Another small thing that was also part of this project is the "opener" (not the best name). We do need some kind of logic that helps whoever wants to read those objects to extract them back into the original messages, because the person who writes the application that does the analysis, or presents it in a graph or something, doesn't care that we put everything in one object. That's just for our efficiency.
B
They
want
to
see
the
original
messages
so,
as
part
of
the
project,
we've
written,
like
a
small
python
code
that,
based
on
the
way
that
we
we've
written
the
object,
do
we
do
the
extraction?
The
the
thing
that
we
do
is
that
we
create
a
small
header
in
the
object
and
we
put
the
offsets
of
the
different
messages
together
with
their
timestamps
in
this
header.
So
you
know
with
five
messages.
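The header scheme described above can be sketched as follows. The exact layout (a length-prefixed JSON header of offsets and timestamps, followed by the concatenated payloads) is an assumption for illustration; the project's real format may differ:

```python
import json
import struct

def pack(messages):
    """Pack (timestamp, payload-bytes) pairs into one object body with a
    small header recording each message's offset, length, and timestamp."""
    entries = []
    body = b""
    offset = 0
    for ts, payload in messages:
        entries.append({"timestamp": ts, "offset": offset, "length": len(payload)})
        body += payload
        offset += len(payload)
    header = json.dumps(entries).encode()
    # 4-byte big-endian header length, then the header, then the payloads.
    return struct.pack(">I", len(header)) + header + body

def unpack(blob):
    """Invert pack(): recover the original (timestamp, payload) pairs."""
    (hlen,) = struct.unpack(">I", blob[:4])
    entries = json.loads(blob[4:4 + hlen])
    base = 4 + hlen
    return [(e["timestamp"], blob[base + e["offset"]: base + e["offset"] + e["length"]])
            for e in entries]
```

A round trip through pack/unpack returns exactly the original messages, which is the property the "opener" needs.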
B
It's
really
trivial,
but
I
mean
let's
say
that
for
efficiency
results,
we
want
to
create
very
large
objects
with
thousands
of
messages,
and
the
reader
doesn't
want
to
read
the
entire
object.
Then
they
can
use
the
they
can
read
the
header
of
the
object
and
figure
out
what
offsets
they're
interested
at
and
then
use
the
range
command
to
fetch
only
part
of
the
object
to
get
to
be
more
efficient
in
what
they're
looking
for
yeah.
This
is
pretty
much
it
in
a
nutshell,
this
small
small
student
project,
if
you
have
questions.
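The ranged-read pattern just described can be sketched end to end: fetch only the header, pick the message you want, then fetch just that byte range, as an S3 GET with a `Range` header would. The object layout below (length-prefixed JSON header, then concatenated payloads) is an assumed format for illustration:

```python
import json
import struct

def build_object(payloads):
    """Pack payloads with a header of {offset, length} entries."""
    entries, body, off = [], b"", 0
    for p in payloads:
        entries.append({"offset": off, "length": len(p)})
        body += p
        off += len(p)
    header = json.dumps(entries).encode()
    return struct.pack(">I", len(header)) + header + body

def ranged_get(blob, start, end):
    """Stand-in for an S3 ranged GET: bytes [start, end] inclusive,
    like an HTTP 'Range: bytes=start-end' request."""
    return blob[start:end + 1]

def read_message(blob, index):
    """Read one message without fetching the rest of the object body."""
    (hlen,) = struct.unpack(">I", ranged_get(blob, 0, 3))    # header length
    entries = json.loads(ranged_get(blob, 4, 4 + hlen - 1))  # header only
    e = entries[index]
    base = 4 + hlen
    return ranged_get(blob, base + e["offset"],
                      base + e["offset"] + e["length"] - 1)
```

Against a real S3 endpoint, `ranged_get` would become a GetObject call with a `Range` header, so only the header plus the selected message travel over the wire.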
A
Cool. The subject of how to do bulk uploads of small objects, or figure out a way to pack them efficiently, has come up a couple of other times, but I think this is interesting in that it's kind of an application-layer thing that's doing the packing.
B
Maybe with other, more general packing you would need some other kind of logic, so I think it does make sense to do it here. The original thinking was not to build our own small header in the object, but to use the actual head-object attributes. But I don't think we had enough space there in the attributes to put enough information, assuming we want to be able to have a long list of offsets of all the small messages in the object.
C
It appears to me that this is somewhat similar to multipart upload, which is what I've been playing with in the last two weeks. That has a dedicated meta file, which records which objects are there, similar to what you said about where the offset should be for each message, and it also has a list of objects for each chunk being uploaded, which may correspond to the aggregated object.
C
You were talking about grouping multiple messages into a single object, and in such a structure you achieve basically one layer of hierarchy, from the metadata, from the manifest, to the actual objects, from which you can retrieve the individual messages.
C
Of course, many improvements and variants can be derived from there, but I find some similarity between those two.
B
Yeah, I think multipart, though, is kind of built into the fact that you can't handle, or don't want to handle, one huge object, and it's pretty generic. Later on you kind of forget about all the parts in the multipart.
A
Yeah, and in the Swift API there's kind of explicit support for manifests of smaller pieces of objects.
D
So, you know, the init would define the structure of the data, then each of the puts would fill in that structure, and then you could list these things, get the header, which would have the structure defined in it, and then read out the structured data from that. There might be a generic way of doing effectively this kind of thing.
D
Yeah, I mean, I guess you'd probably have to tag the parts with a header that describes which of the pieces of the structure it fills, I guess, or something like that.
E
Hey, thanks for having us. I'll admit I have some experience with Ceph, but not a tremendous amount, so take anything I say with a grain of salt. But I was sort of helping to facilitate some contact between the Ceph team and the folks at Akamai/Linode, so obviously we've been looking at a possible mitigation for a problem.
E
I'm assuming everybody is probably familiar with it. In the case of Linode, probably the biggest challenge we have related to multipart orphan objects is that our accounting for billing purposes is incorrect. I think in general there are a few different scenarios whereby clients inadvertently upload multipart parts multiple times; I think in many cases it's buggy scripts or that sort of thing.
E
So for some time we've been trying to figure out how to deal with both culling the redundant parts in a time- and space-effective way, and also trying to implement some sort of fix.
E
That would prevent the issue from occurring. But one of the things we're very sensitive to: we have sort of a long history of taking third-party, off-the-shelf software and patching it in ways that are difficult or impossible to upstream later. So, to try to avoid falling back into that trap again, our aim here was to be able to contribute back a fix that would be generally usable, and we are sort of in the hacking phase on that now.
E
But I think when I reached out to Daniel, the real thing I was looking for was some guidance on what sort of fix might be upstreamable. And then we later learned in the email thread that there was a parallel effort underway. I haven't seen that code yet, so I don't know much about it. But that's my introductory spiel; I'm not sure where to go from there.
A
This is not about incomplete multipart uploads that just never get cleaned up, right? Okay, yeah. So this does sound like what Matt is working on in that PR, but I think it just got complicated and hasn't been finished. So I pinged him, but he has a conflict at the moment.
E
All right. So it seems like maybe a good next step would be to see if he and you can compare notes, because I think it would be helpful for us to know whether the strategy that we're thinking of has potentially been ruled out already.
E
Or, you know, if it's similar enough to what Matt has been working on, we could at least assist with the testing.
A
So, yeah, I think I can ask Matt on the PR itself to just give a high-level description of his design, and maybe you guys can take it from there.
C
Actually, yeah, I have a GitHub account. It's Iman77, the Y-I-M-A-77.
B
No, no, nothing too important. Just maybe a small note: we recently added the ability to nicely trace multipart uploads. So if the different parts go to different RGWs and so on, we added tracing abilities using Jaeger, and there's a specific mechanism there that is geared toward multipart, because this is usually a problem for many people.
B
So
instead
of
digging
lots
of
lots
of
log
files
on
multiple
rgw's,
you
can
see
the
the
nice
tracing.
Okay,
all
the
parts
and
maybe
it'd
be
easy
for
you
to
see.
Okay,
the
same
part
twice
or
a
cancel
or
something
that
didn't
finish
with
it
so
forth.
Oh
awesome,
yeah.
E
That would be nice. The question I had: I knew, sort of based on the name of this call and then doing a little bit of reading around, that at the head of the repo right now it looks like RADOS GW is getting some refactoring related to pluggable backends. So if we were to devise a fix, I think we'd be targeting like 16.x now. Does that code look radically different following the refactoring, or has the refactoring already occurred, or...?
D
That,
but
it's
it's
now
just
the
general
Upstream
planning
and
discussion
meeting
for
rgw,
okay,
stuff
stuff
is
generally
not
difficult
to
backboard.
Okay,
there
are,
there
are
some
specific
areas
in
which
it
will
be
more
difficult
than
others,
but
this
kind
of
thing
probably
will
not
be
particularly
difficult.
All
right.
E
I don't think I have any more specific questions, but, you know, if anybody at Akamai does, feel free.
C
I'm good. Having a PR is wonderfully helpful; I'll look into that.