Ceph RGW Refactoring, 5 Jul 2023

From YouTube: Ceph RGW Refactoring Meeting 2023-07-05

Description

Join us every Wednesday for the Ceph RGW Refactoring meeting: https://ceph.io/en/community/meetups

Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute
What is Ceph: https://ceph.io/en/discover/

A

All right so first on the agenda is the topic of scaling up the number of buckets per user. Currently the list is stored in omap on a single rados object.

A

And so the obvious limitation there is the large omap warnings from rados, and as it grows too big that can slow down omap.

B

How many buckets are we talking about?

C

Well, in fact, performance isn't a safe way to describe this. When this feature, omap itself, was being designed, and other features like it, the RADOS team considered there to be, much as there already is for object data, an absolute max, which will never be significantly changed.

C

That applies both to the size of an object and, for omap, to a maximum number of omap keys. That's currently 200,000. It's never going to be much larger and it will not be allowed to change, because the design of RADOS replication and recovery depends on putting a fixed upper bound on the amount of data that has to be transferred during recovery.

C

There are a bunch of things that go with that, but that's a fixed point: it's not going to move.

A

Well, I don't think they really impose a hard limit other than this cluster warning, right? Because I know.

C

That customers are raising it? I'm using "fixed" in a mathematical sense. Maybe it's bad mathematician language too, but the point is this: the point is not really fixed, but it can never move far from where it is. The design limit will not move.

A

Right, and a lot of the cases where this has been a problem are places where we were treating omap as an unbounded set rather than partitioning it like, for example, bucket index resharding does.

C

Yeah, and resharding works fine for certain things, but I, for example, wouldn't consider it a good solution here. Going forward, I've gone through several cycles on this, and I think I understand what I would try next to tackle this.

C

Basically, a two-level omap index managed by a library that deals with the concurrency, with reader/writer and update consistency, but leaves it all in omap, rather than mixing in the problem of implementing our own, or adopting some commercial off-the-shelf or open source off-the-shelf, database technology. That's a whole different problem. I think, on balance, those two approaches have to be split apart.

C

You know, in a world where we were not relying on RADOS for certain things, we could be doing whatever we want for scalable metadata, but we need some strategy on RADOS.

A

I guess I don't understand your objections to a strategy of sharding similar to the bucket index, because to me it seems like a very similar problem.

C

Well, it is. I am proposing a strategy similar to sharding, but sharding based on hash does not meet all requirements, so some other strategy, like range splitting, is probably needed as an alternative. That's all I'm actually saying.

C

I mean, but you can support efficient enumeration.

B

We can also maybe take the FIFO approach.

B

I mean, assuming that you're going to have just a handful of those, then you're going to have a couple of objects in the list, and each object is going to have its own omap, right? So overall, search wouldn't be so bad.

C

I don't think that works, in searching and sorting terminology. But I mean, the technique that's used in CockroachDB is simple range splitting.

C

If we came up with a way to allow, for example, a small cache, with LRU management or whatever you choose, at a particular client endpoint to manage a two-level index, in fact it could be shared.

C

That's the fun part, but I think it is achievable, or worth adjusting RADOS primitives to get there if necessary. But I mean, you can then have at least 150,000 or, let's say, 200,000. If we take that number at its word, we have 200,000 times 200,000 objects.

C

Per omap range, you have different omap objects supporting it, in addition to the one object that names the group. That is roughly 40 billion objects, which I think is a large enough scale to handle almost anything you want to stick in.
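
The two-level, range-split index described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the design that was proposed in the meeting: the names `TwoLevelIndex`, `MAX_KEYS` and `capacity` are hypothetical, plain Python lists stand in for omap objects, and the concurrency and consistency handling that the library would need is left out entirely.

```python
import bisect

MAX_KEYS = 4  # hypothetical stand-in for the ~200,000-key omap guideline


class TwoLevelIndex:
    """Root maps the lower bound of each key range to a leaf; leaves hold sorted keys."""

    def __init__(self, max_keys=MAX_KEYS):
        self.max_keys = max_keys
        self.bounds = [""]   # lower bound of each leaf, kept sorted (the "root" object)
        self.leaves = [[]]   # leaves[i] holds keys in [bounds[i], bounds[i + 1])

    def _leaf_for(self, key):
        # Rightmost leaf whose lower bound is <= key.
        return bisect.bisect_right(self.bounds, key) - 1

    def insert(self, key):
        i = self._leaf_for(key)
        leaf = self.leaves[i]
        pos = bisect.bisect_left(leaf, key)
        if pos < len(leaf) and leaf[pos] == key:
            return  # already present
        leaf.insert(pos, key)
        if len(leaf) > self.max_keys:
            self._split(i)

    def _split(self, i):
        # Range split: the upper half of the keys moves to a new leaf.
        leaf = self.leaves[i]
        mid = len(leaf) // 2
        upper = leaf[mid:]
        del leaf[mid:]
        self.bounds.insert(i + 1, upper[0])
        self.leaves.insert(i + 1, upper)

    def list_ordered(self):
        # Leaves are already in key order, so no cross-leaf merge is needed.
        for leaf in self.leaves:
            yield from leaf


def capacity(keys_per_object):
    # One root of N entries pointing at N leaves of N entries each.
    return keys_per_object * keys_per_object
```

With 200,000 keys per object at both levels, `capacity(200_000)` gives 200,000 × 200,000 = 40,000,000,000 entries, which is the 40 billion figure quoted above.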

B

Do we need sorting for the buckets?

C

Sure, we need sorting in general.

C

I mean, if you don't need it, you could use hash splitting, and maybe this idea can be converged. But if Eric is here, Eric can talk about the complexity of converting the partitioned indexes into an ordered sequence, as is done for bucket listing. It's expensive and it's been prone to more complex bugs than anyone has a right to expect, partly because of other tricky semantics that were layered in with special kinds of objects in the range and stuff like that. But seriously, it's expensive.

C

It's complex and bug prone.
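
To illustrate why ordered listing over hash-partitioned indexes is costly, here is a minimal sketch. It assumes each shard is sorted internally while any name can land in any shard, so every ordered page has to read from every shard and merge the results. The function names are hypothetical and `zlib.crc32` is only a stand-in hash, not the one RGW uses.

```python
import heapq
import zlib
from itertools import islice


def shard_of(name, num_shards):
    # Stand-in hash for illustration; RGW's real shard hash is different.
    return zlib.crc32(name.encode()) % num_shards


def build_shards(names, num_shards):
    shards = [[] for _ in range(num_shards)]
    for name in names:
        shards[shard_of(name, num_shards)].append(name)
    for shard in shards:
        shard.sort()  # keys are kept sorted within a single shard
    return shards


def list_ordered(shards, marker="", max_entries=1000):
    """Return up to max_entries names greater than marker, in global order."""
    # Every shard has to be read past the marker, then merged (k-way merge).
    tails = ([n for n in shard if n > marker] for shard in shards)
    return list(islice(heapq.merge(*tails), max_entries))
```

Each page touches all shards even though it returns only `max_entries` names, which is the duplicated work that a range-partitioned layout avoids.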

C

Does this particular feature completely solve it, so that we're going to mandate it? No. But the general problem space, where you have an ordered sequence that we want to scale, that's real, and the only way to scale in RADOS is to split into independent partitions that will be distributed under the ordinary rules of RADOS.

C

That's the only way you get both arbitrary size and arbitrary mapping to PGs, which is needed for scaling performance, because all PG operations at an OSD go in order.

A

So I'm curious: it's not clear to me from the API spec for list buckets whether that actually requires ordering.

A

If not, then it would be a lot easier and more performant to use unordered listing and a sharding strategy.

C

Maybe, but I wish that we would not collapse the problem into that. I don't think that's wise.

A

I mean, I agree that the ordering of sharded listings is complicated for the bucket index. That is because it has to be paginated, so we have to duplicate a lot of requests. But the list buckets API has no pagination, so we can probably implement it a lot more efficiently.

D

We might have to list 40 billion, yes.

C

Yeah, I mean, again, as I said in my email, I don't think we should work from the API on this. You also made the case that we don't have to do this at all, because Amazon says that 100 buckets per user is just fine. Our users don't think so.

C

So it may be that we have to look at some other way to deal with that. They didn't provide pagination because the API assumes that you're not going to have very many buckets per user. I doubt that will be permanent, and I don't think our users and customers want to be there. This has been persistently raised over multiple years, so I think it's relevant.

C

If we just say that we're not going to paginate yet, I mean, it doesn't scale anyway.

A

Yeah, so obviously we could extend the API like we've done for other stuff.

C

This may be a case where, for whatever reason, the persistence of people wanting to do this is real. I think Amazon is going to want to solve this at some point. It may not look the same as our solution. I do not flatter myself that they read anything we do, but if they did, maybe we could even socialize it to Amazon.

B

Does pagination impose order? I don't think so.

C

I mean, pagination imposes order, hence generalize on order.

A

I mean, we do implement pagination for unordered listing. We just take the marker that they give us, hash it to the correct shard, and resume listing from that shard. Is that right, Eric?

D

You bet, that is correct.
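
The marker-based resume for unordered listing described above can be sketched as follows. This is a simplified illustration, not RGW's code: `list_unordered` and `shard_of` are hypothetical names, sorted Python lists stand in for index shards, and `zlib.crc32` is only a stand-in hash.

```python
import zlib
from bisect import bisect_right


def shard_of(name, num_shards):
    # Stand-in hash for illustration; RGW's real shard hash is different.
    return zlib.crc32(name.encode()) % num_shards


def list_unordered(shards, marker=None, max_entries=1000):
    """shards: list of sorted name lists. Returns (entries, next_marker)."""
    if marker is None:
        shard_id, pos = 0, 0
    else:
        # The marker is the last entry returned, so it lives in the shard it
        # hashes to; resume just after it in that shard.
        shard_id = shard_of(marker, len(shards))
        pos = bisect_right(shards[shard_id], marker)
    out = []
    while shard_id < len(shards) and len(out) < max_entries:
        shard = shards[shard_id]
        take = shard[pos:pos + max_entries - len(out)]
        out.extend(take)
        pos += len(take)
        if pos >= len(shard):
            shard_id, pos = shard_id + 1, 0  # shard exhausted, move to the next
    more = shard_id < len(shards)
    return out, (out[-1] if out and more else None)
```

A page only reads from the shard that holds the marker and the shards after it, so no cross-shard merge is needed. The trade-off is that results come back in shard order rather than name order.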

C

I'll yield if you want, but there will be a request and a need for ordering, and if it isn't already here, which I suspect it really is, it will occur.

D

And just to remind me: the operations where we use this would be listing a user's buckets, and then we'd also have to modify it for any kind of creation, deletion, and maybe some modifications of metadata. Are those the operations that would use this structure?

C

Yeah, writ large, I mean, in any place where we have to have an ordered sequence. But this special case, maybe not. But yeah, without the constraints imposed by single-object omap.

D

But for most operations, like reading an object from a bucket or writing an object to a bucket, we never look at this.

C

We don't look at this. Where we do look at this, and there are a lot of details in the write-up in this email thread that I'm on, which is sophisticated, is the way we manage quotas and some other stuff. I mean, this user bucket sequence is doing more things than just bucket listing, but those are more like CRUD operations and may not have any ordering constraint.

C

But then again, maybe they do, I don't know. But yeah, there's other data being stored in there that needs to be managed.

A

Right. Similar to the way that we accumulate bucket stats for objects in a bucket, we also, at the user level, accumulate user stats for each of their buckets. So there are the stats caches and the quota caches in the background that periodically flush and do writes to this, but I think.

B

I think also the rate limiting is doing something similar, right?

C

Well, probably not, because the rate limits as we defined them are entirely transient, but this is for persistent data, reliable data, reliable mappings, right?

A

You can set rate limits on users and buckets and we store that in the metadata, but the actual counting is done in memory per Gateway, not actually persisted.

C

I mean, that's an example: if we did want to make it deeper or cluster-consistent, we might converge that into Redis at some point, or something like that, but yeah, we probably wouldn't persist it.

A

All right, so maybe we're discussing longer-term designs around this, but in the shorter term it would help to add some more documentation around the existing user max buckets field, warning about the constraints of raising or removing those limits.

C

Totally, because I think we should be socializing that, and you make a good point about the Amazon limits. I think mostly the world knows it, the people that care, but the customers drive into it. Users drive into it.

A

Yeah. In terms of implementation for scaling, I do think the sharding model is what I would look at first, but maybe the first step would be trying to just document all of the constraints, where we do the reads and writes, and which ones we really want to optimize for.

C

Fair enough. I personally just think that the more important problem to solve is ordered sequences, or ordered mappings, because it's more generally applicable than just solving this.

A

I mean, I feel like the current index sharding works well for buckets up to a certain scale, which is generally towards the billions, and it seems unlikely that user buckets would need to scale quite to the point where that becomes the problem.

C

Well, maybe so.

C

But we're not hitting billions. We're hitting about 500,000. No, that's not true, sorry, maybe 500 million. That's a limit that needs to be surmounted, and it probably shows up elsewhere. It does work up to a point, but as we get there, I can discuss in detail the different cycles going around it.

C

We stopped optimizing at a point where optimizing the linearizing of the hash sharding was ceasing to deliver any benefit but was becoming incredibly complex, and it already involves this duplication, in other words.

A

Agreed. On the documentation side, I'm happy to volunteer to add a note to the rgw docs around the user max buckets, recommending that you never raise it above the large omap threshold.

C

Yeah, that sounds right.

A

I could also try to take some of the notes in my email, format them, and share them with the list, and we can continue discussions on a design there.

C

I support that.

A

Any other thoughts on this?

A

All right, next on the agenda is Yuval, about Lua package reloading.

B

Yeah, so I started to look at that. The idea is that currently it requires a restart of the rgw, and I want to avoid that.

B

So Casey, you suggested that I look at the watch/notify mechanism that is used for the zone, for the period reload, and I've looked at that. It should be quite straightforward and simple to do.

B

There were two things I wasn't sure about. First of all, what is the object that I need to create for the notifications, and whose responsibility is it to create this object? The notifications are coming from radosgw-admin and are received by the RADOS gateways, and I don't think it makes sense that the RADOS gateways would create the object, but radosgw-admin creates the object only when somebody wants to add a package. So I wasn't sure how the mechanism should work.

A

So you would just pick some known object name. I think the rest of the watch objects are generally in the rgw control pool, but for realm reload it might be in the root pool instead.

A

Any rados write operation will create it if it doesn't exist, and establishing a watch is a write, so it should just automatically get created the first time anybody tries to watch it.

B

Oh, okay, so I should just try to create the object before I do the watch?

A

You don't need to try to create it manually. Just sending the watch is a write, and it will create it if it doesn't exist.

B

Okay. For some reason, when I call the watch I get ENOENT. Okay, I'll have a look at that. My first thought there was that I don't really need this watch/notify mechanism.

B

I thought I'd just check periodically on the version of the object and see if the version changed; then I need to read it and do the installation. I already have a background thread, which I need anyway for the watch/notify mechanism to work.

B

I need that because I need to support the realm reload mechanism, so that I will always have the right rados pointer there. The whole thing works in the background; it doesn't work based on Beast requests, so I need something that supports the whole rados recreation mechanism. So if I put that in some background thread anyway, then I'm not sure I need the watch/notify mechanism. I can just do some simple polling to see if I need to do this reload.

B

I'm trying to see what kind of value the watch/notify mechanism gives me. Maybe timeliness, but that doesn't really matter in my case.
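
The polling alternative described above can be sketched as follows. This is a minimal illustration under assumptions, not the actual rgw code: `PackageListPoller` and its callables are hypothetical, with `read_version_and_list` standing in for reading the package list object and its version, and `install` standing in for the luarocks installation step.

```python
import threading


class PackageListPoller:
    """Re-run `install` whenever the stored package list's version changes."""

    def __init__(self, read_version_and_list, install, interval=5.0):
        self._read = read_version_and_list  # returns (version, [package names])
        self._install = install             # called with the package list
        self._interval = interval
        self._last_version = None
        self._stop = threading.Event()

    def poll_once(self):
        version, packages = self._read()
        if version == self._last_version:
            return False  # nothing changed since the last poll
        self._install(packages)
        self._last_version = version
        return True

    def run(self):
        # Intended to run on a background thread until stop() is called.
        while not self._stop.is_set():
            self.poll_once()
            self._stop.wait(self._interval)

    def stop(self):
        self._stop.set()
```

Polling trades timeliness for simplicity: a change is picked up at most one `interval` late, and the admin side gets no confirmation that every gateway has reloaded.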

A

I mean, is this really worth the effort compared to just saying: when you change the packages, restart rgw? The reload versus a restart is not that different in the first place.

B

It is. I mean, a restart really kills the process and starts it all over. This is an outage in the system. My reload causes almost no outage, because I'm not doing the pause/resume reload; I'm just installing a couple of packages on the local disk.

A

Well, currently, the reload is tearing down the store, RGWRados, and all of its threads and stuff.

B

No, but I'm not using this reload. I need something that is aware of this reload, so that whenever I go and read the object and try to install the packages, I will have the right rados pointer and not the old one. But other than that, I'm not really doing a reload. This package reload is just invoking some external command that changes stuff on disk. It doesn't do anything else.

B

It doesn't block or do anything to the operation of the RADOS Gateway.

A

Sorry, how do you avoid blocking? I thought the point was to avoid running Lua scripts while the directory is changing or reloading.

B

Well, I don't think so. I don't think there's really an issue there. I mean, the only bad thing that can happen is, okay:

B

Think about the case where you have a script that relies on some packages and then you change it to a script that relies on other packages that are not installed yet. Or maybe you install the packages first and then you change the script. Either way, there is going to be some period in which the script and the packages are not going to match. So whether this is because the directory is now locked or busy or something doesn't really matter. I don't think there's a real issue there.

B

I mean, there's an issue, but because the whole thing is very non-atomic, you're going to have an issue anyway. Probably the right way to do that is just to remove all scripts, do the install, and then put the new script in, and that's it. And you can do that without any interference to the operation of the rgw.

A

Sorry, I guess I was misunderstanding what exactly needs to reload, then.

B

Well, the only thing that the reload does is call the luarocks command line to install a couple of things in a directory.

B

Now, while I'm changing the files in this directory, if Lua scripts are trying to read those files because they have dependencies on the external packages there, then they're not going to work. But there's no atomicity between the scripts and the packages installed on disk anyway. So probably the right way to do that is to remove the scripts, install the packages, install the new scripts, and that's it.

B

So I'm not going to pause/resume. The only reason I'm concerned about pause/resume is that the code that installs the packages needs to read the list of packages from some system object, and for that we need rados. So it needs the right one, in case somebody called realm reload on the rgw at the same time.

A

Okay, so some admin command is writing new entries into the list of packages that we store in rados, and we need to tell each rgw to reload that list and use luarocks to install those packages locally.

B

Yes. So if you're saying that the notify command should create the object, then I'm just going to investigate and see why it's not happening, and maybe just skip this mechanism.

B

Because if the watch command needs to be retried in the background, then it's really pointless, and I'll just read the object in the background, see if it changed, and then do the work, instead of relying on the notifications.

A

Yeah, I don't know. One thing that watch/notify does give you is that it blocks the notify until all clients respond, and you could essentially use that response to know that all rgws have reloaded the state.
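
The blocking behaviour described here, where the notifier only returns once every watcher has responded, can be illustrated with a toy in-memory model. This is not librados: `NotifyChannel` and its methods are hypothetical names, threads stand in for gateways, and the return value of each callback plays the role of a watcher's acknowledgement.

```python
import threading


class NotifyChannel:
    """Toy stand-in for a watched object: notify() waits for every watcher's ack."""

    def __init__(self):
        self._watchers = {}
        self._lock = threading.Lock()

    def watch(self, watcher_id, callback):
        with self._lock:
            self._watchers[watcher_id] = callback

    def notify(self, payload, timeout=5.0):
        with self._lock:
            watchers = dict(self._watchers)
        acks = {}
        threads = []

        def deliver(wid, cb):
            acks[wid] = cb(payload)  # the return value is this watcher's ack

        for wid, cb in watchers.items():
            t = threading.Thread(target=deliver, args=(wid, cb))
            t.start()
            threads.append(t)
        for t in threads:
            t.join(timeout)  # each join waits up to `timeout` seconds
        # Watchers that did not answer in time are reported, not retried.
        missing = sorted(set(watchers) - set(acks))
        return acks, missing
```

An admin command built on this pattern could treat an empty `missing` list as confirmation that every gateway has reloaded before it re-enables scripts.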

B

Good, yeah.

A

Because if you were waiting to re-enable scripts based on that, then it might be better to have actual consistency around it.

B

Right, yeah, and also to make sure that you didn't hit multiple reloads while the rgws are still reloading or something like that. That would also prevent it. So that's a good reason to use it. I'll look deeper into that.

A

Duplicate notifies are definitely an issue that I saw in the realm reload stuff, and it has special cases for that.

B

Yeah, so I guess if we block radosgw-admin until all RADOS gateways have replied, then at least from that admin nobody would resend another notify.

B

Okay, there was another small thing that I noticed there, which is the question of zipper. Because of the notification, this code is really rados specific, but the code sits above the zipper line, so I wasn't sure what the right approach is there.

A

Yeah, as far as I know, we don't have any analog of watch/notify for DBStore or other backends.

C

Well, there's certainly the intent to create it, but whether we have to put that ahead of other things, I wouldn't say. My goal is to try out Redis as a potential solution for the group communications that you would use for that, since it can have similar semantics. We can use other things, but we have to introduce something like it, and I think we should look at that.

E

This is the kind of thing that has to be fixed as part of the modularization, I think, because otherwise it'll be a build failure. So we're going to have to API and stub it somehow, I think.

B

Right, yeah. So what I was wondering is not whether we have other mechanisms for the watch/notify, but whether to at least put that under the zipper line and abstract it, and then it'll be like an empty implementation, like many of the implementations we have.

C

Does it belong under the zipper line? I mean, does zipper provide the notification services, or is it something that composes with it?

C

We don't need rados to do it at all.

E

It has to be an API, right? The question is whether it belongs in zipper or not. I mean, we have other stuff, like the config store, that's the same kind of thing but isn't actually part of zipper.

C

Well, here, I mean, watch/notify happened to use rados because rados was there and rados would always be there. But considering that it's actually a notification between all the nodes in the RGW cluster, routing through a backend is acceptable, but by no means required by the concept. The other ways I would do it would not use rados.

A

The abstractions to store this list of Lua packages are in zipper, and the luarocks stuff to actually install those is above the line and just relies on zipper to read the list.

A

I think it might make sense to extend that existing API to also be able to pass in some observer interface. If the backend supports notifications, then it could call your callback, based on watch/notify for rados.
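
One possible shape for such an observer interface is sketched below. It is an illustration only and does not reflect the real zipper API: every class and method name here is hypothetical. The idea is that a backend with a notification primitive (watch/notify in the rados case) calls the observer, while a backend without one inherits a `subscribe` that reports that notifications are unsupported.

```python
from abc import ABC, abstractmethod


class PackageListObserver(ABC):
    @abstractmethod
    def on_packages_changed(self):
        """Called when the stored package list may have changed."""


class LuaPackageStore(ABC):
    @abstractmethod
    def list_packages(self):
        """Return the stored list of package names."""

    def subscribe(self, observer):
        """Register for change notifications; returns False if unsupported."""
        return False


class NotifyingStore(LuaPackageStore):
    """Stand-in for a backend that has a notification primitive."""

    def __init__(self):
        self._packages = []
        self._observers = []

    def list_packages(self):
        return list(self._packages)

    def subscribe(self, observer):
        self._observers.append(observer)
        return True

    def add_package(self, name):
        self._packages.append(name)
        for obs in self._observers:
            obs.on_packages_changed()


class SingleProcessStore(LuaPackageStore):
    """Stand-in for a backend without notifications; keeps the default subscribe()."""

    def __init__(self):
        self._packages = []

    def list_packages(self):
        return list(self._packages)
```

Code above the abstraction line could call `subscribe` once at startup and fall back to polling, or to requiring a restart, when it returns `False`.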

B

Yeah, I was thinking about that, but then I saw that the realm reload watch/notify mechanism itself is above the zipper line. It is also using rados directly.

E

Yeah and again that will have to be fixed by the modularization.

E

So, like I said, Caleb's not here, but he may have already encountered this and come up with a fix for it.

A

No, we're not really going to full modularization in the first step. We're still linking rados directly into rgw for the time being.

A

But the logical conclusion of that project would address that.

E

Oh, but yeah, I mean, realm reloader has to be fixed at some point.

C

I got the impression we were very clear. I guess I was wrong. I'm not quite sure, then, what the current increment actually does. I thought it was a modularization.

A

It does some modularization. It introduces loadable modules. In the case of rados, we would be able to dlopen it but also link dynamically to the same thing, which I'm not crazy about, but apparently it's working for him.

C

That's actually troublesome.

C

I think historically, but okay.

A

So yeah, you mentioned watch/notify for realm reloading. Eventually that will be abstracted somehow, probably in the config store, which stores the period and realm stuff. I think that some store backends just won't support notifications and won't implement that, and I think that'll be okay.

A

Similarly, for the Lua list, I think the zipper API will just need to have some abstraction for observing changes.

E

Some stores won't support multiple rgws anyway. DBStore doesn't, so you don't really need a watch/notify, because there's only one possible rgw that can access the data.

A

I don't know, it is kind of weird, because radosgw-admin can do writes in the meantime.

A

If DBStore had any kind of metadata caching, then that would come up for the invalidate messages, but it doesn't cache metadata currently. So, Yuval, has this discussion helped? Do you see a way forward?

B

Yeah, I'll investigate a little further on the watch/notify and see why it's not working for me, and I would create those abstractions, so at least the Lua package notifications will be abstracted.

A

All right, sounds great. It might help to look at the use of watch/notify for the metadata cache to see whether it has to actually create things. I assumed that just watching would create the objects. Maybe I'm wrong.

B

I guess there shouldn't be any harm if I try to create the object. I don't want to write anything to the object when I'm just watching, but if I just open the object for writes, that should create it if it doesn't exist, right?

A

Right, you can always add a non-exclusive create before the rados watch in the same OSD op, and that'll create it if it doesn't exist.

A

Any more thoughts on this, either on the Lua side or related to the zipper abstraction?

A

Any other topics for the agenda?

A

Anything new on the Bloomberg side?

B

Nope, I don't think so, really.

A

Thanks. I guess we'll call it here.