Ceph Month 2021, 10 Jun 2021

From YouTube: Ceph Month 2021: RGW Update

Description

Presented By Casey Bodley
Ceph Month 2021 schedule: https://pad.ceph.com/p/ceph-month-june-2021

A

Well, welcome everybody to another Ceph Month, week two. This time we're going to have a RADOS Gateway update with Casey, followed by two birds-of-a-feather sessions: one on the Ceph research and scientific computing group, and the go-ceph get-together, which is the Go APIs for Ceph. So Casey, will you please take it away for the first presentation of week two?

B

Gladly. Thanks, Mike. As promised, here's the RGW update. Thanks for coming. Just kidding. We've got a lot of cool feature development on the roadmap. First up is S3 Select. This was new in Pacific, supporting SQL-like queries against objects in CSV format.

B

For Quincy, though, we're filling out support for the query types and functions, and also working on support for the Parquet format, which should allow some extra optimizations, for example moving some of the query execution into OSDs, so they can run in parallel and RGW doesn't have to read all of the data.
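As a rough illustration of what an S3 Select query against a CSV object looks like from the client side, here is a minimal Python sketch that only builds the request parameters. The parameter names follow the AWS S3 `SelectObjectContent` API as exposed by boto3; the bucket, key, column names, and the helper `build_select_request` are hypothetical, and nothing here is taken from the talk itself.

```python
def build_select_request(bucket, key, expression):
    """Build keyword arguments for an S3 SelectObjectContent call on a CSV object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Treat the first CSV row as column names so the query can refer to them.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        # Ask for the matching rows back as CSV.
        "OutputSerialization": {"CSV": {}},
    }


# Hypothetical bucket, object, and columns, for illustration only.
params = build_select_request(
    "demo-bucket",
    "people.csv",
    "SELECT s.name FROM S3Object s WHERE s.city = 'Pune'",
)
# With boto3 installed and a client pointed at an S3-compatible endpoint,
# these parameters could be passed as client.select_object_content(**params).
```

The point of the feature is that only the selected rows travel back to the client, rather than the whole object.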

B

Gal and team are doing great work there. We're also looking at S3 bucket inventory to help manage listing and searches of very large buckets that are slow and expensive to list.

B

So a background process just builds an index of these large buckets in a format that is convenient and cheap to read, so it kind of trades some of the consistency of our bucket index for performance.

B

Also SSE-S3 bucket encryption, which is transparent, policy-based object encryption in RGW. Unlike the existing flavors, SSE-C and SSE-KMS, this can encrypt objects without the client requesting it on upload.
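To make "without the client requesting it on upload" concrete, here is a minimal Python sketch of a bucket-level default encryption rule in the shape used by the AWS S3 `PutBucketEncryption` API. That RGW's SSE-S3 work accepts exactly this request is an assumption, not something stated in the talk; the helper name and bucket name are hypothetical.

```python
def default_encryption_config(algorithm="AES256"):
    """Build a ServerSideEncryptionConfiguration that sets a bucket-wide default."""
    return {
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": algorithm}}
        ]
    }


config = default_encryption_config()
# With boto3, this could be applied once per bucket as:
#   client.put_bucket_encryption(Bucket="demo-bucket",
#                                ServerSideEncryptionConfiguration=config)
# Later uploads would then carry no encryption headers from the client.
```

The contrast with SSE-C and SSE-KMS is that those are requested per upload through request headers, while this rule is set once on the bucket.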

B

So we recently started a collaboration with some engineers from Flipkart, and we're excited to see progress here. Also, request rate limiting per user and bucket: we've got a proof of concept for this working, so expect to see that in Quincy. We've also been working on some data caching projects with the Mass Open Cloud, with the goal of accelerating workloads with some local data caching.

B

We're also planning to build on Jaeger tracing support in RGW and use it to do some performance evaluation and optimization work. Next, introducing Project Zipper, which is a project led by Dan Gryniewicz, where we've built an abstraction in RGW for the librados backend.

B

So the goal is to let us plug in other non-RADOS backends or stack layers on top of the RADOS backend.

B

One of the cool ideas is to have a policy layer that enables some Lua scripting to control how requests are processed before they get written out to RADOS. This interface can also be really useful for some benchmarking and performance work, where we drive a workload just against specific parts of the interface. Or, if we have something like a memory backend, then we can benchmark the frontend code without the overhead of writes to RADOS.
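The layering idea can be sketched in a few lines of Python. This is only an illustration of a pluggable store interface with a stackable layer and a memory backend; the class and method names are hypothetical and do not reflect the actual RGW interface, and a plain Python callable stands in for the Lua policy script.

```python
from abc import ABC, abstractmethod


class Store(ABC):
    """Hypothetical storage interface; not the actual RGW abstraction."""

    @abstractmethod
    def put(self, bucket, key, data): ...

    @abstractmethod
    def get(self, bucket, key): ...


class MemoryStore(Store):
    """In-memory backend, useful for exercising frontend code without real writes."""

    def __init__(self):
        self.objects = {}

    def put(self, bucket, key, data):
        self.objects[(bucket, key)] = data

    def get(self, bucket, key):
        return self.objects[(bucket, key)]


class PolicyLayer(Store):
    """Stacks on another Store and runs a policy hook before each write."""

    def __init__(self, next_store, policy):
        self.next_store = next_store
        self.policy = policy

    def put(self, bucket, key, data):
        if not self.policy(bucket, key, data):
            raise PermissionError(f"policy rejected {bucket}/{key}")
        self.next_store.put(bucket, key, data)

    def get(self, bucket, key):
        return self.next_store.get(bucket, key)


def max_size_policy(bucket, key, data):
    """Example policy: only allow objects of at most 16 bytes."""
    return len(data) <= 16


store = PolicyLayer(MemoryStore(), max_size_policy)
store.put("demo", "small", b"hello")
try:
    store.put("demo", "big", b"x" * 32)
    rejected = False
except PermissionError:
    rejected = True
```

Because the frontend only sees the `Store` interface, the same calls work whether the bottom of the stack is a memory backend or something persistent.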

B

So, a lot of opportunities from Project Zipper on the horizon. And last, multi-site. A lot of work in progress here, mainly dynamic resharding support, which has been a long time coming. We've built up a team of developers to help with that, and they're doing amazing work.

B

Also, lifecycle transitions to the cloud. This is built on Yehuda's work for storage classes and lifecycle transitions, and allows you to tier your objects to an external S3 service. And, lastly, sync from cloud.

B

Yehuda is working on support for identifying changes in S3 so that they can be replicated back into Ceph via multi-site.

B

So that's all I've got. I'm happy to take questions.

C

Hey Casey, this is Prasad from Flipkart. That's an impressive set of features. I was just keen on S3 Select support, bucket inventory, and rate limiting. At Flipkart we have a rate limit infrastructure, which is a patch on top of RGW, but with a lot of limitations. So I was wondering, what's the thinking about this rate limiting? I mean, is it based on dmClock? Is it a moving window? What are the contours of it?

B

Yeah, good question. We had been looking at distributed rate limiting, so that the cluster as a whole could give a consistent rate. But I think we're going to just go with a per-RGW one, which is a lot easier to implement.

B

So it's not based on dmClock, and it's just going to be per RGW for now.

C

And with a sliding window or, you know, fixed-window time slices? It's possible to get a spiky pattern in terms of requests.

B

Yeah, I'm not exactly sure how it works.

C

Okay, okay. I am eager to learn more because, like I said, we have an internal implementation, which is again on RGW itself, but to be honest we are not very happy with it: first, it's not a distributed rate limiter, it's per RADOS Gateway, and second, it's a fixed-window protocol. So if you set a rate limit on a per-minute basis, it's possible to front-load all the requests at the start of the minute and then get a sawtooth sort of pattern.
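The fixed-window problem described here can be shown with a minimal Python sketch that compares a fixed-window counter with a sliding-window log. Both classes and all the numbers are hypothetical; they are not taken from RGW or from the Flipkart patch, and the sliding-window log is just one common alternative, not what either implementation uses.

```python
from collections import deque


class FixedWindowLimiter:
    """Counts requests per aligned window; the counter resets at each boundary."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.current_window = None
        self.count = 0

    def allow(self, now):
        window_id = int(now // self.window)
        if window_id != self.current_window:
            self.current_window = window_id
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowLimiter:
    """Keeps timestamps of accepted requests from the last `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.accepted = deque()

    def allow(self, now):
        # Drop timestamps that have fallen out of the trailing window.
        while self.accepted and self.accepted[0] <= now - self.window:
            self.accepted.popleft()
        if len(self.accepted) < self.limit:
            self.accepted.append(now)
            return True
        return False


# 10 requests just before a minute boundary and 10 just after it.
burst = [59.0 + i * 0.01 for i in range(10)] + [60.0 + i * 0.01 for i in range(10)]
fixed = FixedWindowLimiter(limit=10, window=60)
sliding = SlidingWindowLimiter(limit=10, window=60)
fixed_ok = sum(fixed.allow(t) for t in burst)
sliding_ok = sum(sliding.allow(t) for t in burst)
```

With a limit of 10 per minute, the fixed window lets all 20 requests through within about a second because the counter resets at the boundary, while the sliding window admits only the first 10.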

C

Yeah, those are some of the problems that we have. And with respect to S3 Select, we would be supporting only those features which AWS S3 already supports, right? Like the SQL query, or the JSON or the CSV parser.

B

That's my understanding, that it's intended to be compatible.

C

Okay. Would we be open to having more functions which are not provided by AWS S3, but which RGW can handle?

B

Yeah, I don't see why not.

C

Okay, okay. One of the things that we were contemplating, although we are not going to work on it, is to be able to select a single page out of a multi-page PDF.

C

Yeah, just a stray thought. It's not something that we paid serious attention to, but in cases where the objects are actually huge, multi-page documents, if we could just extract on a per-page basis, I thought it might make sense.

B

Interesting yeah. I think there are a lot of cool applications for this.

A

Another question that came in chat: is it planned to test Project Zipper against two backends with Ceph clusters?

B

Two different RADOS backends, you're saying?

A

They just say two backend Ceph clusters, but I'm guessing yes.

E

Yeah, I mean, currently we are using NFS-Ganesha in some clusters to access the object data, and we are wondering if it's possible to take two different Ceph clusters and connect them with Project Zipper to one RADOS Gateway, and then serve data through NFS-Ganesha.

B

I don't think that's one of the use cases that we were thinking about, but it would definitely be possible.

B

One of the ideas that we had was to support kind of tiering to other clouds, so you might store a copy locally in the RADOS backend and also mirror it somewhere else. So I could see that working with two RADOS backends similarly.

E

Okay, thanks, that's good information for us.

F

I had a couple of quick questions about some gaps that I ran into. One question is around the Ganesha export support.

F

Looking at what's documented right now, it looks like it only supports exporting buckets when multi-site isn't enabled. There isn't a way to specify which zone, sorry, not which zone but which realm, the bucket that you're trying to export exists in.

F

Do you know, is that a known gap, or is that a documentation issue, or a me-being-confused issue? Or do you know who I should go to?

B

So librgw, which is loaded by NFS-Ganesha, would run under the context of a zone, and so it could only export buckets from the realm that that zone is in.

F

Okay, and is there a single instance of that per Ganesha?

F

Okay, the daemon is probably tied to it. So, okay, but in principle it should work even in a multi-site situation.

B

Agreed yeah yeah.

F

Is Dan the right person to talk to about the details there, do you know? Yeah, okay, cool.

A

Okay, another question that came into chat: does data caching still foresee using nginx caching, or would it be internal to the RADOS Gateway?

B

These projects are internal.

B

I would love to hear people's experiences with the nginx cache, though.

A

Enrico, do you want to elaborate a little more?

G

Yeah, can you hear me first? Yes? Okay, excellent. I was just curious about this because, actually, I'm from CERN, and we are still running Nautilus for our S3 clusters. So at the moment we have no caching in between, but we are planning to upgrade our Ceph version. I know that in the next one the nginx caching layer would be supported, but then, from my understanding, this new data caching would be something implemented in the RADOS Gateway itself.

G

So can you maybe expand a bit more on that? Would we be able to take advantage of this caching just by addressing the RADOS Gateway, without any other reverse proxy in between?

B

Yeah, that's definitely the goal. The first step is just having RGW cache data locally, but there's another layer to this research project, level 2, which is basically localized caching, so some objects would be placed or pinned to a specific RGW for the cache, and there would be redirects to get to the right cache.
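The idea of pinning objects to a specific gateway and redirecting to it can be sketched as follows. The talk does not say how objects are assigned to gateways; this sketch assumes rendezvous (highest-random-weight) hashing purely for illustration, and the gateway names and function names are hypothetical.

```python
import hashlib


def owner_for(object_name, gateways):
    """Pick the gateway with the highest hash score for this object (rendezvous hashing)."""

    def score(gateway):
        digest = hashlib.sha256(f"{gateway}:{object_name}".encode()).hexdigest()
        return int(digest, 16)

    return max(gateways, key=score)


def route(object_name, receiving_gateway, gateways):
    """Serve locally if this gateway owns the object's cache slot, else redirect."""
    owner = owner_for(object_name, gateways)
    if owner == receiving_gateway:
        return ("serve", owner)
    return ("redirect", owner)


# Hypothetical gateway names, for illustration only.
gateways = ["rgw-a", "rgw-b", "rgw-c"]
owner = owner_for("bucket/object-1", gateways)
decision_at_owner = route("bucket/object-1", owner, gateways)
```

Every gateway computes the same owner for a given object, so each object is cached in one place and any other gateway knows where to send the client.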

G

Well, at the moment I cannot really report our experience with the nginx cache because, as I said, unfortunately we cannot use it at the moment. Just another quick question, I might have missed it at the very beginning: are these new features coming to Pacific, or do you foresee them also being backported, or are they for future releases?

B

Most of this is for the next Quincy release, but I believe that the first round of data cache work will be backported to Pacific.

G

Okay, good to know.

A

Seeing a plus one for the nginx caching as well. Oh, go ahead, Sage.

F

Yeah, I had a question about some of the Zipper stuff with the non-RADOS backends.

F

It feels like we have a whole slew of different scenarios that we're looking at here, with the local database for the metadata being one of the first ones. Just to make sure I'm understanding what this initial target is: this is basically the local zone metadata being stored in, is it like SQLite or something like that?

F

It's SQLite, okay. So the constraints there would be that you'd have a single daemon instance for that zone, and it would make sense for like an edge deployment or something like that.

B

Yeah, and I'm not even sure that the database backend would be something that we prioritize and support. I think it's more of a proof of concept to get the interfaces right.

F

Right, okay. I mean, that makes sense as a development milestone. I think there are edge scenarios where it might make sense too, where you have a larger deployment in the cloud or in your data center or whatever, but then you have a bunch of edge sites that are producing data and just want to write things locally initially and then use multi-site to sync them out.

F

I wouldn't discard that completely. The thing I wonder about that scenario, once you start thinking about what it would look like in production, is that it seems like there would be a bunch of additional steps necessary to make the RADOS Gateway run without a Ceph cluster there, given the way that the Ceph configs are managed, or all the random things that we do where we talk to the monitor for runtime state.

B

Yeah, agreed, and I don't really see the Ceph cluster ever going away, at least personally. So I think we'll always talk to monitors and use Ceph config, even if the backend is not RADOS.

H

It can be a very minimal Ceph cluster, right, a single monitor on the edge, and if you're already there, put just a single OSD and do that instead of the database backend. One gap that the database backend has at the moment, a problem with the idea of edge deployment, is that it doesn't support multi-site right now.

F

Is that just a matter of having Zipper interfaces for all the other metadata stores, or something like that?

H

Yeah they in the packaging. All that stuff is enough. Yeah, it doesn't happen.

F

Yeah, okay, that makes sense. Well, I guess if you have that single mon, single OSD, single manager in place, then it isn't strictly necessary to satisfy this use case; it would be more of an efficiency thing anyway, right?

F

That makes sense. Okay.

B

Thanks, Sage. Thanks.

D

One question, it's faster to say it than type it. It's Dan from CERN. I don't know if you mentioned it, because I joined about one minute late, but in the past there were discussions that the index format, the bucket index, like using omap for indexes, might change in a future release. Is that planned in the foreseeable future, or are the index formats stable as far as you can see?

B

We've definitely put a lot of thought into alternate indexing schemes, and they all tend to be very complicated. So I don't expect a new index format in Quincy. Past that, I'm optimistic.

D

Okay. I assume that for naive users, if you do change that, this would be a kind of online conversion type of scenario, right? Or maybe only new buckets would be. Okay, whatever. I guess you haven't thought about it too much at all, but.

B

Well, we have an abstraction for resharding that can track different types of indexes, so you could reshard from a bucket index in omap to some other format.

D

Good point. Yeah, that's true. Okay, cool, thanks.

F

One other question on that resharding: do we have an updated timeline on when that's expected? That particular piece is expected to be backported to Pacific, right?

B

We do have some of the reshard work in Pacific, but it doesn't really change how resharding works, and it doesn't add support for any other index types.

H

Really, the multi-site resharding support.

F

Yeah, that's what I meant.

B

I don't believe that we're planning to backport that.

F

Ah, okay.

F

Just from a logistical perspective, that's sort of the big chunk that a bunch of other pull requests are waiting on, right?

B

That's right.

F

Yeah, okay. Is there sort of an expected timeline there?

B

It's still hard to say. We just recently got some reshard tests to pass, so we're making really good progress, but it's hard to know all of the bugs that we've yet to uncover. So it needs a lot more testing.

A

All right. Well, are there any other questions before we move along to our birds of a feather?

A

Okay, let's see. Yes, thank you, Casey, for the update and for answering all of our questions.