http://goo.gl/U4b70r
28 October 2014
Ceph Developer Summit: Hammer
Day 1
RGW: Object Versioning (Mike Bryant)
RGW: Snapshots (Craig Lewis)
C: Yeah, with regard to object versioning, I've been implementing it for a few months now. There's been some discussion internally and externally, and there is a wiki page that describes the design we came up with. Originally, when we looked at the feature, there were two different approaches that had previously been implemented: one is the S3 object versioning and the other is the Swift object versioning, and the two are quite distinct. Historically, when we implemented a feature, we made it pretty much agnostic to the actual RESTful API.
C: So there was the internal implementation, the core functionality, and then there was the RESTful API on top. But for this specific feature, the Swift API and the S3 API are so different that we just chose one; we went with the S3 one, and that's what I've been implementing ever since. There were a few requirements we needed to take care of, because the S3 API is quite different from everything we'd been doing up until now. So, first of all:
C
You
need
to
obviously
be
able
to
read
a
specific
object
version
and
remove
a
specific
object
version
now.
One
more
thing
that
we
want
to
avoid
is
to
to
need
to
be
required
to
access
the
bucket
index
for
each
object
trade.
We,
because
if
we
do
that,
then
the
packet
index
becomes
a
bottleneck
which
we
wanted
to
avoid,
and
now
there
is
the
the
entire
f3
API.
C: When you delete an object, it isn't really deleted; an object deletion marker is created instead. And you can suspend versioning on a bucket; once you do that, the versions still exist, but new objects are created with what they call a null version, and if you remove one, a delete marker is created. So all of that had to be supported. So we discussed this.
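The delete-marker and suspended-versioning behavior described here can be sketched as a small in-memory model. This is illustrative only, not RGW code; the "null" version id follows the S3 convention, and everything else (class and method names) is made up:

```python
import itertools

class VersionedBucket:
    """Toy model of the S3 bucket-versioning semantics described above."""
    def __init__(self):
        self.versions = {}   # name -> [(version_id, data, is_delete_marker)], newest first
        self.state = "off"   # "off" | "enabled" | "suspended"
        self._ids = itertools.count(1)

    def _next_vid(self):
        # Under suspended (or never-enabled) versioning, writes land on the
        # single "null" version instead of getting a fresh id.
        return "v%d" % next(self._ids) if self.state == "enabled" else "null"

    def put(self, name, data):
        vid = self._next_vid()
        vers = self.versions.get(name, [])
        # a null-version write overwrites the previous null version in place
        self.versions[name] = [(vid, data, False)] + [v for v in vers if v[0] != vid]
        return vid

    def delete(self, name):
        # A DELETE never erases data outright: it prepends a delete marker.
        vid = self._next_vid()
        vers = self.versions.get(name, [])
        self.versions[name] = [(vid, None, True)] + [v for v in vers if v[0] != vid]
        return vid

    def get(self, name, version_id=None):
        # Reading a specific version bypasses newer delete markers.
        for vid, data, marker in self.versions.get(name, []):
            if version_id in (None, vid):
                if marker:
                    raise KeyError("NoSuchKey (delete marker)")
                return data
        raise KeyError("NoSuchKey")

b = VersionedBucket()
b.state = "enabled"
v1 = b.put("doc", "first")
v2 = b.put("doc", "second")
b.delete("doc")                        # adds a delete marker; old versions survive
assert b.get("doc", v1) == "first"
b.state = "suspended"
b.put("doc", "third")                  # written as the "null" version
assert b.get("doc", "null") == "third"
assert b.get("doc", v2) == "second"    # versions still exist after suspending
```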
C
Didn't
fit
our
current
design
because
we
usually
don't
have
a
cross
object
interaction,
because,
basically,
what
you
need
to
do
is
have
kind
of
a
pointer
to
another
object,
and
if
you
remove
an
object,
you
need
to
change
the
pointer
and
to
to
set
it
on
a
different
object.
But
what
what
happens
if
one
of
the
the
the
operation
failed?
What?
While
we
were
doing
it,
how
to
handle
it?
And
what,
if
you
have
multiple
operations
coming
in
parallel
to
the
same
object
of?
C
How
would
the
architecture
work
work?
With
this?
We
ended
up
coming
up
with
a
solution
that
in
which
we
have
the
bucket
index,
serves
as
as
the
decision
maker.
So
whenever
you
create
create
a
versioned
object,
you
first
go
to
the
packet
index
and
and
basically
say:
ok
I'm
now
create
creating
this
object,
and,
and
the
bucket
index
will
make
sure
that
everything
is
happening
at
the
correct
or
order
and
provided
a
log
board
for
the
Raiders
gateway
to
to
to
to
reclaim
of
the
play.
C
So
it's
so
very
quickly
go
to
to
the
object
this
now
we
call
the
olh
the
object,
logical
hand,
which
is
kind
of
like
soft
link
and
say
ok,
I'm
about
to
check
to
change,
to
make
a
change
to
the
to
this
subject,
and
then
we
go
to
the
back
in
the
mix
say:
okay,
now
we
create
a
new
new
version
of
the
the
object
and
and
and
the
back
of
the
neck
would
say:
okay
go
to
the
o,
LH
and
now
weird
it's
at
this
Pacific
version
make
sure
the
we
are
at
this
specific
version
before
changing
anything
and
and
and
update
the
the
LH
and
32.8
are
at
at
the
object
version
that
was
just
created
and
basically
that
how
it
works
so
create.
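The flow just described can be sketched roughly as follows. This is a toy model, not RGW code: the epoch counter stands in for the bucket index's ordering decision, and all names here are assumptions:

```python
import itertools

class BucketIndex:
    """Toy decision-maker: assigns a total order (epoch) to OLH operations
    and keeps a replayable log, as in the design described above."""
    def __init__(self):
        self._epoch = itertools.count(1)
        self.log = []                       # (epoch, key, op, version_id)

    def prepare_olh_op(self, key, op, version_id):
        epoch = next(self._epoch)           # the index decides the order
        self.log.append((epoch, key, op, version_id))
        return epoch

class OLH:
    """Object logical head: a soft-link-like head pointing at the current
    version, with pending entries for in-flight operations."""
    def __init__(self):
        self.current = None
        self.pending = {}                   # epoch -> (op, version_id)

    def mark_pending(self, epoch, op, version_id):
        self.pending[epoch] = (op, version_id)

    def apply(self, epoch):
        op, vid = self.pending.pop(epoch)
        if op == "link":
            self.current = vid              # point the head at the new version

def put_version(index, olh, key, version_id):
    # 1. tell the bucket index we're creating this version; it picks the order
    epoch = index.prepare_olh_op(key, "link", version_id)
    # 2. record the in-flight change on the OLH
    olh.mark_pending(epoch, "link", version_id)
    # 3. update the OLH to point at the newly written version
    olh.apply(epoch)

idx, head = BucketIndex(), OLH()
put_version(idx, head, "doc", "v1")
put_version(idx, head, "doc", "v2")
assert head.current == "v2"
assert idx.log[0][0] < idx.log[1][0]   # the index imposed a total order
```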
C: Now, about the object versions themselves: these are basically regular objects that are just put in a dedicated namespace, so there is some kind of naming convention marking them as versions of the object. Whenever versioning is turned on on a bucket, every object is created with some kind of version instance appended to its name internally.
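A minimal illustration of such a naming convention; the namespace prefix and separator here are invented for the example, not the actual RGW encoding:

```python
# Hypothetical illustration: the version instance is appended to the object
# name inside a dedicated namespace, so version objects never collide with
# plain object names.
NS = "_versions_"   # the namespace prefix is an assumption

def version_oid(bucket, name, instance):
    # e.g. photos/_versions_/cat.jpg__v17
    return "%s/%s/%s__%s" % (bucket, NS, name, instance)

def parse_version_oid(oid):
    bucket, ns, rest = oid.split("/", 2)
    assert ns == NS
    name, instance = rest.rsplit("__", 1)
    return bucket, name, instance

oid = version_oid("photos", "cat.jpg", "v17")
assert parse_version_oid(oid) == ("photos", "cat.jpg", "v17")
```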
B: A quick question on the S3 versus Swift versioning: as I understand it, the S3 one has a pretty rich API as far as listing versions, rolling back to versions, and all that stuff. On the Swift side it's basically: when you overwrite an object, it first copies it to a different container, and then... yes, yeah.
C
Right
right,
yeah,
I,
honestly,
don't
see
the
value
as
a
splint
and
supporting
any
I
like
it's
more
of
whatever
they're
the
user
requirement,
whether
they
need
versioning.
If
you
have
the
s3
one
and
it
kind
of
encapsulate
everything
you
need,
the
suite
provides,
which
provides
a
specific
API
like
you
need
to
list.
C: We would be able to extend our Swift API to support the S3 object-versioning semantics, right? Yeah, actually. There's not much difference: you just add a field saying, that's the object version I want to read, and it should work fine. OK.
B
And
I
guess
the
second
thing
is
the
I
mean
slipped:
it's
used
as
a
it's,
it's
sort
of
wrapped
up
in
the
way
that
you
do
the
object,
retention
and
object
expiration.
The
same
thing
happens
on
a
spree
right.
So
once
you
have
object
versions
and
eventually
you'll
be
able
to
define
a
policy
it
controls.
I
come
in
and
keep
is.
C: The way I see it, there should be some kind of external agent that would handle everything, probably using some logs that we'll generate, and whether we're dealing with versioned objects or with regular objects isn't going to change much. By the way, saying "external agent" doesn't mean that it's really external;
C: it can run inside the RADOS gateway. OK, yeah. Now, I made some changes to the bucket index. Originally the bucket index was just a plain list of the objects that existed in the bucket; at one point we added some logs into it, all residing in different namespaces. The thing was that, for this specific feature, we needed to be able to list versions embedded in the object listing; we needed to put them in the list of objects.
C
In
in
the
in
the
in
the
same
namespace
on
the
list
of
object
versions
in
the
same
namespace
were
where
the
regular
object
tree
side
we
which
what
wasn't
really
hard
obviously,
but
there
there
is
a
new
notion.
Now
now
there
are
two
types
of
giving
in
the
back
index
ones
that
are
just
use
being
used
as
for
object,
listing
and
in
new
keys
that
are
used
for
actual
object
data
so
for
for
an
object.
Perversion
object
for
each
instance.
C
We'd
have
two
entries
14
just
for
being
being
listed
in
one
for
its
actual
data
and
the
reasons
we
have
to.
If
that
the
version
objects
need
to
be
listed
in
in
order
or
from
the
newer
from
the
newest
to
the
oldest.
So
we
need
to
make
sure
that
that
they're
kind
of
sorted
in
correct
way
and
there's
a
way
to
generate
that.
But
that
makes
it
us.
C
That
we
we
need
to
to
index
them
somehow,
so
we
needed
to
create
a
second
entry
for
each
in
order
to
be
able
to
look,
look
them
up,
and
so
there's
them
the
listing
index
and
the
data
index.
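One way to picture the two-entry scheme: a listing key built so that a plain lexicographic scan returns versions newest-first, plus a data key for direct lookup. The encodings below are illustrative assumptions, not RGW's actual key format:

```python
# Each version instance gets a *listing* key, crafted so a lexicographic
# scan yields newest-to-oldest, and a *data* key for direct access.
MAX = 10**12

def listing_key(name, mtime):
    # invert the timestamp so later versions sort first lexicographically
    return "list/%s/%012d" % (name, MAX - mtime)

def data_key(name, instance):
    return "data/%s/%s" % (name, instance)

index = {}
for instance, mtime in [("v1", 100), ("v2", 250), ("v3", 400)]:
    index[listing_key("doc", mtime)] = instance            # ordered listing
    index[data_key("doc", instance)] = {"mtime": mtime}    # direct lookup

listed = [index[k] for k in sorted(index) if k.startswith("list/doc/")]
assert listed == ["v3", "v2", "v1"]        # newest first, as required
assert index[data_key("doc", "v2")]["mtime"] == 250
```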
C: And then there's the null-version object, which is the one created in a bucket that has versioning suspended. There is a solution for that, but what it means is that for buckets that had objects created before versioning was turned on, once we enable versioning we convert the old entries into new entries, and we mark them as such: where we used to have the original key, we put an entry saying this is not a key anymore.
C
And
then
we
we
return
it,
we
convert
it
into
nu
and
a
new
version,
the
object,
because
we
need
to
keep
it
sorted
in
reverse
order.
So
that's
about
it
think
any
questions.
C: Object versions can be created; object versions get listed in the correct ordering; you can turn versioning on and off, or suspend and enable versioning on a bucket (there are some bugs in that area, but it's mostly working correctly); and you can remove versions and it will roll back to the previous version, which is kind of the main theme here. It mostly works. What's missing: there are two major things. First, there is no timing out...
C
Out
of
you
know
when
marking
olh
and
say:
ok
we're
now
about
to
modify
it
and
then
later
on.
If
you
read
the
old
age,
we
need
to
look
at
it
and
say:
if
there
is
some
kind
of
an
operation
in
progress,
we
need
to
go
to
the
back
into
index
and
in
quick
query
eight
and
potentially
play
the
changes,
keeping
because
the
original
client
that
made
the
changes
might
have
crashed.
C
So
we
do
that
that
we
don't
timeout.
We
don't
like.
We
need
to
make
sure
that
if
it's
an
it's
an
oil
change,
we
remove
it
so
I
didn't
do
that
and
there's
the
whole
multi-region
multi-zone
stuff
that
I
haven't
looked
at
it
yet
so
yeah.
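The missing timeout logic could look something like this sketch; the threshold, class names, and repair rules are assumptions, not the eventual RGW implementation:

```python
STALE_AFTER = 60.0   # seconds; an assumed threshold, not a real RGW constant

class Index:
    """Stub bucket index: knows which epochs it actually committed."""
    def __init__(self, committed):
        self._committed = committed
    def completed(self, epoch):
        return epoch in self._committed

class OLH:
    def __init__(self):
        self.current = None
        self.pending = {}            # epoch -> (op, version_id, start_time)

def read_olh(olh, index, now):
    # Reader-side repair: replay pending entries the bucket index committed,
    # and drop entries that sat pending too long (the writer likely crashed).
    for epoch in sorted(list(olh.pending)):
        op, vid, started = olh.pending[epoch]
        if index.completed(epoch):
            if op == "link":
                olh.current = vid    # replay the committed change
            del olh.pending[epoch]
        elif now - started > STALE_AFTER:
            del olh.pending[epoch]   # stale entry: discard it
    return olh.current

olh = OLH()
olh.pending = {1: ("link", "v1", 0.0), 2: ("link", "v2", 5.0)}
# epoch 1 was committed in the index; epoch 2's writer crashed before logging
assert read_olh(olh, Index({1}), now=120.0) == "v1"
assert olh.pending == {}             # the stale epoch-2 entry was dropped
```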
C: Yeah, at this moment I'm testing everything manually, but I need to create some kind of test suite. I created a new S3 client just for the sake of versioning, because I wasn't very happy with the other clients. It's a Python client, pretty simple; it mainly reflects what boto provides. You can do most of the basic things: create objects, remove objects, list objects, create buckets, enable versioning...
C
Listing
versions
and
everything
not
sure
if
that's
going
to
be
the
as
the
base
for
a
new
test
suite
or
will
be,
I
will
be
using
the
will
be
using
the
s3
tests,
which
the
existing
one
the
problem
with
existing
one
is,
it
doesn't
handle
versioning
nicely.
Oh,
if
you
create
version
bucket,
it
will
know
how
to
how
to
remove
it,
because
there
might
be
some
object,
fighting
their
little
version,
but
that
that
can
probably
fix
easily,
but
in
a
more
higher
level.
C
Question
is
whether
to
create
how
to
create
the
to
generate
all
those
test
cases.
Whether
doing
it
again
is
some
kind
of
a
Python
code
or
doing
it
in
the
more
higher
level
or,
like
you
know,
creating
scrapes
the
trend
that
rut
run
run
the
the
nuestra,
client
or
I'm,
not
sure
prob,
probably
going
with
yes,
we
test.
This
is
a
way
to
live,
though
yeah.
C: I never figured... yeah, but we asked him to reopen it and we discussed it, because there was originally a pull request sent for a branch where it was implemented.
B
Will
just
I
guess
a
stupid
question?
Does
it
doesn't
even
make
sense
to
have
like
a
bucket
snapshot
contact
concept
if
we're
doing
sort
of
fine-grained
object,
versioning
or
would
it
make
more
sense
to
to
if
there
is
like
a
idea
of
snapshotting
a
whole
bucket
set
its
it's
integrated,
with
the
way
that
the
versioning
is
happening.
C
Quite
well,
you
know
snapshot
can
give
you
a
few
specific
view.
It's
one
point
on
the
bucket
yeah
all
right
object.
You
can
give
you
a
at
an
object
level,
not
not
in
the
entire
back
to
travel
right.
One
point
where
packet
snapshots
may
be
useful:
this
for
doing
a
point
in
time,
Ripley
zone
the
application
mm-hmm.
Basically
you
take
a
snapshot.
C
A
C
Correctly
think
about
it
for
a
second
yeah.
B
C
Yeah
yeah,
it's
it's
doing
that
already
just
a
in
it'll
constrain
the
thief.
So
basically,
currently
what?
What
happens
is
that
if
you
have,
if
they
think
agent
needs
a
bit
behind
and
there
are
newer
objects,
so
it
might
be
that
some
of
the
objects
will
will
be
really
new
and
some
will
be
much
older
because
these
it
hasn't
gone
through
the
cycle,
and-
and
this
is
gonna
kind
of
limit
limited
to
all.
The
objects
are
not
going
to
be
newer
than
this
point.
C: At the zone, when doing the replication. I'm not sure if that's what Craig had in mind when he wrote this feature up, and I know that originally we thought about snapshots on buckets for implementing some kind of versioning, but that was a long time ago. Sorry.
C
Yeah
and
well
yeah
I
mean.
A
B
B: So, OK, if we did this with self-managed snaps, it's leveraging the RADOS snapshot stuff. When you take a cluster-wide snapshot, a zone-wide snapshot, I guess the key is that it would work the same way that RBD does, where you would have to send a notify to all the gateways that says: there is now this snapshot, and you start tagging all of your writes accordingly, so yeah.
B: I mean, kind of. You would allocate the snap IDs that you're going to use for the snapshot, and then you would have to sort of tell the gateways to start using them all at once. Right, though a snapshot isn't anything until the client starts using it. Right, right. Oh, oh, it's like a...
C: Yeah, well, it's a solvable issue, yeah, OK.
C: You look at it later on: if the bucket is visited later on and there's some kind of pending entry there, then...
C
But
but
II
you'll
you'll
need
to
disable
all
those
things
for
be
cut
because
then
you
go
and
update
the
packet
index
and
you
don't
wanna
go
up
update
the
backing
index
when
you
in
a
snapshot.
C
A
B
C
Think
we'd
learn
you'd
want
to
do
anything:
okay,
yeah
at
the
time
we
spoke
about
upper
bracket
for
for
implementing,
versioning
and
doing
all
sorts
of
crazy
stuff,
but
it
wouldn't
have
failed.
But
if
you
take
the
global
approach
and.
B: Yeah, OK. Let's see, this is coming after object versioning, probably.
C: We kind of tied it to the next release, I think, or I don't remember how we set it up, but you can control per-object expiration, and then there is some kind of async process that goes and looks at the objects that need to be expired and sends them to the trash, or something like that. And probably, for objects that have been expired, when you try to read them you either succeed or don't succeed, with some kind of logic.
C
There
we
got
to
s3
bucket
lifecycle,
that's
how
it's
cold,
you
do
it
the
bucket.
So
you
say:
okay
on
this
bucket
or
objects
will
be
expired
after
this
amount
of
time
or
all
objects,
its
corresponding
to
this
filter.
All
objects
start
with
the
letter,
A
can
be
can
be.
Expired,
will
be
expired
or
will
be
deleted
that
the
bucket
is
version,
so
the
version
is
going
to
be
created
for
them.
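A toy version of such a lifecycle rule, a prefix filter plus an expiration age. This is purely illustrative; the real S3 lifecycle configuration is an XML document with richer options:

```python
# Sketch of an S3-style lifecycle rule as described: objects matching a
# name prefix expire once they are older than a given number of days.
# On a versioned bucket, "expiring" would stack a delete marker rather
# than erase data.
def expired_keys(objects, prefix, days, today):
    """objects: dict of name -> creation day (an illustrative representation)."""
    return sorted(name for name, created in objects.items()
                  if name.startswith(prefix) and today - created >= days)

objects = {"a-report": 1, "a-draft": 9, "b-notes": 1}
# rule: objects whose name starts with "a" expire after 7 days
assert expired_keys(objects, "a", 7, today=10) == ["a-report"]
```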
C
Oh
they're
gonna
be
trans,
this
transition
into
a
secondary
storage,
but
that's
a
per
bucket,
so
there
needs
to
be
some
kind
of
a
a
whole
bucket
process,
something
that
looks
it
at
the
bucket
and
knows
how
to
to
to
do
those
things
on
the
entire
packet
in
and
to
it
to
to
add
some
complexity.
You
can
turn
it
off
and
on
again
and
once
you
turn
it
off,
it
shouldn't
work
and
if
you
turn
it
on
or
or
it
puts
put
a
new
configuration,
then
then
the
new
stuff
needs
to
work.
C
On
a
bucket
so
yeah,
so
it
sounds
like
a
great
fun.
C
C
C: I'm not sure if it's a different bucket; I think it is a different bucket, but yeah. Right, it's either reduced redundancy, or Amazon has that Glacier thing, Amazon Glacier, you know.
C: Yeah, I think it really... I think a bucket, but I'm not a hundred percent sure; I need to check that. OK, it would be impossible for us to implement something that's not moving it to a different bucket... oh, maybe not, because now we have the OLH, so yeah.
B: Wiring up the reduced-redundancy S3 APIs, though: we can already, obviously, have multiple RADOS pools that back the zone, and you can select which one, sort of, I guess, similar to the way the zone does placement policies, yeah. Well, we could map the reduced-redundancy one to a specific different RADOS pool or whatever, right, to provide that same API.
C
Yeah
we
at
the
moment
you
cannot,
you
can
already
create
multiple
bucket
placements
or
possible,
Iceland,
don't
remember
ice
cold
and
you
can
create
a
bucket
in
a
specific,
with
a
specific
placement
target
using
the
s3
API
other
though
we
have
a
way
to
specify
it,
but
it's
not
not
produced
to
redundancy
API.
It's
just
like
we
say
all
right,
depending
on
how
how
the
admin
named
named
the
bat
bucket
place
on
target,
but
yeah,
we
can
write
up
to
the
reduced
redundancy
API
and
should
just
work.
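Such a wiring could be as simple as keying off the standard `x-amz-storage-class` header. The placement-target names below are hypothetical admin-chosen names, not defaults shipped by RGW:

```python
# Sketch of mapping the S3 storage-class header onto RGW placement targets,
# as suggested above. Only the mapping idea is the point; the target names
# are made up.
PLACEMENT_BY_STORAGE_CLASS = {
    "STANDARD": "default-placement",
    "REDUCED_REDUNDANCY": "rr-placement",   # backed by a cheaper RADOS pool
}

def placement_for(headers):
    # S3 clients send x-amz-storage-class on PUT; absent means STANDARD
    sc = headers.get("x-amz-storage-class", "STANDARD")
    return PLACEMENT_BY_STORAGE_CLASS.get(sc, "default-placement")

assert placement_for({}) == "default-placement"
assert placement_for({"x-amz-storage-class": "REDUCED_REDUNDANCY"}) == "rr-placement"
```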
C: We already do that, OK, OK. And we can have... we already have that: we have a way to give a user a default placement policy, so that each user's buckets will be created in a different policy, and you can limit a user to only use specific policies.
C: At the time I sent a few emails where there were questions, and I explained how to use it, but...
C: Well, currently we just... we don't use a header; it's part of the create-bucket request, there's some kind of field there. But yeah, you can wire that to the bucket placement target and it will just work. For reduced redundancy, we need to make it so that if you specify reduced redundancy, it goes to a specific one. Yeah, we'd just need to define it somehow, nicely, and say which one is the reduced-redundancy one.