Ceph CDS Infernalis, 3 Mar 2015

From YouTube: CDS Infernalis (Day 1) -- RGW: Active/Active Arch

Description

Videos from Ceph Developer Summit: Infernalis (Day 1)

03 March 2015

https://wiki.ceph.com/Planning/CDS/Infernalis_(Mar_2015)

A

Alright, the next one here is the active/active architecture discussion for RGW. That's you, Yehuda. Do you want to turn on your camera and take it away?

B

I don't have the camera right now, but I can take it away. You'll just see my silhouette.

A

You could you go all.

B

Right, so yeah, the first one we're talking about is active/active.

B

First, what it means. Let's see what we have currently: the multi-zone, multi-region stuff that we did for Dumpling. Originally, what we designed was something like this. We first said we want to have regions, where each region would have data in it, and in each region you might have multiple zones, where each zone replicates the other ones, basically used for disaster recovery. So you have a master zone within each region and secondary zones that follow it.

B

So you might have two regions, let's say US East and US West. Each has two zones: east-1 and east-2, and then west-1 and west-2. And you need to designate a single zone that's going to be the master of all for the sake of metadata, because it's the one that's going to control all the metadata. Everything metadata-related is going to go through it.

B

But then you can start creating buckets in the secondary region, for example, and the data is going to reside in the secondary, on west-1. And you can create buckets on east-1, and when you access a bucket you go to the zone that holds it: for a west bucket you're going to go to west, for an east bucket to east, etc. So that allowed us to create a single global namespace that is used for both east and west, while still having localized data. But there are several issues with this. First of all, there is some confusion about what a region actually means.

B

Currently a region is just a container for the zones, so the actual data resides in the zones. So there is a mix-up here about where the actual data center is. Take East, for example: if you say the data resides in the East region, you expect it all to be in the East region. But then, for disaster recovery, you want to put the secondary zone in West, so saying that the secondary zone for East resides in the West is kind of confusing. So I think our first decision and suggestion was to rename regions to zone groups, and hopefully that is going to help avoid some of the confusion.

B

The second thing is within a single zone group. Now we don't want to have a single zone that you can write into, because you might have multiple zones and, as we said, zones within a single zone group can be both in the East and in the West. We would want users to be able to write data to and read data from their closest cluster, not necessarily the one that's designated as the master. So the second issue is to be able to do what we call active/active: you have data going to all zones within a single zone group, and then everything is going to be synchronized between them.

C

B

Now, how is that going to be achieved? At the moment we have a single master zone that keeps logs of the changes, so that you can track them. We have a sync agent that goes through these logs and then sends commands to the gateways and says: OK, this object has changed, go fetch that object.

B

So what we suggest is that each zone is going to keep its own logs, so that all of the zones can keep track of all of the other zones within the same zone group.

B

The sync agent will then need to be able to go through all the zones within that zone group and send the commands to each of the relevant zones. But that's another thing. Now, in the bucket index log we need to hold some data about what the source zone was for each entry, because you might have three zones. The change originated in zone 1, zone 2 fetched the change, and then zone 3 looks at the zone 2 log and sees that it has some change.

B

So it needs to know whether it's a change it needs to apply or a change it has already seen. And since zone 3 keeps track of both logs, the log of zone 1 and the log of zone 2, it needs to know whether the change originated in zone 1 or in zone 2, and what the ID for that change was.
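The three-zone scenario just described can be illustrated with a minimal sketch. It assumes, hypothetically, that every log entry carries the originating zone and a change ID assigned by that zone; `LogEntry`, `ZoneSyncState` and their fields are invented for illustration and are not the actual RGW bucket index log format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """One hypothetical log entry (not the real RGW format)."""
    source_zone: str   # zone where the change originated
    change_id: int     # ID assigned by the originating zone
    obj: str           # name of the object that changed


class ZoneSyncState:
    """Remembers which (source_zone, change_id) pairs a zone has applied."""

    def __init__(self, zone):
        self.zone = zone
        self.seen = set()
        self.applied = []

    def process(self, entry):
        """Apply the entry unless it is our own change or already seen."""
        key = (entry.source_zone, entry.change_id)
        if entry.source_zone == self.zone or key in self.seen:
            return False
        self.seen.add(key)
        self.applied.append(entry)
        return True


# A change originates in zone1; zone2 fetches it and logs it again,
# keeping the original source zone and change ID.
change = LogEntry("zone1", 17, "photo.jpg")
zone1_log = [change]
zone2_log = [change]

zone3 = ZoneSyncState("zone3")
results = [zone3.process(e) for e in zone1_log + zone2_log]
print(results)
```

Zone 3 applies the change when it first meets it in the zone 1 log and skips the copy it later finds in the zone 2 log, because both entries carry the same source zone and ID.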

B

So that's one thing. Another issue is what to do with changes that occurred on the same object in different zones, and here we need to add some kind of tie-breaker. The thing is, objects that are written are immutable.

B

Basically, we don't have an issue where we'd have one object that contains part of the data that was written in zone 1 and part that was written in another zone. What you're going to have is that one of them needs to win, and we can probably apply a basic scheme in which we say: OK, let's look at the timestamps of the change, and whoever wrote last is going to be the winner.

B

Now, here we can have some issue. What happens if the timestamps are equal but the change is actually different? So we need to identify, first of all, that the object is actually different, and then also have another tie-breaker. We can probably decide: OK, zone 1 always wins, or, you know, compare zone names, and one of them is going to win. But that's really an edge case.
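As a rough sketch of the conflict rule described here: last writer wins by timestamp, and equal timestamps fall back to a deterministic comparison of zone names. The `ObjectVersion` fields and the choice that the lexicographically smaller zone name wins are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectVersion:
    mtime: float   # write timestamp reported by the gateway
    zone: str      # zone where the write happened
    tag: str       # identifies the content; same tag means same object


def pick_winner(a, b):
    """Return the version every zone should converge on."""
    if a.tag == b.tag:
        return a                                # same object, no conflict
    if a.mtime != b.mtime:
        return a if a.mtime > b.mtime else b    # last writer wins
    # Equal timestamps but different content: compare zone names
    # (assumed rule: the smaller name wins).
    return a if a.zone < b.zone else b


older = ObjectVersion(100.0, "zone2", "tag-a")
newer = ObjectVersion(105.0, "zone1", "tag-b")
tied = ObjectVersion(105.0, "zone2", "tag-c")

print(pick_winner(older, newer).tag)
print(pick_winner(newer, tied).tag)
```

Because the rule depends only on the two versions and not on the order in which a zone sees them, every zone picks the same winner.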

B

That's the second thing.

D

Right, one question on the sync: to make sure you identify which zone the change originated from, is that just a matter of tagging each item in the log with some unique identifier for the gateway or the zone, and maybe the gateway that wrote it?

B

Exactly, yeah. And when we apply a change, we need to say: OK, that's the tag you're going to use for this. We also need to make sure that objects in both zones get the same tag if it's a copy of the same object (I'm not sure that's the case at the moment), so that we can identify whether they wrote the same object or a different one.

D

And in the case where you have a put in one zone and then a delete of the same object in another zone?

B

The timestamp of the delete, yeah.

D

So, which makes sense, if you look at the put, it was...

B

Yeah, the put, I...

D

If you do a put, a delete and a put, it's OK. I mean, if you look at the put and then you look at the delete, you can say: oh, that delete is old, and you can ignore it. But if you look at the delete first and you delete the object, you've forgotten; you no longer have that data about what the version is. And then you look at the put and you're like: oh, that object doesn't exist.

D

So we have to have some sort of window or something where we remember recent deletes.
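A toy sketch of the "remember recent deletes" idea: a per-zone store that keeps a tombstone for each delete for a fixed window, so that an older put arriving afterwards is not mistaken for a new object. `ZoneStore`, its methods and the window length are hypothetical; this is one possible shape for the idea, not how RGW implements it.

```python
class ZoneStore:
    """Toy per-zone object store that remembers recent deletes."""

    def __init__(self, tombstone_window):
        self.window = tombstone_window
        self.objects = {}      # name -> (mtime, data)
        self.tombstones = {}   # name -> mtime of the delete

    def _expire(self, now):
        # Forget deletes that fall outside the window.
        self.tombstones = {n: t for n, t in self.tombstones.items()
                           if now - t <= self.window}

    def apply_put(self, name, mtime, data, now):
        self._expire(now)
        deleted_at = self.tombstones.get(name)
        if deleted_at is not None and deleted_at >= mtime:
            return False   # a newer delete was already applied
        current = self.objects.get(name)
        if current is not None and current[0] >= mtime:
            return False   # we already hold a newer version
        self.objects[name] = (mtime, data)
        self.tombstones.pop(name, None)
        return True

    def apply_delete(self, name, mtime, now):
        self._expire(now)
        current = self.objects.get(name)
        if current is not None and current[0] > mtime:
            return False   # the delete is older than what we hold
        self.objects.pop(name, None)
        self.tombstones[name] = max(mtime, self.tombstones.get(name, mtime))
        return True


store = ZoneStore(tombstone_window=3600)
store.apply_delete("obj", mtime=20, now=21)              # delete arrives first
stale_put = store.apply_put("obj", mtime=10, data=b"old", now=22)
fresh_put = store.apply_put("obj", mtime=30, data=b"new", now=31)
print(stale_put, fresh_put)
```

The put written before the delete is rejected while the tombstone is still remembered, and a genuinely newer put is accepted. Once the window has passed, the tombstone is gone and a late, stale put would be applied, which is why the size of the window matters.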

B

That's a good point. The question is how...

D

How big to make it, yeah. I mean, on the OSDs, in the PG logs, we have this window of a thousand operations or whatever, where we keep track of all the request IDs. That's how we resolve things like idempotent puts and the like. But I don't really know. Maybe the gateways just have to look at the log of recent operations, or just have a window in the log covering the last so many requests or something.

D

Yes, in Swift there's like a tombstone: even when you delete an object, that's recorded forever. Is that true?

B

I don't know, but in object versioning we have something like this. In some cases we have a delete marker, but I don't think we should be doing that, or...

D

Record a tombstone for some defined window at least, I don't know. Yeah, OK, it might not be the most important part, but dealing with it is a point.

B

So yeah, we have that issue. Now, another question is how to handle object versioning. The thing with object versioning is that we need to keep some kind of ordering. In the current scheme, for each object that you manage there is an epoch that is monotonically increasing. So let's say you write an object: it gets epoch 2. Then you rewrite it, epoch 3; write again, epoch 4. But what happens if you write it in two different zones? You have epoch 2 that they share, but then epoch 3, where each one has a different version. And then you need to keep the object versions in order, to be able to list them from the newest to the oldest, which is OK if you're doing it in a single zone.

B

But if you have multiple zones, you actually need to be able to list them not by the epoch but by some kind of timestamp.

B

So the idea here is to replace the current epoch scheme with something that is both a counter and a timestamp, so that it will preserve the ordering and always monotonically increase, but on the other hand we will be able to merge different zones into a single coherent view.

B

And again, we have the issue of what to do with changes that happened at the same timestamp, and I think the fix would be, again, to add to the epoch some kind of unique string per zone, so that they're not going to get the same epoch and one is always going to win if they happen at the same time, which is probably a very...
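One way to picture the proposed replacement for the epoch is a key made of a timestamp, a counter and a per-zone string, compared in that order. The sketch below is an assumption about how such a key could behave; `VersionKey`, `VersionClock` and the integer timestamps are illustrative names and values, not the format RGW adopted.

```python
from dataclasses import dataclass
from heapq import merge


@dataclass(frozen=True, order=True)
class VersionKey:
    timestamp: int   # e.g. microseconds since some reference point
    counter: int     # bumped when the timestamp does not advance
    zone: str        # per-zone string, breaks ties between zones


class VersionClock:
    """Hands out keys that always increase within one zone."""

    def __init__(self, zone):
        self.zone = zone
        self.last = VersionKey(0, 0, zone)

    def next(self, now):
        if now > self.last.timestamp:
            key = VersionKey(now, 0, self.zone)
        else:
            # The clock did not advance: keep the timestamp, bump the counter.
            key = VersionKey(self.last.timestamp,
                             self.last.counter + 1, self.zone)
        self.last = key
        return key


east, west = VersionClock("east"), VersionClock("west")
east_versions = [east.next(100), east.next(100), east.next(105)]
west_versions = [west.next(100), west.next(103)]

# Each zone's list is already ordered, so the lists merge into one view.
merged = list(merge(east_versions, west_versions))
newest_first = merged[::-1]
print(newest_first[0])
```

Within a zone the keys never go backwards, and two zones writing at the same timestamp still get distinct, totally ordered keys, so the per-zone version lists can be merged and listed from newest to oldest.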

D

There's a question in chat from Abhishek: can zone sync be turned on conditionally for a user request?

B

At the moment, no. The use case is that you turn it on for specific buckets, and currently what we have is that you do it for the entire zone, for all the buckets. We discussed it, and we might want to have some kind of configuration where we can turn it on only for a smaller portion of the data and not for all of it.

B

OK, it makes sense in some configurations, yeah.

B

That's mostly work for the sync agent, not quite a gateway-specific issue.

B

I do want to bring up the changes that would be needed in the sync agent. Maybe we need to do some kind of bigger rework to make it easier for us to implement all that stuff. The sync agent will need to be able, as I said earlier, to follow multiple zones and not a single zone, and potentially to be set up so that it does the work for multiple zones, so that if you have five zones you don't need to set up five different sync agents. You just specify: OK, you handle these zones, and it's going to do its magic.
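To make the "one agent, many zones" idea concrete, here is a toy sketch in which a single agent keeps one marker per (source zone, target zone) pair and replays each zone's log to every other zone. Everything here is hypothetical (`MultiZoneSyncAgent`, the log layout, the command tuples), and it assumes each zone's log holds only changes that originated in that zone.

```python
class MultiZoneSyncAgent:
    """Toy agent: one instance handles every zone in a zone group."""

    def __init__(self, zone_logs):
        # zone_logs: zone name -> append-only list of changed object names
        self.zone_logs = zone_logs
        # markers[(source, target)] = index of the next entry to replay
        self.markers = {(s, t): 0
                        for s in zone_logs for t in zone_logs if s != t}

    def run_once(self):
        """Return the fetch commands to send and advance the markers."""
        commands = []
        for (source, target), pos in sorted(self.markers.items()):
            log = self.zone_logs[source]
            for obj in log[pos:]:
                # Tell `target` to fetch `obj` from `source`.
                commands.append((target, source, obj))
            self.markers[(source, target)] = len(log)
        return commands


logs = {"east-1": ["a"], "east-2": [], "west-1": ["b"]}
agent = MultiZoneSyncAgent(logs)
first = agent.run_once()
second = agent.run_once()
print(first)
print(second)
```

The first pass produces one fetch command per changed object and peer zone; the second pass produces nothing, because every marker already points at the end of its log.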

B

Another thing with the sync agent: at the moment, getting it to do failover to a different zone is not the most trivial job. Quite likely, I think, having active/active will solve that issue.

B

Essentially, you're not having a complete failover situation; the scenario that we are talking about doesn't need that specific failover handling, because you're essentially syncing all the time. So that would probably solve it, and I'm not sure whether fixing the current sync agent to do failover easily or just doing the active/active work would be the best way to go.

D

I mean, it seems to me like the place where failover is usually hard is when you have to sort of roll back the stuff that didn't get sent over to the other side when you rolled forward, but in an eventually consistent world that's sort of a non-issue. You just pick the newest thing, right?

D

So it seems to me like making active/active just work would just be: synchronize, make it go, and it would just work, right?

D

Right. If I understand right, then in the case where you're actually running it in a sort of active/passive mode, master/slave mode, all the writes are going to one side and the other side is just replicating.

D

That just means that the logs on site B, that zone, are basically empty, and if you cut over and you have to fail over, then those logs will start to get populated. Now, when the master site comes back up, it will just read those short logs that have only the writes that happened since. I mean, yeah, it seems like it would, yeah.

B

But you need to do it well. What happens if you do it multiple times? Then you need to set the markers correctly, and you also need to know whether there's data... You know, it might be that data on the original master has been written, but you overwrite it with older data from the site you switched to, because the switch wasn't done atomically.

D

That's sort of the reality of eventual consistency, though: you can read something stale and rewrite it. What happens with the metadata, though, like the buckets? You can do the sort of last-writer-wins, simple view of things from the actual object point of view, but on the metadata side, users are updating ACLs, and...

B

So, metadata-wise, I don't think we need to change the current scheme. Metadata-wise, at this point it probably doesn't make sense to have multiple writers.

D

Yeah, OK. So Abhishek asks: what about bucket name uniqueness when two different zones try to create the same bucket name? Is...

B

In this case it's not an issue, because they all go through the same zone, yeah.

D

It's your master zone that does those operations.

B

Yeah, the thing is forwarded to the master zone, so...

D

This is a case, actually, where you do need a defined resynchronize/failover type of thing, because if you are doing that on the master zone, you create buckets that haven't been replicated yet, and then your master fails and you fail over to a different zone, a different user can create the same bucket, and you need a clear rollback of some sort, yes.

B

Yes, for metadata there is that issue, yeah. OK, but that doesn't change, I think, from our current scheme. So that's where we are today.

D

We don't handle that yet, though, right?

D

OK, you're planning to go look at the master log and everything that happened since that didn't get replicated, and roll back those items, or roll forward, or do some careful roll forward or reapply or something, yeah.

B

What we need to do when switching to another zone is take the current state of the zone that is about to be promoted, keep it aside, then turn off everything, switch, and now you're good to go. But then when you switch back, you need to stop everything and apply the changes.

B

The second question here is: in the above case, both users have got success return codes; who will end up losing their bucket? The answer is that they're not both going to get success return codes. Everything metadata-wise goes through the same zone, so only one of them is going to win.
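The answer above can be sketched as follows, assuming a secondary zone forwards bucket creation to the master and never applies it locally first. `MasterZone`, `SecondaryZone` and the use of HTTP-style 200 and 409 status codes are illustrative assumptions, not the gateway's actual code paths.

```python
class MasterZone:
    """Toy master zone: the single place where bucket names are claimed."""

    def __init__(self):
        self.buckets = {}   # bucket name -> owner

    def create_bucket(self, name, owner):
        if name in self.buckets:
            return 409      # conflict: the name is already taken
        self.buckets[name] = owner
        return 200


class SecondaryZone:
    """Toy secondary zone: forwards metadata operations to the master."""

    def __init__(self, master):
        self.master = master

    def create_bucket(self, name, owner):
        return self.master.create_bucket(name, owner)


master = MasterZone()
east, west = SecondaryZone(master), SecondaryZone(master)
codes = [east.create_bucket("photos", "alice"),
         west.create_bucket("photos", "bob")]
print(codes, master.buckets["photos"])
```

Because both requests are serialized at the master, only the first one succeeds and the second user gets an error instead of a success code.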

E

One question: if you use timestamps, what will be the source of the time?

B

The source of time is the local time in...

E

Is it the time you're getting from the client, or is it the gateway's? Yeah. So if you have multiple gateways and have to decide which write was the first one, is there any way to sync the time between the gateways?

B

E

Can give you the graduates, I mean not.

B

E

Of a problem maybe yeah, we.

B

We can actually do some kind of time notification between the gateways if we think that's a big issue, and make sure, yeah.

E

I guess it is an issue. I would guess it's an issue because, for example, if you run VMs, timekeeping is always a problem. And for the monitors, well, for the rest of the cluster, you have some notification when the time drifts too much, but I guess you need something like that added to the gateway, yeah.

B

Yeah, we can notice if the time drift gets too big. So gateways send messages to each other, and we can think of a scheme where they notify each other what the current timestamp is, and everyone checks, and if there is some kind of big drift, then they send some kind of warning.
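The drift check being discussed could look roughly like this, assuming each gateway periodically receives the current timestamp reported by its peers. The function name, the peer names and the threshold are all made up for the example.

```python
def check_drift(local_time, peer_times, max_drift):
    """Return the peers whose reported time is too far from ours."""
    return sorted(peer for peer, reported in peer_times.items()
                  if abs(reported - local_time) > max_drift)


# Timestamps in seconds, as reported by two hypothetical peer gateways.
peers = {"gw-east": 1000.2, "gw-west": 1042.0}
warnings = check_drift(1000.0, peers, max_drift=5.0)
print(warnings)
```

A real scheme would also have to account for the network delay between sending and receiving the timestamp, which this sketch ignores.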

B

OK, the question is, you know, you need to look at what the use case is for that: whether multiple writers write to the same object in the same bucket, whether there's any such scenario. I mean, I assume it's an issue and it's a problem, but it's not the most acute issue.

B

When you have multiple users writing to the same object, only one is going to win, whether user A or user B. Now, if the time drift is really big, then it's a problem, but it's a problem with other things as well.

B

Like, you know, with the time that's reported for the creation of the buckets. But if you're using the S3 RESTful interface, you cannot drift more than 15 minutes, because clients would not be able to sign anything. So 15 minutes, or 15 times 2, 30 minutes, is going to be the largest time drift, which is huge, but, you know, it's not going to be days.

B

OK, any other questions?

B

Yeah, I see one: OK, so this will take more time than other operations, since it uses a WAN link. I'm not sure what it's referring to... oh yeah, bucket creation. That already happens, right: bucket creation on a zone that is not the master is doing just that. It goes over the WAN link when creating the bucket on the master and coming back.

B

Bucket creation, by and large, is not an operation that happens often. Usually it's not part of the I/O path, or not supposed to be at least, so there's some latency there, but...

B

Yeah I'm, nothing to.

B

So that might be acceptable. We can think of some other schemes where, at a higher level, we have different settings in which we have different master zones for different tenants or something like that. So we can say: OK, this tenant is a west tenant, and all the buckets it creates are going to go through the west, and that one is an east tenant. But it's going to go into a higher-level configuration, so you create different setups, and one setup is going to be centered around west and another setup is going to be centered around east.

A

OK, all right, I don't see more questions, so I think we're pretty much set with this one. We can move on.