Description
Casey Bodley walks us through RGW Multisite, particularly the replication path.
Welcome everybody, I'm going to be doing a code walkthrough of multisite, and I'm going to be focusing mostly on the replication side, though I'll start by talking through some high-level stuff about how the log-based replication works: what the different multisite logs are, what the threading model of sync is, and how the coroutines work. Then I'll get into some of the code for metadata and data sync.
So, to start out, this is log-based replication. If we're talking about two different clusters in different parts of the world, we're going to put a zone on each one of them and link them together, so that when Zone A makes a change, it writes that change to a log locally, and Zone B will be reading the logs from Zone A and applying each change that it sees.
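
To make that shape concrete, here's a minimal sketch of the pull model with entirely hypothetical names (fetch_log_entries and apply_change are stand-ins, not RGW's API): the source zone appends to its log, and the replicating zone polls it, applies what it sees, and remembers how far it got.

```cpp
// Hypothetical sketch of pull-based log replication; not RGW code.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct LogEntry {
  uint64_t id;         // position in the source zone's log
  std::string change;  // description of the change to apply
};

// Stand-in for a REST call to the source zone: return entries after `marker`.
std::vector<LogEntry> fetch_log_entries(uint64_t marker) {
  static std::vector<LogEntry> log = {
      {1, "create bucket b1"}, {2, "put b1/obj1"}, {3, "put b1/obj2"}};
  std::vector<LogEntry> out;
  for (const auto& e : log)
    if (e.id > marker) out.push_back(e);
  return out;
}

void apply_change(const LogEntry& e) {  // apply locally on Zone B
  std::cout << "applying: " << e.change << '\n';
}

int main() {
  uint64_t marker = 0;  // persisted sync position in a real system
  for (const auto& e : fetch_log_entries(marker)) {
    apply_change(e);
    marker = e.id;  // only advance after the change is applied
  }
  std::cout << "synced through marker " << marker << '\n';
}
```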
There's the bucket index log, which is stored inside the bucket index and records all of the changes to objects, and then there's another log, the data changes log, which records which buckets and their shards have changes in them, because you can't really poll every single shard of every single bucket. So we poll the data changes log to figure out which buckets have changes, so that we can spawn sync on just the buckets that we need to focus on. Each of these multisite logs is sharded across multiple RADOS objects.
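
Since each log is sharded, writers need a deterministic way to pick a shard object. A minimal sketch of that idea (the object-name prefix and the hashing here are illustrative, not RGW's exact scheme):

```cpp
// Illustrative shard selection for a sharded log; not RGW's exact scheme.
#include <functional>
#include <iostream>
#include <string>

// Map a bucket-shard key to one of num_shards log objects by hashing,
// so all writers agree on which RADOS object records a given key.
std::string data_log_oid(const std::string& bucket_shard_key, int num_shards) {
  size_t h = std::hash<std::string>{}(bucket_shard_key);
  return "data_log." + std::to_string(h % num_shards);
}

int main() {
  // Entries for the same bucket shard always land on the same log shard.
  std::cout << data_log_oid("mybucket:shard-3", 128) << '\n';
  std::cout << data_log_oid("otherbucket:shard-0", 128) << '\n';
}
```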
So a stack is just a list of coroutines, where it runs each one to completion before starting on the next one, and then there's a coroutines manager here, which acts as a scheduler of the coroutine stacks. It keeps a list of stacks that are ready to run, loops through them, and calls their operate functions. So the coroutines manager just has a run function, which is synchronous: it just transfers control and runs all of the coroutines until everything completes.
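
Here's a toy model of that structure, just to show the relationships (none of this is the real RGWCoroutinesManager; it only illustrates stacks running their coroutines in order while the manager round-robins across stacks):

```cpp
// Toy model of coroutine stacks and a manager; not the real RGW classes.
#include <deque>
#include <iostream>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Coroutine {
  virtual ~Coroutine() = default;
  virtual bool operate() = 0;  // return false when complete
};

// A stack runs each of its coroutines to completion before the next one.
struct Stack {
  std::deque<std::unique_ptr<Coroutine>> crs;
  bool operate() {
    if (crs.empty()) return false;
    if (!crs.front()->operate()) crs.pop_front();  // finished, start next
    return !crs.empty();
  }
};

// The manager loops over ready stacks, calling their operate functions.
struct Manager {
  std::vector<Stack> stacks;
  void run() {  // synchronous: returns when everything has completed
    bool any = true;
    while (any) {
      any = false;
      for (auto& s : stacks) any |= s.operate();
    }
  }
};

struct PrintN : Coroutine {
  std::string name; int n;
  PrintN(std::string name, int n) : name(std::move(name)), n(n) {}
  bool operate() override {  // one "step" per call, yielding in between
    std::cout << name << " step " << n << '\n';
    return --n > 0;
  }
};

int main() {
  Manager m;
  m.stacks.resize(2);
  m.stacks[0].crs.push_back(std::make_unique<PrintN>("a", 2));
  m.stacks[0].crs.push_back(std::make_unique<PrintN>("b", 1));  // after "a"
  m.stacks[1].crs.push_back(std::make_unique<PrintN>("c", 3));  // interleaved
  m.run();
}
```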
And so here we have an example of a coroutine's operate function. This function just keeps getting called by the scheduler, and we have this reenter macro, which is implemented as a switch statement under the hood. Basically it means that each time you yield, the next time you get called you'll resume from where you yielded. That's what the reenter is for.
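
To illustrate the switch trick, here's a stripped-down REENTER/YIELD pair in the style of boost::asio's stackless coroutines, which this framework resembles; these are not RGW's actual macros:

```cpp
// Sketch of a switch-based reenter/yield; not RGW's actual macros.
#include <iostream>

#define REENTER(cr) switch ((cr)->state) { case 0:
// Save the current line as the resume point, suspend, and place a case
// label so the next call to operate() jumps right back here.
#define YIELD(cr) \
  do { (cr)->state = __LINE__; return true; case __LINE__:; } while (0)
#define REENTER_END }

struct Counter {
  int state = 0;  // resume point; 0 means "start from the top"
  int i = 0;      // a member, since locals can't live across yields
  bool operate() {
    REENTER(this)
      for (i = 0; i < 3; ++i) {
        std::cout << "step " << i << '\n';
        YIELD(this);  // suspend after each step
      }
    REENTER_END
    return false;  // fell off the end: complete
  }
};

int main() {
  Counter c;
  while (c.operate()) {}  // the "scheduler": keep calling until done
}
```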
So here we use yield call, which means that this coroutine will suspend and will only be resumed once the call completes, and because it's synchronous we can check its return code here to see whether it succeeded or failed. If it failed, we set the error state, which will tell the scheduler not to call our operate anymore, and it'll unwind. Similarly, setting the state to done means that we're done processing and we won't run anymore.
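
Hand-expanding that pattern gives roughly the following shape (stand-in names throughout; in the real code, call() suspends on an asynchronous child coroutine, and the error/done flags correspond to set_cr_error and set_cr_done):

```cpp
// Hand-expanded sketch of "yield call, then check retcode"; stand-in names.
#include <iostream>

struct LeaseTaker {
  int state = 0;
  int retcode = 0;
  bool error = false, done = false;  // like set_cr_error() / set_cr_done()

  // Stand-in for the child coroutine we "call"; a real one is asynchronous
  // and fills in retcode when it completes.
  static int take_lock() { return 0; /* 0 = success, <0 = errno-style */ }

  // The scheduler stops calling operate() once error or done is set.
  bool operate() {
    switch (state) {
    case 0:
      state = 1;
      retcode = take_lock();  // "yield call(...)"
      return true;            // suspend until the call completes
    case 1:
      if (retcode < 0) {
        error = true;         // unwind: scheduler won't call us again
        return false;
      }
      done = true;            // success: finished processing
      return false;
    }
    return false;
  }
};

int main() {
  LeaseTaker cr;
  while (!cr.error && !cr.done) cr.operate();
  std::cout << (cr.error ? "failed\n" : "completed\n");
}
```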
But the interesting thing about the continuous lease is that it's a while loop. So we'll take the lock for a given interval, which is, I believe, something like 60 seconds maybe, and we just lock, wait for half of that interval, then renew the lock. The idea is that, as long as the continuous lease CR is running, we'll take the lease and keep renewing it, and if it ever fails, then we'll set that we're not locked, and it's up to the calling coroutine to detect that and stop working.
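
A sketch of that renewal loop under simplifying assumptions (hypothetical names; the real continuous lease coroutine does this with timed lock operations on a RADOS object, not a thread and a sleep):

```cpp
// Hypothetical sketch of a continuous lease loop; not RGWContinuousLeaseCR.
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<bool> locked{false};  // the caller polls this before working

// Stand-in for a timed lock on a shared RADOS object: the lock expires
// after the given duration unless renewed. Here it "fails" after a few
// renewals to show the unlock path.
bool take_or_renew_lock(std::chrono::milliseconds /*duration*/) {
  static int renewals_left = 3;
  return renewals_left-- > 0;
}

void continuous_lease(std::chrono::milliseconds interval) {
  while (take_or_renew_lock(interval)) {
    locked = true;
    // Renew at half the interval so it can't expire while we hold it.
    std::this_thread::sleep_for(interval / 2);
  }
  locked = false;  // lost the lease: the caller must detect this and stop
}

int main() {
  std::thread t(continuous_lease, std::chrono::milliseconds(100));
  t.join();
  std::cout << "lease held: " << std::boolalpha << locked.load() << '\n';
}
```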
And the first thing that we do is take the continuous lease, so we allocate the coroutine here and we call spawn to spawn it in a separate stack, and we keep track of that stack for reference counting here. And there's this while loop that basically waits until it succeeds in getting the lock. If it fails, we'll set done here and we'll return the error.
So we have a marker tracker here, which basically keeps track of the entries that we're trying to sync and the entries that have succeeded. The goal is to make sure that we update the sync status marker to reflect the completed progress that we've made. So we get into a do-while loop here; each time we make sure that we still have the lease, otherwise we'll drop out, and then we list some OMAP keys here. We're listing the full sync map, so metadata full sync.
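
The bookkeeping the marker tracker does can be sketched like this (simplified and hypothetical; the real tracker also batches its status writes): entries complete out of order, but the stored marker only advances past positions with nothing pending below them, so a restart can re-sync entries but never skip them.

```cpp
// Simplified sketch of a sync-shard marker tracker; not the real class.
#include <iostream>
#include <set>
#include <string>

class MarkerTracker {
  std::set<std::string> pending;    // spawned, not yet finished
  std::set<std::string> completed;  // finished, not yet reflected in stored
  std::string stored;               // durable position: all <= stored done

public:
  void start(const std::string& m) { pending.insert(m); }

  void finish(const std::string& m) {
    pending.erase(m);
    completed.insert(m);
    // Persist the highest completed marker below the lowest pending one;
    // anything past that might still fail and would be skipped on restart.
    while (!completed.empty() &&
           (pending.empty() || *completed.begin() < *pending.begin())) {
      stored = *completed.begin();  // RGW writes this to the status object
      completed.erase(completed.begin());
    }
    std::cout << "finish " << m << ": stored marker = " << stored << '\n';
  }
};

int main() {
  MarkerTracker t;
  t.start("001"); t.start("002"); t.start("003");
  t.finish("002");  // 001 still pending: can't advance yet
  t.finish("001");  // now safe to advance to 002 (003 still pending)
  t.finish("003");  // everything done: advance to 003
}
```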
So the yield with a block wraps a call to spawn, so it's very similar to just saying yield spawn, except it gives you a scope where you can have local variables. Because the reenter macro is implemented as a switch statement, it's complicated to have local variables: they can't cross cases in a switch statement, so you need a scope like this any time you need local variables.
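
Here's the constraint in miniature, with no RGW machinery at all: jumping over an initialized local in a switch is ill-formed, so the coroutine body wraps its locals in a braced scope that closes before the next resume label.

```cpp
// Why locals can't cross the resume points of a switch-based coroutine.
#include <iostream>
#include <string>

void step(int state) {
  switch (state) {
  case 0: {
    // Locals are fine inside a braced scope that ends before the next
    // case label; this is what the yield-with-a-block gives you.
    std::string msg = "spawning child";  // hypothetical local
    std::cout << msg << '\n';
  }  // msg is destroyed here, before control can ever jump past it
    break;
  case 1:
    // If `msg` were declared un-scoped in case 0, jumping here would
    // bypass its initialization and the compiler would reject it.
    std::cout << "resumed after yield\n";
    break;
  }
}

int main() {
  step(0);
  step(1);
}
```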
I noticed earlier in this that there was a define of a magic number of 100 OMAP keys, which didn't seem like a great number, and that's hard-coded. Could we maybe do a sweep through at some point and lift magic numbers like that? Yeah.
We create a marker tracker, which will update the incremental sync marker as we make progress, and we resume from the marker position that we have stored in our local sync marker variable. So the loop for incremental sync looks fairly similar, except that, instead of reading from OMAP, we are reading from the remote zone's metadata log, and here we use the clone meta log coroutine to read that log listing from the remote and store it locally.
If we didn't store the metadata log, then we wouldn't be able to serve metadata sync to other zones in the event that we're promoted. So we clone a list of metadata log entries, and we'll read from that in a loop and process them similarly to full sync, where we use the meta sync single entry CR.
Okay, and so we're running a lot of these meta sync single-entry coroutines in parallel; we're tracking their markers and looking for their completions. If everything succeeds, then we can update our marker position to reflect how far we got, and we'll just keep looping over the clone metadata log, read metadata log, and meta sync single entry loops.
Similarly to metadata sync, we're still doing a full sync and an incremental sync. Full sync is going to build a list of all of the bucket shards, and it will spawn bucket sync on each one of those, and incremental sync will be watching the data log and spawning bucket sync on bucket shards that have changes.
So for full sync, I think the structure is very similar to metadata sync: the list is stored in OMAP. We read a batch of keys, loop over them, and spawn a data sync single-entry coroutine for each, and here we enforce a spawn window. So if we've spawned more than that many, we'll wait for the next one to complete and collect its result.
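
The spawn-window idea looks roughly like this (a hypothetical sketch using futures as stand-ins for child coroutines): cap how many per-entry syncs are in flight, and collect the oldest before spawning more.

```cpp
// Hypothetical sketch of a spawn window over child operations.
#include <deque>
#include <future>
#include <iostream>

int sync_single_entry(int key) {  // stand-in for the per-entry coroutine
  return key % 7 ? 0 : -5;        // pretend some entries fail
}

int main() {
  const size_t spawn_window = 4;  // max children in flight at once
  std::deque<std::future<int>> inflight;

  for (int key = 0; key < 16; ++key) {
    if (inflight.size() >= spawn_window) {
      // Window full: wait for the oldest child and collect its result.
      int r = inflight.front().get();
      inflight.pop_front();
      if (r < 0) std::cout << "entry failed: r=" << r << '\n';
    }
    inflight.push_back(std::async(std::launch::async, sync_single_entry, key));
  }
  for (auto& f : inflight) f.get();  // drain the remainder at the end
}
```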
There's also this error repo that we have here, where we'll store bucket shards that failed to sync in the past, and we will retry batches of these as we make progress in incremental sync. This is different from metadata sync, because metadata sync will block when it hits an error and keep retrying, but for data sync we want to keep trying all of the buckets in the data log, and if something fails, we don't want to stop progress; we just want to make sure that we try it again later.
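
A sketch of the error-repo idea (hypothetical; the real repo is persisted, e.g. as OMAP keys, rather than an in-memory set): record failures and move on, then feed batches of them back in for retry.

```cpp
// Hypothetical error repo: record failed shards, retry them in batches.
#include <iostream>
#include <set>
#include <string>
#include <vector>

std::set<std::string> error_repo;  // persisted in OMAP in the real system

bool sync_bucket_shard(const std::string& shard) {
  static int calls = 0;
  return ++calls % 3 != 0;  // pretend every third attempt fails
}

void process(const std::vector<std::string>& shards) {
  for (const auto& s : shards) {
    if (sync_bucket_shard(s)) {
      error_repo.erase(s);   // success clears any earlier failure
    } else {
      error_repo.insert(s);  // don't block: note it and keep going
      std::cout << "deferred " << s << " for retry\n";
    }
  }
}

int main() {
  process({"b1:0", "b1:1", "b2:0", "b2:1"});
  // Later, as incremental sync makes progress, retry a batch of failures.
  std::vector<std::string> retry(error_repo.begin(), error_repo.end());
  process(retry);
}
```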
Timestamps: so is there a necessity that the zones should have similar timestamps? Like, for example, what happens if the other zone's clock is not in sync and it's at a more advanced timestamp? Do we resolve that with version numbers, then?
Yeah, so if you upload a multipart upload to Zone A, for instance, Zone B is just going to see a single bucket index log entry for that entire upload, and we'll just use a GET request to fetch the whole thing. So if you GET a multipart object, you'll still just get the entire contents in a single body, so we replicate multipart objects as a single object, and we'll store them as a single object in the target zone.
The bucket entry point uses a RADOS object named after the bucket itself. So if you get a request for a bucket, you look up its entry point first, and the entry point will say what the current instance of that bucket is. That points you to the bucket instance metadata to actually get its attributes.
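
The two-step lookup can be sketched like this (hypothetical in-memory structures and example keys; the real entry points and instances are RADOS objects): the entry point only names the current instance, which is what lets a bucket be deleted and recreated without reusing stale metadata.

```cpp
// Hypothetical sketch of the entry-point -> bucket-instance indirection.
#include <iostream>
#include <map>
#include <string>

struct BucketInstance {  // the bucket's actual attributes
  std::string owner;
  int num_shards;
};

// Entry points are keyed by bucket name and only say which instance is
// current, so a recreated bucket gets a fresh instance.
std::map<std::string, std::string> entry_points = {
    {"photos", "photos:zonegroup1.4133"}};
std::map<std::string, BucketInstance> instances = {
    {"photos:zonegroup1.4133", {"casey", 16}}};

int main() {
  const std::string bucket = "photos";
  auto ep = entry_points.at(bucket);  // step 1: name -> current instance id
  auto info = instances.at(ep);       // step 2: instance -> attributes
  std::cout << bucket << " -> " << ep << " (owner " << info.owner
            << ", " << info.num_shards << " shards)\n";
}
```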
Okay, so on the replication side, there's a config variable called rgw_run_sync_thread, and if you set that to true, then we'll create the threads to run data sync. By default that's on, so every gateway in the zone would be running these threads and trying to get leases to run the processing. The idea is that the leases just help spread the work across all of those gateways. And then on the write side, every write would end up generating a log entry to replicate.
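
Assuming the option in question is rgw_run_sync_thread, a gateway can be excluded from sync work with a ceph.conf entry like:

```ini
[client.rgw.gateway1]
# this gateway serves clients only; other gateways in the zone run sync
rgw_run_sync_thread = false
```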