Ceph Code Walkthrough, 23 Oct 2018

From YouTube: 2018-10-23 :: Ceph Code walk-through: Consistency with OSD Peering

Description

Presented by Brad Hubbard

Every month the Ceph Developer Community meets to discuss one aspect of Ceph code, to spread knowledge of how it works and why it works that way.

This monthly meeting will occur on the last Tuesday of every month via our BlueJeans teleconferencing system. Each month we alternate meeting times to ensure that all time zones have the opportunity to participate.

http://tracker.ceph.com/projects/ceph/wiki/Code_Walkthroughs

A

Okay, hi everyone. We might get started, since we've got a lot to cover; other people can join or catch up on the video later on.

A

Okay, today we're going to try and cover peering. Peering is the process by which all the OSDs reach agreement about the state of all the objects, and the metadata of those objects, residing in the PGs that they're responsible for.

A

The process is driven primarily, almost exclusively, by the primary OSD, and it's implemented as a Boost Statechart state machine. So it's a state machine implemented using the Boost Statechart library, and that state machine is embedded in each PG. So for every PG there will be a state machine.

A

So, just before we go any further, I'll just mention a few docs that are worth looking at. The first one is the one that's currently on the screen, which I hope you can all see, and that's talking about last_epoch_started. The next one I'd recommend reading is Sage's thesis, specifically section 6.3, where he covers a lot of the concepts involved.

A

Then there's the Boost Statechart library, because the peering process and some of the associated processes, such as recovery, are implemented as a state machine.

A

It's a great idea to be able to understand how the state machine works to be able to navigate your way through the code. This tutorial is quite good at introducing you to the concepts of the state machine.

A

The way that the Statechart state machines work is you have a series of states, and then the transitions are implemented either as a transition, which can be called explicitly, or by posting events within the state machine, which are interpreted by various states and can be interpreted by multiple states. But we'll go into that more as we go through the code. The last document is the peering document.

A

I'd recommend looking through that as well. I'll try and include these links in the description of the video somehow, or in a comment under the video. At the bottom of the peering document, you'll see this graph, which is a representation of the state machine and the flows within the state machine.

A

It's not very useful when you look at it as part of this document, but you can download it and zoom; because it's an SVG, it zooms really nicely. But basically, if we look at it here, that's the entire state machine.

A

We're not interested in all of it. We're primarily interested in the role of the primary and, probably more specifically for today, this peering section, so I'll attempt to cover that. But we may transition into the activate section as well, because part of that's relevant.

A

A

When we enter the recovery machine, we go straight to an initial state, which is this state over here. So that's the default state of the recovery machine.

A

So if we have a look at the recovery machine in the code. Can everyone see this okay, or do I need to increase the font size?

B

It looks fine, but you could just increase it by one more size. Yeah, that's better. Thanks.

A

Cool, so here we see the definition of the recovery machine, which is a Boost Statechart state machine.

A

Boost Statechart uses the curiously recurring template pattern quite extensively, so quite often when you're implementing a derived structure or class, the first template argument is the actual derived class itself.
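
A minimal sketch of the curiously recurring template pattern mentioned here, written against the standard library only rather than Boost Statechart. StateBase, describe and name are hypothetical names used for illustration; only the state names Initial and Reset come from the talk.

```cpp
#include <cassert>
#include <string>

// Base template that receives the derived class as its template argument.
// This is the "first template argument is the derived class itself" shape.
template <typename Derived>
struct StateBase {
  // Calls into the derived class without any virtual dispatch.
  std::string describe() const {
    return "state: " + static_cast<const Derived*>(this)->name();
  }
};

// Each illustrative state passes itself to the base template.
struct Initial : StateBase<Initial> {
  std::string name() const { return "Initial"; }
};

struct Reset : StateBase<Reset> {
  std::string name() const { return "Reset"; }
};
```

The base class can call name() on the concrete state at compile time, which is why a library built this way asks for the derived type up front.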

A

So that's just a quirk of this particular implementation. What we also see here is that, in the definition, we're passing Initial as the second template argument, which indicates that that's the initial state. This is in PG.h, and the implementation of peering is almost exclusively in PG.h and PG.cc.

A

We will primarily, or almost exclusively, be looking at those files. Going down in this definition, there's actually a listing of all the possible states, up to date as far as I know. So we enter the recovery machine and we transition to the Initial state, which is the default state for the machine.

A

From that state, so we're sitting in that state, the OSD, depending on whether this PG is being created or is being loaded from disk, will call handle_activate_map or handle_initialize, and in those functions we will create a state machine event.

A

In the case of handle_activate_map, we create an ActMap event, and in the case of handle_initialize, we create an Initialize event, and we tell the state machine to handle that event. In other words, we post that event to the state machine, and if we go back to our graph, we can see that when we're in the Initial state, if we get an advance map, an ActMap or an Initialize, we then transition all the way over here to Reset. We can see that in the code by looking at the Initial state.

A

Bear with me a second. Okay, we can see here, this is a transition: when we get an Initialize event, we transition to Reset.

A

Okay, so if we look at the Reset state.

A

So the other instance where we can come to the Reset state is if we'd already gone through peering and we're basically active and clean.

A

When the map changes, the OSD will make sure that we get to see that map, and so that's worth looking at.

A

We would get to see that in the Started state, and that comes through as an AdvMap event.

A

So when we see that, we call should_restart_peering on ourselves, and if that returns true we then transition to Reset. So regardless of whether we've been created, whether we're being loaded off the disk, or whether we're already in an active and clean state, we end up transitioning to Reset there.

A

We then receive an ActMap event, which is this line here, and we react to that event. The ActMap reaction is this definition here. So this is our reaction to the ActMap event: we do some checks and then transition to Started.

A

So if we look at the graph again, we're now entering this state here, and we can see the transition: we've gone from Initial, following this line, because of an ActMap or an Initialize, to Reset, and then Reset has a line here so that when it receives an advance map, one of these points here actually goes from Reset to that Started box.
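
The transitions walked through so far can be sketched as a plain transition function. This is a simplified illustration, not the Boost Statechart code in PG.h: next_state and its restart_peering flag are hypothetical, and the flag stands in for the result of the should_restart_peering check. The Reset to Started edge follows the ActMap reaction shown in the code.

```cpp
#include <cassert>

enum class State { Initial, Reset, Started };
enum class Event { Initialize, ActMap, AdvMap };

// Returns the next state for a (state, event) pair.
// restart_peering models what should_restart_peering returned (assumption).
inline State next_state(State s, Event e, bool restart_peering = false) {
  switch (s) {
    case State::Initial:
      // As described in the talk: any of the three events leads to Reset.
      return State::Reset;
    case State::Reset:
      // The ActMap reaction does some checks and transitions to Started.
      if (e == Event::ActMap) return State::Started;
      return State::Reset;
    case State::Started:
      // A map change only sends us back to Reset if peering must restart.
      if (e == Event::AdvMap && restart_peering) return State::Reset;
      return State::Started;
  }
  return s;
}
```

Whether the PG was just created, loaded from disk, or was already active and clean, every path funnels through Reset before reaching Started.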

A

Okay, and the boxes represent a context. Once again, it's a state machine concept, where states can be grouped under a context.

A

So whenever we're within one of these boxes, it means that that state is active. So when we're in Primary or Peering, we're still within Started, so Started can still be receiving events.

A

It's important to understand that in order to understand how the state machine works: we can be in a substate, such as Peering here, and Started can still receive an event and do something based on that event.
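
A minimal sketch of how an event that the innermost state does not handle can still be picked up by an enclosing state. StateNode and dispatch are hypothetical helpers written for this illustration; the real library does this through its own reaction lists.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A state plus the box (context) that encloses it in the graph.
struct StateNode {
  std::string name;
  const StateNode* parent;           // nullptr for the outermost state
  std::vector<std::string> handles;  // names of events this state reacts to
};

// Walk outward from the innermost active state and return the name of the
// first state that reacts to the event, or an empty string if none does.
inline std::string dispatch(const StateNode* innermost,
                            const std::string& event) {
  for (const StateNode* s = innermost; s != nullptr; s = s->parent) {
    for (const std::string& h : s->handles) {
      if (h == event) return s->name;
    }
  }
  return "";
}
```

With Peering nested inside Primary inside Started, an AdvMap arriving while we are in Peering is still seen by Started, which matches the behaviour described above.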

A

All right, so continuing on, we've entered Reset, and the return from this function is saying we're transitioning to Started.

A

I'd also recommend looking at the logs, following the logs with debugging set to 20, for example; that's usually the default debugging value.

A

If you follow the output in the logs, as we enter each state there's a notification in the log, and as we exit, each state will put an entry into the log to say that it's exiting. So you can follow the transitions between states from the log messages, and that can be handy in understanding how the transitions are working.

A

So if we now look at the started definition.

A

But we need to look at the Start state. So we go into the constructor of the Start state, and then we make a decision, based on whether we're the primary or not, as to where we're going to transition. If we're primary, we transition to the Primary state. If we are not primary, we transition to the Stray state.

A

I think most of us probably know the Stray state is a state we go into where we are holding relevant objects and we may be required for either backfilling or logs, etc. So we stay in that Stray state and either transition to become one of the replicas, or part of the acting set, or eventually, if it's decided we're no longer needed, the primary will tell us that we're no longer needed and we can delete the PG, basically.

A

So when we post that event, MakePrimary, that forces the transition into the Primary state. So if we look at Primary.
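
The decision made in the Start state can be sketched as two small functions. MakePrimary, Primary and Stray are names from the talk; MakeStray as the counterpart event, and the helper functions start_event and react_in_start, are assumptions made for this illustration.

```cpp
#include <cassert>

enum class Event { MakePrimary, MakeStray };
enum class State { Primary, Stray };

// Which event the Start state posts, based on our role for this PG.
inline Event start_event(bool is_primary) {
  return is_primary ? Event::MakePrimary : Event::MakeStray;
}

// Where the posted event takes the state machine.
inline State react_in_start(Event e) {
  return e == Event::MakePrimary ? State::Primary : State::Stray;
}
```

Posting an event rather than transitioning directly keeps the decision (which event) separate from the wiring (which state that event leads to).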

A

And we enter the Primary state, and from there... I'm looking for the transition, but it's a default transition, so unfortunately this involves a lot of flicking between the header and the implementation. Here is our Primary state. We can see on this definition...

A

Come on, whoever's typing, can you mute, please?

A

So, in this definition, that's slightly different to what we've seen before. We now have three template arguments. The first is the derived class itself, as noted before. The second is the context, which is the parent state; so we're in Primary, and the parent state, or the context state, is Started, which we've seen before. But then we have a default transition as well, so when we enter Primary, we go straight into Peering.

A

If we look at Peering, we can see that when we enter Peering, we go directly into the GetInfo state.
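
A small sketch of how default inner states chain together on entry. The lookup table and the enter function are hypothetical; they only mirror the Primary, Peering and GetInfo relationships described in the talk.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Returns the list of states entered, outermost first, when `state` is
// entered and each default inner state is followed in turn.
inline std::vector<std::string> enter(const std::string& state) {
  // Default inner state for each state that declares one (per the talk).
  static const std::map<std::string, std::string> default_inner = {
      {"Primary", "Peering"},
      {"Peering", "GetInfo"},
  };
  std::vector<std::string> entered;
  std::string current = state;
  entered.push_back(current);
  for (auto it = default_inner.find(current); it != default_inner.end();
       it = default_inner.find(current)) {
    current = it->second;
    entered.push_back(current);
  }
  return entered;
}
```

Entering Primary therefore lands in GetInfo without any event being posted, which is why the header has to be read alongside the .cc file.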

A

So we need to look at that. Once we get to GetInfo, there is no default transition, so in order to see what happens in GetInfo, we have to go back to the .cc file and look for GetInfo.

A

So in this state we look at the past intervals, and we work out a prior set by building the prior set. In check_past_interval_bounds, we're looking at the map, or the most relevant parts of the map, to work out what our past interval bounds should be, so we know what intervals we have to cover. Again, we're looking at the map, working out what set of OSDs we need to query. Then we call the actual function, get_infos.

A

We're sending a query to the other OSDs to get the pg_info_t structures from them. If we look at pg_info_t, it's a summary of the PG statistics for the PG.

A

It includes an eversion_t, which is a combination of the epoch and the version number, which is constantly incrementing as changes come in.
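
A simplified stand-in for the (epoch, version) pair described here. EVersion is a hypothetical type for illustration, and ordering by epoch first and version second is an assumption about how such a pair is compared.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Illustrative stand-in for an (epoch, version) pair.
struct EVersion {
  std::uint32_t epoch;    // map epoch in which the change was made
  std::uint64_t version;  // monotonically increasing version number
};

// Order by epoch first, then by version within the same epoch (assumption).
inline bool operator<(const EVersion& a, const EVersion& b) {
  return std::tie(a.epoch, a.version) < std::tie(b.epoch, b.version);
}
```

Comparing two such values is how "newer last update" can be decided later in the peering process.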

A

So in the pg_info_t, each PG on each OSD keeps a record of last_update, last_complete and last_epoch_started, which is very relevant to the peering discussion. last_epoch_started is the last time this PG went active, so once this PG has done all of this, it will declare itself active and it will update last_epoch_started. Also included in the pg_info_t structure is a pg_history_t called history, and if we look at that structure.

A

Okay, so this is the pg_history_t structure. It also has a last_epoch_started.

A

So the reason why there are two of these, and we'll hopefully get into this more later, is that when the PG declares itself active on the primary and goes into activate, it will set its last_epoch_started in the pg_info_t.

A

But it's only when the primary knows that all replicas are activated and all have committed that last_epoch_started that it will set last_epoch_started in its history, in the pg_history_t structure that's embedded in the pg_info_t.

A

The reason it does that is when it calls activate and sets last_epoch_started there, it can then accept writes, and so it will then instruct the replicas to write that down and to commit it, and that may be part of the first write that it sends to them.

A

So once they've done that, they will then acknowledge that, and once we've received the AllReplicasActivated event, we then update history, and that becomes more relevant later in the process. So we'll continue on. We call get_infos.

A

Look at that definition. This is where we're asking for the info from each peer, so we've decided on a list of relevant peers and we are probing them for their PG info.

A

So once we have all of those, the GetInfo state will receive a GotInfo event. If we look for that, we can see that in GetInfo, when we receive a GotInfo event, we transition to GetLog. So once all OSDs have replied with their info, we then generate that GotInfo event and transition to GetLog. If we go back to the .cc file, yeah, we can see it there, by posting the GotInfo event.

A

Yeah, so sorry: we call get_infos, we ask for the infos, we get them back, and we post an event once peer_info_requested is empty and we know that we've received the peer info from all of the peers. We post the GotInfo event and we transition to the GetLog state.
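
The bookkeeping described here can be sketched as a set of outstanding queries. InfoTracker is a hypothetical class modelled loosely on the peer_info_requested set mentioned in the talk; OSDs are represented by plain integer ids.

```cpp
#include <cassert>
#include <set>

// Tracks which peers we are still waiting on for their PG info.
class InfoTracker {
 public:
  explicit InfoTracker(const std::set<int>& peers) : requested_(peers) {}

  // Record a reply from an OSD. Returns true only for the reply that
  // empties the set, which is the point where GotInfo would be posted.
  bool on_info(int osd) {
    if (requested_.erase(osd) == 0) return false;  // unknown or duplicate
    return requested_.empty();
  }

  bool waiting() const { return !requested_.empty(); }

 private:
  std::set<int> requested_;
};
```

Returning true exactly once makes it hard to post the event twice if a duplicate reply arrives.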

A

So let's have a look at that.

A

We call choose_acting and we find out whether we can actually decide on an acting set, and this is the guts of the peering process, really. If that fails, we then need to decide whether we generate a NeedActingChange, in other words change our acting set, or whether we go to the Incomplete state. So if we look at that visually, we're now in GetLog, and we can see that if we generate IsIncomplete, we go to the Incomplete state.

A

If we receive a GotLog event, we transition to GetMissing, but what's interesting is the choose_acting function. So if we look at the definition of that.

A

So we're calculating the desired acting set, and if it differs from the current acting set, we'll request a change from the monitor.

A

In order to do that, we need to find the best info, and therefore the best log, or the most authoritative log that we can find, and we do that in this function, find_best_info.

A

Okay, so when we're looking for the best info, or the authoritative log, we're going to prefer a log that has a newer last_update; second priority is a longer tail; and third priority is that we're going to have a preference for the current primary. So you can see that, yeah.
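
The three preferences just listed can be sketched as a comparison function. This is an illustration of the ordering only, not the find_best_info implementation: PeerInfo, better and pick_best are hypothetical, last_update is collapsed to a single number, and a smaller log_tail is taken to mean a longer log.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

struct PeerInfo {
  std::uint64_t last_update;  // simplified to one number for the sketch
  std::uint64_t log_tail;     // oldest entry kept; smaller means longer log
};

// True if the candidate should replace the current best.
inline bool better(const PeerInfo& cand, int cand_osd, const PeerInfo& best,
                   int primary) {
  // 1. Prefer the newer last_update.
  if (cand.last_update != best.last_update)
    return cand.last_update > best.last_update;
  // 2. Then prefer the longer tail.
  if (cand.log_tail != best.log_tail) return cand.log_tail < best.log_tail;
  // 3. Then prefer the current primary.
  return cand_osd == primary;
}

// Returns the OSD id holding the best info, or -1 if infos is empty.
inline int pick_best(const std::map<int, PeerInfo>& infos, int primary) {
  int best = -1;
  for (const auto& kv : infos) {
    if (best == -1 || better(kv.second, kv.first, infos.at(best), primary))
      best = kv.first;
  }
  return best;
}
```

The primary only wins when it is tied on both earlier criteria, so it never displaces a peer with a genuinely better log.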

A

We've built the prior set in GetInfo and we've generated our list of infos.

A

Just bear with me a second okay.

A

In choose_acting, we combine the info from the peers with our own info, and then we feed all of that, as the infos argument here, into find_best_info to calculate the authoritative log.

A

The first thing we look at is the maximum history.last_epoch_started available amongst all peers, including ourselves. If you remember, history.last_epoch_started is relevant because that's the last time the PG as a whole went active.

A

And was able to accept writes. Since the PG as a whole was able to accept writes, we can be confident that peering completed successfully in that epoch, and that all the OSDs in the acting set at that time will have recorded all writes acked to the client prior to this epoch.

A

We also know that all acting OSDs for that interval must have recorded all writes acknowledged to the client during the interval that begins with that epoch.

A

Sorry if that sounds a bit jargony, but it's kind of a difficult concept to portray and you've really got to think about it.

A

If any peer, including ourselves, has a last_epoch_started greater than or equal to the maximum history.last_epoch_started that we have found, we set a minimum last_update based on that.

A

But that's what all this is about.

A

So if peering is going to fail, it quite often fails here, and this is where we'll go to the incomplete state.

A

So we set the minimum last_update based on the last_update field from the info that we've determined to be authoritative.

A

The last_update field is a pair of values: the epoch, as mentioned before, and a monotonically increasing version number.

A

So we check last_epoch_started here instead of history.last_epoch_started, since any peer with history.last_epoch_started set to a particular epoch must have at least that value in last_epoch_started, and may have a later value, indicating it has completed activation at that epoch but the PG as a whole did not go active, since history.last_epoch_started shows a lesser epoch.
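
A sketch of the two passes described above, under simplifying assumptions: Info, max_history_les and min_acceptable_last_update are hypothetical names, last_update is reduced to a single number, and only the steps mentioned in the talk are modelled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Info {
  std::uint32_t last_epoch_started;          // info.last_epoch_started
  std::uint32_t history_last_epoch_started;  // info.history.last_epoch_started
  std::uint64_t last_update;                 // simplified to one number
};

// Pass 1: the newest epoch in which the PG as a whole is known to have
// gone active, across all peers including ourselves.
inline std::uint32_t max_history_les(const std::vector<Info>& infos) {
  std::uint32_t max_les = 0;
  for (const Info& i : infos) {
    if (i.history_last_epoch_started > max_les)
      max_les = i.history_last_epoch_started;
  }
  return max_les;
}

// Pass 2: among peers whose own last_epoch_started reaches that epoch,
// take the smallest last_update. Returns false if no peer qualifies,
// which corresponds to the failure case.
inline bool min_acceptable_last_update(const std::vector<Info>& infos,
                                       std::uint64_t* out) {
  const std::uint32_t max_les = max_history_les(infos);
  bool found = false;
  std::uint64_t min_lu = 0;
  for (const Info& i : infos) {
    if (i.last_epoch_started < max_les) continue;
    if (!found || i.last_update < min_lu) {
      min_lu = i.last_update;
      found = true;
    }
  }
  if (found) *out = min_lu;
  return found;
}
```

A peer whose own value is ahead of the agreed history value still qualifies in pass 2, which is the case the explanation above is careful to allow for.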

A

So, depending on how that all pans out, we will either return infos.end(), which we're doing here, which will be interpreted as a failure, or we return best: the iterator to the pg_info_t in infos that represents the most authoritative pg_info_t, and therefore log, that we could find, or it will equal infos.end(), which once again is interpreted as a failure. So once we've done that,

A

We then ask for that log, if relevant: if the pg_info_t belongs to another OSD, we'll query that OSD for its log. Otherwise, if it's our log, then we'll use our log as the authoritative log. Once we have that log, bear with me a second, we'll get a GotLog event, if we look for that.

A

So that gets posted after choose_acting, and therefore subsequently find_best_info, have run, so they've run up here.

A

We can then post a GotLog event if we, being this instance of the PG on this OSD, know that our own log is the authoritative one. Therefore, it posts the GotLog event.

A

The other place where that event will be posted is once we've received the log from the PG on another OSD that we've determined to be authoritative. We'll request a log from that, and once we get it, we post a GotLog event. Then we react to that event down here. This is our reaction to the GotLog event in GetLog, and we call proc_master_log. In proc_master_log, we merge the authoritative log with our own, if that's necessary.

A

So we merge that log and then transition to GetMissing, the next state. So you can see that once again, yeah.

A

So we were in GetInfo. We got a GotInfo event. We transitioned to GetLog. We then successfully executed find_best_info. We worked out the authoritative log. We requested that log, if necessary, from one of the other OSDs, the OSD that we've decided has a copy of the PG with the authoritative log. Once we receive that back, we generate a GotLog event and we transition to GetMissing. So we look at GetMissing.

A

We iterate through our acting_recovery_backfill targets, and then we work out our missing set and also what we need to do to achieve backfill.

A

In this condition, we've determined that we need to completely backfill, because we have an empty missing set.

A

We look for divergent updates.

A

We request the missing set and the full log from each of the OSDs that we are peering with.

A

In this case we've determined we need up_thru. So up_thru is a field within the pg_info_t that we need to request the monitor to update, and in order to do that, we post the NeedUpThru event.

A

When we do that, we transition to this WaitUpThru state.

A

Once we've accomplished that, we can post the Activate event, and that includes the epoch.

A

Look at the definition of Activate. Activate is actually an event, so we're creating an Activate event, and we can see that this Activate event, which is here, takes us into the Active state.

A

So we can see that by looking at... Yeah, so this event is actually being captured by the Peering state. We're well down into GetMissing, we post this event, and Peering reacts to the event. So when Peering reacts to the Activate event... We're posting the Activate event. Okay, here we are. No, that's ReplicaActive.

A

We haven't defined a reaction for that transition. It's just a simple transition, so this line defines that when we receive the Activate event, we transition to the Active state.

A

This is why you've got to look at the header declaration for a state, as well as the definition in the source file, because these transitions can be defined as a reaction to an event, or just as a simple transition where we say: okay, if we receive an Activate event, we transition to Active. So if we look at the Active state.

A

It has a default transition to Activating, so we now need to look at the Activating state.

A

Okay, so it has no default transition, so we want to look at the definition.

A

Sorry, I got a little bit confused. Okay, so this is the constructor for Active. This is where we're actually activating: in the constructor of the Active state. Even though we transition directly to Activating, we do some work in the constructor of the Active state.

A

We do a few checks and, significantly, call activate, and if we look at the definition of that function, we can see we once again do some checks, some asserts. We clear the down state.

A

If we're the primary, we update our last_epoch_started. So at this point, when we update that last_epoch_started, we are acknowledging that we have an authoritative history of this PG, and so, if we go through the peering process again at some point in the future, we will be considered the best, or the most authoritative, OSD to query for an authoritative log, because at this point in time we have decided that we are ready to go active and that we have an authoritative history of this PG.

A

Sorry if I keep repeating the same thing, but it's important to get that concept, and it's difficult to wrap your head around, especially to begin with.

A

Once we've done that, we update a few of the other values, including our last rollback.

A

Then we create a... I forget what we call that. What do we call that, Josh? The C, that stands for a completion? Yeah, a completion. So we create a completion that is going to trigger when everyone has committed.

A

Then we do some housekeeping to do with snaps and recovery, but the main thing is we've created that completion.

A

There's a lot of housekeeping of weirdness, which is basically, if we find any values that are a little bit strange, we log them, or we create an error, but it's just an error. It doesn't have any effect on what we're doing, so we just log the fact that we've found some strangeness. How often we see that when it's not in a testing environment, I suspect we don't see that very often at all.

A

This is where we start up the replicas and activate the peers, so we'll send them a message to activate.

A

Okay, that else-if, I don't know; we're going to ignore it.

A

So this is recovery rather than peering. So then we're finished with that.

A

So if we go back to the completion. Okay, so C_PG_ActivateCommitted. If we look for all instances of that symbol, we can see that, in its definition, when that completion triggers, we call _activate_committed.

A

But if we look at the definition of that: if we've reset since, we basically don't do anything. But if we haven't, and we are the primary, we insert ourselves as an activated peer and we make a note that activation is committed, and if we are the last peer to activate, we call all_activated_and_committed. If we look at that,

A

At this point we know that all peers have activated and they have committed info.last_epoch_started. Now we can update info.history.last_epoch_started, which, as I mentioned before, is an indication that the PG as a whole, in other words all the OSDs involved in the custodianship, if you like, of the PG, for want of a better term, have all activated and have all committed last_epoch_started. That's what's required before we commit history.last_epoch_started.
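
The two-step update of last_epoch_started can be sketched as follows. PrimaryActivation and its members are hypothetical; the sketch only models the ordering described in the talk: the info value is set at activation, and the history value is set once every member of the acting set has committed.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

struct PrimaryActivation {
  std::set<int> acting;           // OSDs that must activate, primary included
  std::set<int> activated;        // OSDs whose activation has committed
  std::uint32_t les = 0;          // stands in for info.last_epoch_started
  std::uint32_t history_les = 0;  // stands in for info.history.last_epoch_started

  // The primary goes active in this epoch: only the info value moves.
  void activate(std::uint32_t epoch) {
    les = epoch;
    activated.clear();
  }

  // An OSD reports that its activation committed. Returns true when the
  // whole acting set has done so, at which point history catches up.
  bool activate_committed(int osd) {
    if (acting.count(osd) == 0) return false;  // not part of the acting set
    activated.insert(osd);
    if (activated.size() != acting.size()) return false;
    history_les = les;
    return true;
  }
};
```

If the PG is interrupted before the last commit arrives, les is ahead of history_les, which is exactly the mismatch that peering has to be able to interpret later.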

A

We do some checks, make sure we are not degraded, and then we queue an event, which is AllReplicasActivated. So if we look at that event,

A

The Active state reacts to the AllReplicasActivated event and sets its info.history.last_epoch_started to last_epoch_started, so they should always match, but depending on OSDs flapping and the timing of things, they may not match. So if we have an info.last_epoch_started that is greater than the history.last_epoch_started, we know that that PG on that OSD went active, but it was not active long enough to receive an acknowledgment from the other PGs in the acting set that they also went active and wrote down the last_epoch_started.

A

So we have to be able to handle that case when we're peering and determining which is going to be the authoritative log.

A

And once we get there, we're active, and that basically is the end of peering.

A

We then determine what recovery we need and what backfill we need, or we determine that all replicas are clean and that we do not require any recovery, and in that case we issue AllReplicasRecovered.

A

That gets interpreted... Okay, so Recovering reacts to that, which then transitions to the Recovered state.

A

We've transitioned here to Recovered, and then the next step is to transition to Clean. So once we do that, we're active and clean; we've peered, we're active and clean, we're up and running. Well, we were up and running as soon as we all went active. But that's basically the end of what I was going to talk about today. We've gone a little bit over time, but not too far.

A

Anyone have any questions? Have I put you all to sleep?

C

Tim here. I don't have any questions, but I wanted to say thanks for going through that. I've not looked at the guts of the OSD code in detail, so in some respects it's a little bit baffling for me, but at least I think now I've got somewhat of a feel for how this thing hangs together. So if I do have to go looking in this code, at least I think I know where to start now, which I didn't before. So that's helpful.

A

Yeah, look, don't feel bad about feeling baffled. I think anyone who approaches this code for the first time is baffled. I think that's a very good description of the initial state of someone who starts looking at the state machine.

A

Yes, it takes a little bit of getting used to, but once you get used to it, it's a good solution to the problem, and it's quite straightforward once you understand how the transitioning between states works, and this graph is key to that. Without the visual representation it would be more difficult. This helps to shed some light on it, so it's handy to have this around if you're looking at this code, that's for sure. Thanks, Tim.

A

All right well.

C

Hey our prize, so where can I get hold of this picture? It's.

A

Here: docs.ceph.com/docs/master/dev/peering. It's at the bottom there. You can see it in there; it's not particularly readable, but if you download that image you can scale it up.

A

All right, my BlueJeans seems to be a little bit frozen. I can't do anything with BlueJeans at the moment.

A

So if no one else has got any other questions, can someone at least kill the recording, or maybe stop the sharing? I can't; none of my buttons on BlueJeans are working.

B

Yeah, I can take care of that. Thanks, Brad, thanks for going over this code. I'm sure it's useful to a lot of us. Thanks.

A

I can't see the participants, so I'm not exactly sure who's on the call, but if you have any ongoing questions, let me know. I'm always happy to talk about this sort of thing, and I hope the explanation was reasonably lucid and reasonably easy to follow.

B

Yes, I think it was, and we don't have any outstanding questions on the chat bar now. But if anybody has any more questions, feel free to reach out to Brad over email or IRC.

A

Definitely. All right, we'll call it a day.

B

Thanks again, thanks for joining, and thanks, Brad.

A

Very welcome. Thanks, everyone. Thanks.

C

Thanks. Thank you. Thanks.