From YouTube: 2015-MAY-27 -- Ceph Tech Talks: Placement Groups
Description
A detailed look at placement groups in Ceph. What they are, what they do, and how to use them in managing and troubleshooting your cluster.
http://ceph.com/ceph-tech-talks
A: All right, welcome everybody to the next monthly Ceph Tech Talk, where we try to feature some of the inner workings of Ceph at a deep technical level, so that we can enhance the greater technical understanding of Ceph. In the past we've looked at RADOS, the block device and the gateway, as well as the Calamari API and all associated GUI things. This month we have an examination of placement groups by a longtime friend and supporter of the Ceph project, Florian. Would you like to go ahead and introduce yourself?
B: [I picked] placement groups because, as I found out, it's one of those things that kind of confounds people: what are they actually good for, and what can we do with them? So this tech talk is called Placement Groups Inside and Out, and we'll get started straightaway. So let's talk about data placement for a moment, and let's go off on a tangent here for a little bit.
B: What do data placement and conventional data lookup look like in legacy distributed storage systems? The way this typically works in a conventional, legacy system — very much unlike Ceph — is this: you talk to a central lookup server to find out where you need to write your data to, or where you need to read your data from, and then you take that information and go to the storage servers and talk to them directly. That's the standard legacy method of doing things.
B: What Ceph wants to do, however, is actually scale this to potentially thousands of storage locations and petabytes and exabytes of data. So how does a central lookup work in terms of actual horizontal scalability? To what extent does it provide horizontal scalability at that petabyte and exabyte scale? Actually, it doesn't. That sort of thing really doesn't work, because your central lookup facility always becomes a bottleneck and a single point of failure. That's not what Ceph wants to do.
B: Instead, the idea in Ceph is that you don't actually look up where data is; rather, every single component in the system — including clients, including servers — works out for itself where to find its data. So that's a standard feature of Ceph. But how do placement groups actually come in? Why do we need placement groups? What are they good for, and what is their specific role?
B: [Suppose I'm a tennis coach with five students and a] cart of 300 tennis balls. I chuck the first one to my first student and the second one to my second student, all the way to the fifth, and then I start over at the beginning. When I'm done with that, I'm going to have my tennis balls — my data — distributed evenly among my storage locations, my students. Well, but actually my OCD isn't quite happy with that, because what I really want to do is distribute these tennis balls reproducibly.
B: In other words, I want to be able to look at a specific tennis ball and know exactly where it needs to go, and when I see a ball lying around on the ground on the court, I want to be able to pick it up and know exactly which of my students it belongs to. So what can I do to achieve that, and to make this sort of round-robin thing more reproducible? Well, I just take a Sharpie and number all of my tennis balls, from zero to two hundred ninety-nine, and then I apply a very simple algorithm to distribute them — and that simple algorithm is effectively just division, a modulo algorithm. So I can randomly pick up the ball with the number 130: divided by five, the remainder is zero, so it goes to my first student, Alice. I pick up another one, 171: divided by five, the remainder is one, so it goes to Bob. I pick up another ball.
B: Thirteen divided by five: the remainder is three, and it goes to Daisy. So I'm happy — I have found and devised a way of simply and reproducibly distributing my data, all of it. Until disaster strikes: something terrible happens at 9:30 in the morning. I have a sixth student walk onto my court. Frankie slept in, is still a little dazed and confused, and now walks onto my tennis court. Now I have a sixth student — and now, what do I do?
B: Well, I figure that maybe we're not doing that badly, because, as it happens, I pick up a ball randomly and it has the number 31. I divide by five: the remainder is one. Divided by six, the remainder is also one, so it stays with Bob, and I think we're kind of cool. But of course that doesn't last long, because I pick up another one that happens to have the number 36.
B: This time the remainders are different, and now I realize that because of this simple modification — namely, one additional student — I need to shuffle balls around between all of my students, and some of that shuffling doesn't actually involve Frankie, my new student, at all. As you've seen in this example, there's a ball going from Bob to Alice, and so forth. So that naive approach doesn't really work.
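To make the naive scheme concrete, here is a minimal shell sketch of it, plus a count of how much data the five-to-six-student change actually moves. (Carol and Eric are made-up names; the talk only names Alice, Bob, Daisy and Frankie.)

    #!/bin/bash
    # naive placement: ball number modulo the number of students
    students=(Alice Bob Carol Daisy Eric)
    for ball in 130 171 13; do
        echo "ball $ball -> ${students[ball % 5]}"
    done
    # ball 130 -> Alice, ball 171 -> Bob, ball 13 -> Daisy

    # how many of the 300 balls change owners when a sixth student joins?
    moved=0
    for ball in $(seq 0 299); do
        if [ $(( ball % 5 )) -ne $(( ball % 6 )) ]; then
            moved=$(( moved + 1 ))
        fi
    done
    echo "$moved of 300 balls move"   # prints: 250 of 300 balls move

A one-sixth change in capacity moves over 80 percent of the data — exactly the mismatch described above.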
B: Now, what that means is that we get a change in capacity — in this case, the number of students — and the number of tennis balls that I need to shuffle around is totally out of whack with the actual change in capacity. So what I'm creating with this simple and naive approach is something that distributes very evenly, but that creates an insane amount of migration if there is the slightest change to the system. And this is a problem that every distributed storage system has: every distributed storage system wants to distribute data evenly, but it also wants a configuration that causes minimal migration when capacity changes. Just about every contemporary distributed storage system solves it in effectively the same way: in Ceph, they're called placement groups; in Swift, they're called partitions; and so forth. So how does this work? What do we do? Well, it's actually fairly simple: we add an additional management layer, and in the tennis camp analogy, that's just a set of buckets that we can put our tennis balls into.
B: [So suppose I distribute my 300 balls into 60 buckets — say, by ball number modulo 60 — and then assign those buckets to my] students. The students are, in Ceph, the OSDs, and the list of bucket-to-student assignments — which is effectively a set of parameters to this very simple algorithm that we're using — translates, in Ceph, to the OSD map. Now, it's important to understand that this map, the parameters to our algorithm, changes only when our storage topology changes: if we remove a storage location or add a new one, that's when we need to update the map. That map is lazily propagated throughout the system, and on the whole, what we get is something that is much more effective, much more efficient and much more scalable than central data lookups. And that's really the whole story of why we need placement groups in the first place. They are a very, very simple addition to an otherwise very simple algorithm — but it's simple and genius.
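A minimal sketch of that two-step scheme, using the 60-bucket figure from the talk. (The modulo step is an illustrative stand-in; Ceph's real mapping hashes the object name into a placement group and then runs CRUSH against the map to pick OSDs.)

    # step 1: ball -> bucket (in Ceph: object -> placement group);
    # depends only on the ball number, so it never changes
    pg_count=60
    ball=130
    bucket=$(( ball % pg_count ))
    echo "ball $ball -> bucket $bucket"   # ball 130 -> bucket 10

    # step 2: bucket -> student (in Ceph: placement group -> OSDs);
    # only this small assignment table changes when a student joins
    # or leaves, so very little data has to move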
B: One thing that I didn't mention previously is that, typically, what you should shoot for — and again, this is true for Ceph, but it's also true for other distributed storage systems — is that the number of buckets you have in the system, the number of placement groups, is one to two orders of magnitude larger than the number of actual physical storage locations. That's what you also got with the five-to-six students and the 60 buckets. The reason for that is, of course, that you want to be able to redistribute the buckets and, at best, avoid the situation where you actually need to split buckets, because that's an expensive operation.
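As an aside, you can check how many placement groups a pool has like this (the pool name is an example):

    ceph osd pool get rbd pg_num   # e.g. "pg_num: 128"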
B
So
this
is
why
we
need
placement
groups.
This
is
why
we
have
them
in
SEF.
So
what
can
we
do
with
them?
Well,
we
should
look
first
at
how
we
can
actually
examine
placement
groups
and
their
status.
The
way
that
you're,
probably
going
to
do
that
initially
most
of
time
or
very
frequently,
is
with
the
set
health
command.
So
let
me
run
you
through
this
here
real
quickly,
so
what
I'm
doing
here
is
there
we
go
that
is
taking
a
little
longer
than
you
expect
it.
B: If you now do ceph health, you get the health warning, and if you do ceph health detail, what you get back is a list of all of the placement groups in the system and their current status. I'm sorry — not all the placement groups, but only the ones that are not in the active+clean state, the ones that are currently affected by some sort of outage.
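The two commands in question — output omitted here, since it varies by cluster:

    ceph health          # one-line summary: HEALTH_OK, HEALTH_WARN, ...
    ceph health detail   # additionally lists each PG not active+clean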
B: That's my OSD that I'm restarting; my previously degraded placement groups are now going into a recovery state, as you can see here, and then it takes just a little while and we get back to the HEALTH_OK state. So that's the first way of getting a handle on the status of your placement groups: simply the ceph health command. And if you want to get a little more information, you can use ceph pg dump. So let's take a look at what ceph pg dump does for us.
B: ceph pg dump simply gives us a tabular overview of all of our placement groups, and what I've done here at the bottom — because it's a fair screenful of output — is a quick grep for one specific placement group ID. So let me quickly walk you through the output. You get, as I said, a tabular output that starts with the placement group ID, and then it gives us various bits of information, such as the placement group state — in this case, active+clean.
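For instance, with a made-up placement group ID:

    ceph pg dump | grep '^0\.1f'   # one row: PG ID, state, OSD sets, ...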
B: If you want to get a little more information still, you can do that as well: there is the ceph pg query command. Now, I found out that while this command is perfectly well documented in the Ceph documentation, it apparently is not known to that many people, so I want to make a point of mentioning it here.
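Usage is simply (the PG ID is again made up):

    ceph pg 0.1f query   # dumps the full state of that PG as JSON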
B: It tells you, among other things, when the placement group was last active or went stale, when the last scrub or deep scrub ran, and a bunch of other information — and we're going to come back to that in a little bit, because in one specific error state it carries one very crucial bit of information. So that's ceph pg query. And then, finally, every once in a while you may want to find out which placement group a specific object belongs to in the first place, and that you can do very simply with ceph osd map.
B: Let's take a look at the ceph osd map command here real quickly. What I'm doing here is simply enumerating RADOS objects in a specific pool and then mapping one object in that pool. The syntax for that is ceph osd map, followed by the pool name, followed by the object name — and what we get back is the PG that this object belongs to, what the current replica set for this object is, and the primary OSD for it as well.
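A sketch of that demo, with example pool and object names:

    rados -p rbd ls | head -3     # enumerate a few objects in the pool
    ceph osd map rbd some-object  # prints the PG this object maps to,
                                  # plus the up/acting OSD sets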
B: [The next thing to understand] is the placement group state, or placement group status: what's in there, and what are the typical states or statuses that you can find? The normal state for a placement group is called active+clean. So what does this mean? It means that the placement group is currently available to process requests — that is to say, we can read from it and we can write to it.
B: If we have a placement group that is currently in a degraded state, that simply means the placement group currently has fewer replicas available than are mandated by its pool's size. In other words: for example, we have a pool size of three and only two replicas available. So what should you do in this case? Well, most of the time, actually nothing — typically, no action is needed in this state.
B: It will either run through recovery, which will commence when the OSD comes back, or, when its mon osd down out interval (by default, five minutes) expires, Ceph will assign a new OSD to the placement group and recovery will commence from there. So a degraded state is basically nothing to particularly worry about: it simply means that one of your OSDs is currently down, but the placement group and the data in it are still perfectly available.
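To check that interval on a running monitor (the monitor name is an example):

    ceph daemon mon.a config get mon_osd_down_out_interval   # seconds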
B: [Then there's the backfill state.] This, for example, is the case when you add new OSDs and data is being reassigned, or when a previously up and in OSD has gone down and has subsequently been marked out: Ceph has selected a new replica on a different OSD for a specific placement group and is now backfilling — filling the new OSD with the data in that placement group. Then we have the incomplete state. Now, that one is a little more tricky.
B: It's actually a warning state, and it means that in a specific placement group we have fewer replicas available than are mandated by the pool's minimum size — the pool's min_size parameter. So, for example, if you have a pool min_size of two and only one replica available, then that PG would be marked incomplete. So what can you do in order to recover from that? Well, if you have a PG stuck in this state, it basically means that Ceph cannot fulfill the redundancy guarantees that you've defined for the pool, given the current CRUSH map.
B: However — and that's a big "beware" there — you should take into account that a CRUSH map change might trigger some, or potentially a lot of, reshuffling that you actually don't want at this point. Another thing that you can do temporarily is lower the min_size of that specific pool, and then bring it back up to the previous min_size later, in order to get out of the incomplete state. But the best way is to just find another OSD that Ceph can replicate to — by default, at least.
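That temporary workaround might look like this; the pool name and values are examples, and since lowering min_size weakens a redundancy guarantee, treat it as a last resort:

    ceph osd pool set rbd min_size 1   # temporarily accept a single replica
    # ...wait for the affected PGs to go active and recover...
    ceph osd pool set rbd min_size 2   # restore the previous guarantee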
B: If a placement group is inconsistent, that means that you've previously run a scrub or a deep scrub operation on it and it has detected errors. Errors that can be detected immediately from the objects' metadata in the OSD are something that a plain scrub would catch; if you have files in there that have the correct size, the correct metadata and whatnot, but don't have the correct content, [then that's something only a deep scrub would detect].
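Both kinds of check can also be triggered by hand (PG ID made up):

    ceph pg scrub 0.1f        # compares object sizes and metadata
    ceph pg deep-scrub 0.1f   # additionally reads and compares contents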
B: [A placement group is down when activating it would send you back in time. Say] you have one of your PGs with one of its replicas going down; then you modify the other one; then that one goes down as well, and you bring the original one back up. That one now has data that is actually stale — if it were to become active, it would warp you back in time — and so, by default, Ceph will disallow that and mark the PG down, which means that you actually can't do any I/O to that placement group.
B: You can also find out with pg query why a placement group is down: in the JSON that ceph pg query produces, you're going to see an entry called down_osds_we_would_probe, and that's the OSD that you need to be looking for — the one that you then subsequently need to recover, or declare lost. Again, there's a caveat.
B: If you declare an OSD lost — and this is actually very well documented in the Ceph documentation — you will potentially lose some data, because you might lose the last updates that were seen on only that OSD. Okay.
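Put together, a sketch of that procedure, with made-up PG and OSD IDs:

    ceph pg 0.1f query | grep -A 3 down_osds_we_would_probe
    # if OSD 7 turns out to be unrecoverable, declaring it lost lets the
    # PG proceed, at the risk of losing its latest updates as just noted:
    ceph osd lost 7 --yes-i-really-mean-it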
B: So that basically concludes my little three-part talk about placement groups: what they're good for, how we can get their status, and how to interpret that. Before we go to the questions: if you're interested in the slides, you can actually find them right now on my .io site, at /tech-talk-pg.
A: If you have questions, feel free to either type them in the chat or unmute your microphone and ask them as well. The only thing that I might add to Florian's talk is that we also have a handy-dandy placement group calculator on the ceph.com site. If you're looking to build a new Ceph cluster or expand one, you should definitely take a look at it — it will help give you an idea of how many placement groups your cluster might need. I'm pasting it out [in the chat] now.
B: It's very simple, yet elegant. The way this is done is that the OSD map — you may have picked up on this when I previously ran the ceph osd map command — is versioned with an incrementing integer ID called the map epoch, and all client-to-OSD and OSD-to-OSD communications are simply signed — well, actually not signed, they're tagged — with that epoch. So if a client talks to an OSD, or talks to any other component, [the two compare epochs, and whichever side has the older map fetches the newer one — that's how map updates spread lazily].
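For reference, the current map epoch is visible at the top of the OSD map dump:

    ceph osd dump | head -3   # first line reads "epoch <N>"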
B: So the question is: a client request has already reached a primary PG, and at that moment the PG's OSD dies — what happens? Is that a degraded write? That's another really good question. It basically boils down to — let me rephrase that — how synchronous is the replication to an OSD? A client write is only acknowledged by the primary OSD once it has reached the replicas as well.
B: So, in other words, if a write reaches a primary OSD and the primary OSD cannot replicate it on to a replica, then, as far as the client is concerned, that write never completes. And if the primary happens to go down, then the client redirects that write [to the new primary]. If it is the primary detecting that a replica goes down, then it will effectively...
C: We were dealing with recovery from some failures, and basically I was sitting on incomplete PGs. More or less, there was a setting in the CRUSH tunables for how many times to try, and I just cranked that number up to like a hundred, and all of a sudden my incomplete PGs went away. So I was assuming the algorithm was such that I hadn't given it enough choices — I think the standard is 50 tries — but it was like: guess why it was failing. There was no real quick, easy way to see.
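For reference, the CRUSH tunables — including what is presumably the retry setting referred to here, choose_total_tries, with its default of 50 — can be inspected with:

    ceph osd crush show-tunables   # JSON, includes "choose_total_tries"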
B: Okay, I'm afraid that's a question that's a little bit too specific for the tech talk, because I'm not sure it's immediately interesting to the rest of the audience. But like I said, I'll be happy to look into it, and I actually have a couple of follow-on questions as to how that situation came about. So again, if I could ask you to just shoot me an email — florian at hastexo dot com — I'll be happy to look into it.
B: [The checksum information used in] a scrub and in a deep scrub is basically information that's stored in encoded form in the OSD itself. I'm actually not entirely sure whether that is something the OSD still puts into file attributes, because a couple of releases back pretty much everything moved to omap — Patrick, you might actually have more information about that than I do. But the basic idea is that you have a separate set of metadata where you're keeping hashes of the data that you're trying to look for.
B
That's
how
you
detect,
which
one
is
actually
the
good
data
set
and
which
one
is
the
bad
data
set,
and
that
information
you
can
then
simply
use
in
or
that's
that
information
set
can
simply
use
when
you
repair.
So
it's
in
fact
it's
the
scrub
or
the
deep
scrub
that
actually
detects,
which
is
the
data
set,
that's
good
and
which
is
the
data
set,
that's
bad
and
then
sfpd
repair
simply
operates
based
on
that
information.
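And the repair itself (PG ID made up):

    ceph pg repair 0.1f   # fixes the bad copy based on the scrub results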
A: I get many questions on what those are, how to use them and how to plan for them. Once again, if you're looking for more information or more specific ideas on how to use placement groups in your cluster, definitely check out ceph.com/pgcalc — it's a very helpful resource — or you are welcome to ask questions on the mailing list or IRC. So, this was a great talk, and I will see you all next month for the next Tech Talk. Thanks.