Ceph CDS G/H, 24 Jun 2014

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: CDS G/H (Day 1) - OSD: Locally Repairable Codes

Description

https://wiki.ceph.com/Planning/CDS/CDS_Giant_and_Hammer_(Jun_2014)

24 June 2014
Ceph Developer Summit G/H
Day 1
OSD: Locally Repairable Code

A

Alright welcome back, hopefully stretching our legs hope to wake us up a bit the next one on the back. It here is another OSD discussion about local, irreparable codes, this one's going to be buying. You look, look, you want to go ahead and take a pass through the blueprint and get it started.

B

Hey um so um did I start now, yep far away. Okay, I will start for the benefit of people who are not familiar with that with a short introduction, which is um at the URL which I passed in IRC.

B

So the idea of local, irreparable codes is basically to take original code and apply it recursively to a subset of chunks. That is, we compute coding, chunks for, let's say 10 blocks and we create for priority blocks or coding blocks and then for each five blocks. We compute one more, which is presumably located in na place.

B

That is the same rack of the five other blocks. So the idea is that if one of the blocks is missing out of the six that you see on to the left of the drawing so X 1 to X 5 + s1, if one of them is missing, let's say they are all in one rack.

B

In theory, you only need to move blocks within the rack. You do not need to cross racks boundaries and go fetch the box that are in the next track, which may save you bandwidth and is that's basically the goal for local, irreparable coat. It previously was called pyramid because well, I, I, don't understand much of all that. But but my understanding is that permit code involves more complex, mathematics tricks to do the that kind of things.

B

But I went for a simpler approach, which is basically using existing area code plugins to do all the encoding and decoding and just engineering them together to achieve this locality, and that's what I'm going to explain and that's in the pad this time.

B

So during the giant summit, I propose a way to do that which was inconveniently complicated, to explain and hopefully I figure that something that is simpler. So in the pad you see chunk, number and 027, but the idea is that, let's say we have an original code crush full set that gives us 80 s DS. So let's also say that half of them are in Iraq. The other half are another in another rack, so we want to apply locality so that we can recover for our within Iraq. The step one would be to do.

B

The global encoding are using the two racks. So when you see in the line under 027, you see addy, it means the original code plug-in is going to use the soil to saw data and then the plugin computes coding chunks, which are see and the coding chunks are stored in the corresponding OSD. That is for step one. We have one cutting chart in OSD one and one cutting chunk in OSD five.

B

Then, once we have that reply, step two, which is to compute an additional coding chunk designed to leave exclusively in one rack, so we assume that the g 0 1 2 3, are in rack, and for that we take the content of the OSD 123 to be data, and we compute a coding chunk that will be stored in OS, d0 and step three. We do the same in the other bag, where we take the last 30 s DS, to be data and store. The coding chunk that is produced in OSD, for does that make sense.

B

C

Like this, so much more than the the Lassiter's civil Z's last erotica and get exhausted, ok is.

B

B

Now, for that to happen and I move to bring my remapping of data chunks, which is a chapter after the references I proposed to shortly discuss with Sam the idea of delegating the mapping of the data chunks to the original code plugins, because- and this is a prerequisite to the implementation of the local irreparable codes, but it is independent, actually I think it it can be applied to any original called plug-in.

B

So at the moment, we assume that, when the crash wall gives us a 80 s, DS the first k ones will be used to store the data chunks that are chunks, but that may not be convenient if we want to do what we want to do so, let's say we: we decide that the code g, our radio code back end, is able to query the plug-in and ask it. Where are the data chunks? We still work in there? The assumption with that? We have systematic codes, that is the data.

B

Chunks can be competin ated together and to reconstruct the data, but only we do not require that these are the first chunks provide the first OS DS provided by the crush will set. Instead, we choose which one are going to be used.

B

We do not go to the extent of choosing a position that is not stantial, that is the D in the crash map that I give as an example.

B

The first d will have the first twenty five percent of bytes of the object of the stripe. If you would like and then the next D will have the next 25 etc, we do not have a way to control that the beginning of the data will go at the end and so on. It does not seem to be necessary for that. I propose the pull request, which is 1911 and contains a small change to the EG back end, which is just in the linked after the public west.

B

Do you have some may be more comments about that? You you didn't like it it you you, but we didn't get a chance to discuss it more. Oh I.

D

Think I remember that you can same flourish. I I did have an objection, but you you finna telling is all pretty good. I forgot, Lil also.

B

E

B

I'm sure I'm going to ask you to repeat but worry about, realize.

D

I I wanted to insist on the first. Edit being, did you pointed out that that would be annoying when sitting on thresholds right yeah yeah? You were right that I was going over.

B

By I can hardly hear you not unfortunate, but but.

D

You will you were right. It would be annoying to set up the trust rules to insist on the first n, acting sub-plots being all editions. Yeah, okay,.

B

B

So the next item is the syntax there, so that story yeah, that's the way it will be used by the system administrator now and that's what I lay up after a locally rhetoric, little code plugin.

B

So they are it's within a profile original code profile. There would be one additional key, which is layers that will contain the strings that I explained first, only within a JSON object and the the string that follows would be the specification of the area code plugin to use. So the idea is that the lrc plug-in does not actually implement anything. It relies on another plug-in to do the actual encoding and decoding I, specifically in an empty.

B

It means use whatever default you have, and then you can also change that for something else, such as your own plug-in that you want to try and then there would be the world step will set steps. So I had a hard time with those I see what what has been added. The thing is at first I thought it would be more convenient to specify something related to the wolf set at the same time as the specification for the coding, chunk placement and so on. But it's it does not map.

B

So I propose that we specified in a separate variable now I chose not to describe the wolf set, but instead stick to what is strictly relevant to the layers. So the idea is is to help the system administrator who want to try something that is not too far from the example that is in the documentation just to tweak it a little. But in reality, if you want something that is advanced, you're, more likely to create your own will set from scratch and not use wolves.

B

That will set steps to express exactly what you want with all the finesse of the syntax, but we could add, take and limit and all the steps that I am balance. Why not I don't know.

C

Yeah cuz I can always customize this profile right. They can always change it or thur will modify this profile for their particular cluster, so they would have liked lrc.

E

C

Racks or something would be their profile and they would specify yeah.

B

But cure you're right, actually I agree with the tech default, because otherwise it would be the yeah. There is another argument which is a failure domain, but it it's a better fit here indeed, but.

E

B

Yeah I I, don't like this yeah Buffy's, what you say: I, don't like it in French.

B

Because it suggests that oh yeah, I know that I can do that in this text. Oh, the syntax must be supported completely somehow and.

C

Well, I mean if actually it's just a rule right. Those are all those are the complete steps. It's a full set of steps. Oh yeah, you could have one of these. That's like you know you could have multiple emits if you had a weird rule and.

B

You could also just use the crush syntax exactly it instead of trying to make something yeah, maybe it's better.

C

Or I mean, does it actually need to be here? Can they just not have a two-step process up here? They create a rule.

B

C

Create simple but let's see and then.

C

Great simple or whatever it is whatever the command is. Oh yeah,.

B

Here it is I, do actually yeah.

E

Because yeah yeah.

B

It's a good idea, because what happens is, if you do not say will set, will set steps. It will just create a role that is fit for original coated pool and it will work only. It will not be spread among fellow.

C

B

Interesting but it will I can't work so for the purpose of just trying it. That's probably just not necessary well.

C

If it is, if this triggers the path and the honor that's out of creating the default tissue rule right now, it's just hard coded to specify whatever the default failure. Domain type is that's in your configurator right. It creates a it, creates a generic and pull a racial rule yeah right, so we can extend that code so that the plug in the erasure plugin, if it sees it there, multiple layers. It's like. Oh, there are three layers well, actually any so how many really domain server, yeah yeah.

B

It's hard coded in the area code plugin not and not elsewhere. So yes, it will be delegated to RC and all our she will come up with something yeah.

C

Okay, so I mean that we could do that right and, and there were we would fight us- have to have a at a config option or something that's that specifies the two different failure. Domains default fail domains right now, there's a default. There's one config option for failure to main I mean the.

B

Put the whole thing host.

C

Yeah Oh there'd be like a host across default, choose leaf, type or whatever to or something I don't know.

B

C

I mean it whatever yeah in the meantime, I think just doing the real set.

B

C

Did you get the implementation in and we might need to make the the create simple command something besides great simple, that makes a two-tier.

C

Instead of create simple, be like create two tier rule,.

E

E

B

Guess yeah maybe I will I will try to to make a fake command and ask people if they understand what it means before trying to implement it. That is what I have in mind is of not what people are not like. This yeah.

C

Okay, that sounds good.

B

Okay, so, uh but that would be exposed to this is admin and you would he or she would just use all our see profile and.

B

Ok, coolio fiscus likes it, which is good Africa.

B

Now I wanted to go over what happens of regarding the remapping of chunks and specifically, what happens when we decode things I think it makes sense, but earth I think it makes sense, but I never actually went into detail about it with anyone.

B

So it starts for recording we take the first layer, then we take the second one and then last one and we apply the encoding to the result, to the results of the previous one. Of course, the constraint here is that the season mean has to know that he should not come up with layers that override the results or the data.

B

The original data in the proposed implementation, I implemented the values, safeguards to prevent mistakes, but this is a little tricky, but but to think about it, it's really simple, izing regarding and coding now what happens when we try to decode is a little bit more involved, so it goes from autumn to top so it let's say we want to decode 143, which is the second chunk and.

B

Then it's missing, but the other ranks from the same line presumably can be used to retrieve it, and we only need one layer to do so. It turns out to be the second layer, so the layer that starts with c DD and then because we have recovered all we need, then we can stop the iteration.

B

So we do not need to consider even the first layer at all, which is the goal of course.

B

Now, if it turns out that we miss one chunk from that, can be recovered from the two independent local layers, each of them will recover the chunk and again ah iteration will stop before reaching the first layer, which is the more generic one. But that's the second case now. It may be the case that chew chunks are missing and they are both in a local layer which is not able to recover them, because in this case we have local layers that can only recover one music Chuck.

B

So in this case, there is no other way, but to come to the first layer, which is the only one able to recover from the last of 2 2 chunks. And now there is the last recoverable kind of error is when you have a free chunks missing.

B

Two of them are in a local layer and the third one is in a local layer. So the third one is a local layer and can be recovered because there is only one missing so because of this local layer.

B

You only have two missing when you climb up the layers, and so when you which layer, one which is able to cover two missing chunks, then you're in luck, because you can do that, and the last case is when you cannot recover, because you have three chunks that are missing from in a place where you cannot combine the effects of the layer together to get all the all the chunks back. So.

C

In that last case, if 10 to 1, 43 and 106 are missing like the one or two is a coding chunk, we always rebuild that one 143 and 106 are one of the data chunks and one of the coding songs from the first step. So we should be able to reconstruct both of those from everything from step. One right like by actually moving forward down the steps instead of in Reverse I.

B

um I heard one word out of two: it scares.

E

C

Because I wonder.

B

If you you hear me the same or.

C

Not oh yeah, you sound fine, but you talk slower than raised to know so the if, if this uh toe in that last case, you're missing, 143 and 106, which is two chunks out of that first step where you have for data into quick coding, so out of those six you're still only missing two, so you should be able to rebuild by going forward by taking what I guess: 177 223, 285 and 207 can't you move forward, starting from the original original data, the original remaining chunks and coding chunk from step. One can't you rebuild.

C

143 and 106, and then after you rebuild those two, then you can rebuild 102 I.

B

Will have to assume that what you say makes sense, because I cannot understand. Okay, start too many got you. One word out of two is missing: that's just serve okay, maybe I should call back or something.

C

Some are you, are you following this, but.

B

Maybe if you write it, I them forever, yeah everybody working up which much give you that that must be made.

B

Let me find another phone.

D

With which that we don't know I assume this is the th locally available use ago. Yes,.

C

A line, 75 is the question so in these, in these other examples of repairing the leak is sort of working backwards with the steps, but you can also move forward like if you, if you lose a coding, chunk from an early step, and you have the data chunks or you have enough of that layer to recover them, but you can use any of these layers effectively to recover at any time. Assuming you have, you have enough chunks, it's.

B

Not necessarily.

E

C

D

I'd like to think about these okay.

B

Can you hear me yes, hello, yeah, yeah, oh I can hear the matter. That's oh yeah, Wow, good.

C

Okay, okay, good, that's better! Ok, so I'm line with online 75 wrote out my question: I guess so: I mean it. Your other examples make sense, but it sounded like you were suggesting that you had to sort of work backwards through the layers in order to reconstruct everything, and it seems like it's it's simpler than that almost set at any point in time. You look at the chunks you have and the chunks you don't have, and you can look at any. You look at every layer and you see.

C

Is this layer able to recover any missing chunks based on chunks that I have and if so, at what cost? And then you just pick the one? That's the least cost, or something something like that. So so in this case, like in your last example, 143 and 106 are missing. But if you just look at just look at the first step layer, one you're missing two chunks and you have 4 remaining there. You can do one recovery in theory. You could do one recovery to to build those.

C

You couldn't you can't use the other layers, because you don't have enough of the pieces like layer. 2 isn't right because you're missing three out of four: they are one of enough. So.

D

Louis I assume you you're, addressing specifically the algorithm poorly written Santa.

B

Week, algorithm all right.

D

Well, I agree, that's age, it does occur to be recoverable, though, is it just not recoverability make certain assumptions or it's.

B

Not for culpable in the way I implemented it I'm.

D

Not saying it's.

B

Not recoverable as all right yeah, that's.

B

C

B

D

Yeah, when will the local recovery code part won't be able to recover a lot of the time? That's almost the point. Actually. So, if you start with the local recovery code portions, you will frequently find yourself in a situation where you can't write yeah.

B

But the same is oh yeah, I. Remember why I went this way and because the the primary goal is to avoid sorry sitting too many chunks, so right.

D

Yeah, so you give it a template or three then essentially or two then take later one right.

C

Or not not, even not necessarily because it could be that you could. If you go in one direction,.

D

C

I get my paper right, so it seems like you should look. You should build a you, should do a search and look at all the possible recovery plans and then I'll associated costs with each one and then decide which one's the least cost.

C

So in layer, if we use layer one, for example, we have to read, you know for chunks, one from the local rack and three from a remote rack and whenever I think weep at a cost function. That would that would associate what the I/o cost is from that subway x, 3, x, 2 plus 1 times 1 or something, but it could be that there is also a way that you could use the local rica like to local recovery codes, say if it was depending on how the layers were.

C

If you imagine like a complicated one, maybe there's a two-step computation, that's all local I/o and more cpu. That would also get you the same result.

C

Does that make sense? I mean not in this example, because its SIA.

B

To erica and actually thinking about it, if we try to recover using layer of 3 and layer 2, which will be more costly than just using, let your one when you miss to changa, oh yeah, I agree right. That's actually what I'm actually.

C

Drafted and that's kind of the point actually cuz yeah. If you lose one chunk, you could recover using either layer, one or layer to do you lose the third slot. There are two different possible recovery plans and you need to put a cost associated with them and decide like I. Think we I think the problem with what you're doing before is you were just assuming that later layers are cheaper and instead you should just look at all possible recovery, Solarius and a sign of cost and and minimize that cost.

E

C

B

Sort the layers so.

E

C

Why not sort the layer? So if you do it in reverse order, it'll always be cheaper and then, if.

D

You have do you can go the other way if you can't resolve it, the trouble is I. Don't think it's that simple, because, even in the case, where you're missing too it might be that using layer, 2, &, 3 separately so faster, it depends on the cost of cross of crossing the recovery yeah. In other words, we want the weighting function to be expressive enough to capture that case, so we have to actually consider it and.

C

I think it's also going to also going to change over time to like the current or the initial implantation is going to have all this done by the primary and so using layer 3 to recover something in layer. 3 is going to be expensive kind of no matter what you're sending it all between racks and then back again. Oh yes, right so I think I think having and so being able to have that cost function and express it. I think, is going to be important.

D

Yeah, but I mean, even though I think you can cost function, we, you can't just assume that, just because you're, using two layers that one that it's cheaper across country now will later be expressive enough to handle to cross, refuse yep right, yeah.

C

Although, probably in most simpler cases, an ordering will be mostly right, but whatever I degree might as well just I. Don't think it's that much work to sort of told I.

D

Sort of all possible- okay, so I, think ok, so there are two things with the recovery plan. We want it to be, not wrong. We want it to be that slow. So first, I think, is to actually generate all possible recovery plans and find the shortest one. Then we adapt whatever not hideously expensive heuristic. We come up with to replace it yeah yeah. Well, by writing the proof read forcefulness, I think the right or so and.

C

I think the brute force is going to be not that expensive too, because in general feeling.

E

At all possible.

C

You only have to redo a recursive descent once you find a layer that will recover at least one missing chunk, and there aren't gonna be that many they're gonna.

D

Come Taylor she's gay frequently be too.

C

D

But reads you friendly.

C

About 13 but you'll recurse and then they'll only be one whatever you'll have one less, whatever I think it's not going to be like a at an expensive iteration, rational I. Guess that's my point. No.

B

D

Don't agree yeah.

B

Yeah: okay, okay,.

B

Makes a lot of sense.

C

B

And did I have something else.

C

It's taking advantage of locality, yeah.

B

The twist it's an open question: what are your thoughts about making? So recovery can happen locally? It's.

D

Actually, extremely straightforward, um all of I don't think, there's any reason to push the ordering consistency stuff out of the primary I. Think the only change is that, instead of pulling the relevant pieces and then pushing them, the primary sends the lack of a better word, a remote wait. What's the word, our movie method call and an urgency to the rapidly with the relevant information that it just does yeah, but I, don't think it's conceptually heard of all.

C

Yeah and though it'll throw up its hands if there's a peering event like everything else, right right,.

B

Yep, okay and I.

D

Think we were asserts about messages coming from primaries. Are we attribute those yeah these messages on before it gets that's the hardest part.

C

It doesn't need to act as some sort of subordinate primary.

E

For its I mean.

D

Like it could be in charge of somehow it's a weird way ordering access to this object, I, don't even see an advantage. I think if primary just doesn't want to do in a dozen yeah.

C

I think that the important part is to remove the data path from rasta center, but, like the game, the primary rattlesnake.

E

C

Calling the shots.

B

Ever connect immediately because you're working up again I get I. Guess it's something to do with the duration of Michael will be back in when you forget.

C

Or okay, hey damn you, you remember when we read the facebook lrc paper, where I mean it's same basic idea, but they used a specific choice of the matrix. That means that one of the encoding matrices, the local coding matrices, can be inferred.

D

Well, remember that what they hang on, Ed I, preach I think it was the Microsoft when actually it was the issue of paper I.

C

Wasn't one of us? Oh, I thought it was Facebook hadoop LFC re, so they be carefully chosen treatment right. You.

D

Have you have nine data trunks and one lrc for each one? Oh no sorry I needs to be more needs to be 12 day trunks, and one lrc for for each group of three, which gives you four LRC's, but one of them.

C

Is fighting to look wait? Okay, one of them can be inferred somehow. I cant rember how they did that, but I'm just wondering if that can be captured in here again I layer, 2 or if it would just call us into one big layer. Basically, do you remember what the wood ends? It's one less! It's one less chunk to store. I know.

D

And do you remember how they did the influence.

C

No, in fact, it's a diagram. That's in that blueprint. Ok, let open.

D

C

Third, third parody block has done it, but it says it's implied.

D

C

Okay, it's little print for this session, put it in a channel.

D

Yeah, the page doesn't list a blue pro. This do you think? Oh okay,.

C

It's in the at this pace of the link near su channel, but that figures from the paper ah yes.

D

The vents under the work.

E

C

I think, basically, it's that that that parody block could be reconstructed from s1 and s2. Also somehow it oh yeah, that's or all the other I remember whatever, but it dad to soar. One less I'm just wondering if not that it's that important, because, ultimately I think the flexibility is probably more useful in the overhead is like going to be like three percent or something. But I'm just wondering if this particular encoding scheme could be captured within the framework yeah.

D

I got the week without money, I'm back yeah.

C

So was also at all clever. Did you hear that? Do you hear the question week now, the so the in the facebook paper, the local, irreparable code that they did? This is the paper where the figure came from that you've got in the week in the blueprint they didn't have to store the final implied parity block because they used. um You know weird linear, algebra trick to like make sure that it was carefully chosen to be inferred from the other parody blocks. Basically, something like that.

C

I can remember exactly what the details are, but I guess the question is whether and that particular tricky scheme could be captured within within a framework that you've described. Oh, I.

D

See the process of living together, p1 through p4, happens to result in something that is applied by the original Ten file box and their low comparative system. Okay, right.

D

We should just use a related level, one level two, you just use algorithms hanukkah.

B

D

Do such that you got the right answer right.

C

Yes, but but they're fine.

D

Because you have two possible ways of generating the same parity black there's, no way to phrase that in this language, yeah.

C

Right, I guess that's yeah, that's kind of what I'm reading it. You're.

D

Right now, well, I mean actually I guess if you think of each layer as a parity declaration, but yeah you can't, but if you think of it as a logical dependency graph, then sure you can, you just add an additional phantom layer and you just make sure that whatever piece of code is invoked knows what its role is going on. I mean.

C

It's kind of like muscle, hate muscle, name, yeah,.

B

I'm sorry can't hear you again, I guess I will have the offa me.

C

B

You consistently, I will have to listen to the record.

C

Oh, we can I mean this is I. Think at the end of the day. This isn't that hard, but it would be kind of. It would be nice if we could capture that, but it's not the end of the day, but we can totally say one Louis.

D

B

You really now I can yeah okay, I'll, try.

D

To talk slowly, can you specify different coding algorithms at different levels? Yes, can.

B

You act around a quote: I each level harm president a plug-in.

D

Right, so to describe the diagram in the blueprint it seems like level one would define 10 data blocks and for parity blocks level. Two would design five data blocks of one parody block of a 3-wood to find five data blocks in one parity block and level 4 would define for data blocks in one paragraph.

B

Right now, you're at now, your break now you're, both again okay.

B

E

This is a very common problem. Mr. phone interface from blue dream song I wouldn't be surprised if getting better evermore wellston over a time. It's very common mething, okay,.

B

I so I'm not sure what what the question is that, yes, you could choose different plugins, a different level, rider player. ah We do here to capture.

D

This one, we would need to be able to define a phantom garrard, which is never actually stored anywhere. It can be used as an intermediate result.

B

B

C

Wonder that I mean- maybe you just you just put something: that's like off the end of the crush array-like crush is giving you what.

D

C

14 and you just put something in the 15th position yeah, and so you can. You can calculate it and use it as a result, but whatever yeah but.

D

Then you can give it never gets to equality to generate awesome.

B

Okay, yeah, oh.

C

Okay, well, that's something we can move on and talk about it when Louis phones, working alright.

A

We're halfway into the next session anyway, yeah.

C

A

So um do we have anything else we want to add to that one. You want to jump into the race you're coding I mean the two kind of seem to blend together their.

C

Other thing, yeah yeah.

A

All right well me.

A

Let me switch it over. We can at least breeze through the discussion, and maybe we can hear from the google Summer of Code stuff.