Description
Videos from Ceph Developer Summit: Infernalis (Day 2.1)
04 March 2015
https://wiki.ceph.com/Planning/CDS/Infernalis_(Mar_2015)
A
So here we are with the first session, which is the discussion about erasure coded pool overwrite support. Sam, you want to take it away?
B
All right, so when we went ahead and implemented erasure coding last year, or whenever we did it, we made a choice to limit the interface to append-only, xattrs, and delete: basically the things that we can roll back easily without needing to stash a bunch of extra information.
B
So that works very well for the base tier in a cache tiering situation. It works well for radosgw. It works abysmally for RBD, since you need to be able to do partial overwrites with RBD.
B
So there's been some noise about trying to add overwrite support to erasure coded pools, so that you can run RBD directly on an erasure coded pool without needing either a cache tiering solution, which doesn't work that well for RBD, or needing to put up with 3x replication for your VM images.
B
So the short version of the reason why we didn't do this to begin with is that, unlike a RAID controller, we can't just use NVRAM to make sure we don't lose the updates.
B
So, as I see it, there are two sorts of approaches to how we do this: either we apply the update in place and atomically maintain a rollback log with the old data, or we journal ahead and then apply the update in place once all replicas have committed. We'll call those the rollback log and two-phase commit approaches.
B
So the rollback log would be an extension of what we already do. The way it would work is: when you accept a write, you read the corresponding data from the object, and you create a transaction that writes that data to a write-aside log atomically with the update that you write into the object.
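As a rough illustration of the rollback-log idea just described (the single-node model and all names here are invented; a real OSD would do this inside a replicated transaction), the old bytes are stashed atomically with the in-place update so the write can later be undone:

```python
class RollbackLogStore:
    """Toy object store that stashes old data so overwrites can be undone.

    The data update and the log append happen in one step here, standing
    in for the atomic transaction the OSD would use.
    """

    def __init__(self, size):
        self.data = bytearray(size)
        self.rollback_log = []  # list of (version, offset, old_bytes)
        self.version = 0

    def write(self, offset, buf):
        # Read the bytes we are about to clobber...
        old = bytes(self.data[offset:offset + len(buf)])
        self.version += 1
        # ...and record them atomically with the in-place update.
        self.rollback_log.append((self.version, offset, old))
        self.data[offset:offset + len(buf)] = buf
        return self.version

    def rollback_to(self, version):
        # Undo writes newer than `version`, most recent first.
        while self.rollback_log and self.rollback_log[-1][0] > version:
            _, offset, old = self.rollback_log.pop()
            self.data[offset:offset + len(old)] = old
```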
B
This requires, first, a read of the relevant piece of the object. Either we send the data to the primary and the primary packages the rollback information into the repop, or each replica performs the read locally when it receives the transaction. Either way, we still need to read the object before we can perform the write, and this adds some additional complications, like with pipelined writes.
B
You can't just read the object, since it might have a pending write on it, and you need the most recent logical version of the object, not simply whatever happens to be on disk. So this implies that we'll need to buffer any unstable portions of the object in the OSD, all of which is kind of doable.
B
So the other option is the two-phase commit approach, which at first blush seems worse, because we need to perform two commits and they're dependent. But we can reply that the write is complete when all replicas respond with a prepare. So once all replicas are prepared, we know the write won't be rolled back, and we can reply with success, as long as we make sure that any reads are blocked until the write is committed.
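A toy model of that two-phase commit flow (class and method names are invented for illustration): replicas journal the update on prepare, the client can be acknowledged once every prepare has been acked, and the update is applied in place on commit:

```python
class Replica:
    def __init__(self):
        self.data = {}       # committed object data, by name
        self.prepared = {}   # txn id -> (name, value), journaled ahead

    def prepare(self, txn, name, value):
        self.prepared[txn] = (name, value)   # journal the update
        return True                          # ack the prepare

    def commit(self, txn):
        name, value = self.prepared.pop(txn)
        self.data[name] = value              # now apply in place


def two_phase_write(primary, replicas, txn, name, value):
    nodes = [primary] + replicas
    # Phase 1: every replica journals the update.
    if not all(r.prepare(txn, name, value) for r in nodes):
        return False
    # Once everyone is prepared the write can no longer be rolled back,
    # so the client could be answered at this point; reads of the extent
    # must still block until the commit below lands.
    for r in nodes:
        r.commit(txn)
    return True
```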
B
So maybe the significant piece that I haven't gamed out here is that this changes the semantics of the pg info. Right now the pg info has a last_update, which is just sort of correct.
B
This adds another piece, which is a last_update_prepared versus a last_update_committed, with the notion that the versions between last_update_committed and last_update_prepared we might or might not choose to include, depending on which replicas have which values. So peering would need to be extended slightly.
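One purely illustrative rule peering could apply to the new pair of versions (this is a guess at what "might or might not choose to include" could mean, not the actual design): everything up to the highest committed version must survive, and merely-prepared versions are kept only as far as the surviving replicas can supply them:

```python
from dataclasses import dataclass


@dataclass
class PGInfo:
    last_update_committed: int
    last_update_prepared: int   # always >= last_update_committed


def choose_last_update(infos):
    """Pick the version to recover to during peering (toy model).

    Committed versions can never be rolled back, so the highest
    committed version anywhere must be kept.  Versions beyond that
    were only prepared somewhere; keep them only as far as every
    surviving replica can supply them.
    """
    must_keep = max(i.last_update_committed for i in infos)
    can_keep = min(i.last_update_prepared for i in infos)
    return max(must_keep, can_keep)
```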
B
Exactly, yeah. But I mean, the rollback stuff is almost transparent to peering. Yeah, yeah. This is a little different, and I think you're right, though; I don't think it actually makes a big difference. Incidentally, both of these, both the two-phase commit and the rollback log, only actually work the way I've described if the writes are full-stripe aligned. If they're not, then you still need to perform a read-modify-write. That's probably a livable restriction, though, since RBD could always choose to write out pages or whatever.
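The read-modify-write for a non-stripe-aligned write can be sketched like this (a simplification over an in-memory buffer; `stripe` stands in for the full stripe width):

```python
def rmw_write(data, offset, buf, stripe):
    """Turn a write that is not full-stripe aligned into a
    read-modify-write over the enclosing stripes."""
    start = (offset // stripe) * stripe
    end = -(-(offset + len(buf)) // stripe) * stripe        # round up
    chunk = bytearray(data[start:end])                      # read old stripes
    chunk[offset - start:offset - start + len(buf)] = buf   # modify in memory
    data[start:end] = chunk                                 # write full stripes
    return start, end
```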
C
Is it actually a good idea, or practical, to have erasure codes that are aligned to 4k stripes? No, no.
B
No, I mean, 4k sounds awfully small. But what I mean is, RBD already has an alignment that it chooses to deal with, and you can just choose to deal with the bigger one if that makes it happy. So when it performs a read, it always reads whatever the necessary stripe size is, which would be bigger than 4k. Or are you arguing that that will be impractically large?
B
Someone would have to do the benchmarking to find out how small a stripe size you can get before jerasure starts getting noticeably slower. So there is a number, and if it's as small as 4k, then you're right; if it's bigger than that, it's a trade-off. Yeah.
B
Well, hang on, I'm saying both of the approaches above also torpedo the way we do checksumming for erasure coded pools. Append-only has this nice property that you can just maintain a checksum for each of the shards, and you always know which one is correct during scrub, which I like.
B
If we go to this sort of approach, then we'll have to maintain checksums at a granularity of whatever the minimum stripe size or write size is, which is sort of a bummer to me. Also, any way we structure it, it's going to perform worse than appends.
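The checksum trade-off can be sketched as follows, using CRC32 purely as a stand-in (the classes are invented for illustration): an append-only shard extends one running checksum cheaply, while an overwritable shard has to keep a separate checksum per stripe unit, since an overwrite in the middle invalidates any single rolling checksum:

```python
import zlib


class AppendOnlyShard:
    """Append-only shard: one running checksum covers the whole shard."""

    def __init__(self):
        self.data = b""
        self.crc = zlib.crc32(b"")

    def append(self, buf):
        self.data += buf
        self.crc = zlib.crc32(buf, self.crc)  # cheap incremental update


class OverwritableShard:
    """Overwritable shard: checksums kept per stripe unit, so an
    overwrite only invalidates the checksum of the stripes it touches."""

    def __init__(self, stripe, nstripes):
        self.stripe = stripe
        self.data = bytearray(stripe * nstripes)
        self.crcs = [zlib.crc32(bytes(stripe))] * nstripes

    def write_stripe(self, idx, buf):
        assert len(buf) == self.stripe
        self.data[idx * self.stripe:(idx + 1) * self.stripe] = buf
        self.crcs[idx] = zlib.crc32(buf)
```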
B
It will be more expensive. So the question is: might it be possible to support RBD without supporting partial overwrites? One way we might consider doing this is by changing the way RBD lays out its data into some form of, let's say, four-megabyte block, which is essentially immutable, plus a journal of pending updates.
B
We could also update the rbd OSD class to do the coalescing. RBD would have to always choose to read full blocks, but the OSD could do the job of reading the block and the updates, coalescing them, and sending a single block back to RBD. RBD would then send small writes in the form of these incremental updates to the OSD, until it hits a full block or until it hits some heuristic.
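A sketch of the immutable-block-plus-journal layout just described (the names and the flush heuristic here are invented): small writes append to a journal of pending updates, and reads coalesce the base block with the journal, which is the job the OSD class would do server-side:

```python
def coalesce(base, updates):
    """Apply a journal of (offset, bytes) updates to an immutable base
    block and return the current logical contents."""
    out = bytearray(base)
    for offset, buf in updates:
        out[offset:offset + len(buf)] = buf
    return bytes(out)


class ECBlock:
    """Immutable base block plus a journal of pending updates.

    Small writes append to the journal; once a heuristic limit is hit,
    the block is rewritten whole and the journal is cleared.
    """

    def __init__(self, base, journal_limit=4):
        self.base = base
        self.journal = []
        self.journal_limit = journal_limit

    def write(self, offset, buf):
        self.journal.append((offset, buf))
        if len(self.journal) >= self.journal_limit:
            # Heuristic hit: rewrite the (append-only) block in full.
            self.base = coalesce(self.base, self.journal)
            self.journal = []

    def read(self):
        return coalesce(self.base, self.journal)
```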
B
Right, so that's the cost. The reason why I think this might be viable is that there's a paper from Microsoft from a couple of years ago, and this is exactly how they implemented their erasure coded block device thing.
B
Append-only blocks, and they have this log-structured block device, not unlike a flash translation layer. Actually, well, somewhat unlike a flash translation layer. Okay, very unlike a flash translation layer; more like what I'm just... yeah.
B
Actually, we can even... so we don't really want, for example, the CephFS fuse client to have to implement the same thing. So if we can abstract it into a library, that would really be the ideal thing: something that encapsulates the relevant...
B
No, the actual client-side thing needs to cooperate as well, because the client-side thing is already buffering whole objects. We don't want the OSD to read it; we actually may want to send it over the wire.
B
Which then librbd talks to. That's true, I mean, because we don't really want every librados user to have to implement this. Another piece is: is it practical for RBD, or whatever this thing is, to choose to cache object-size pieces, and is a four-megabyte EC object even a good idea?
B
No, but... right, yeah. Specifically for the RBD case, there are no appends; it's always an overwrite. Yeah, yeah. In other use cases you wouldn't use this; you would structure it so that you had write-once objects, because it's just easier to do that in RADOS anyway. Yeah, everything's easier to reason about when objects are... Oh, and the other catch is that everything I just outlined for RBD would only work in the situation where writeback...
B
To that effect, someone has to be caching the dirty state, or it's not a win.
D
Just spitballing here: what if you could take overwrites in as, like, a replicated type of operation and send that out, and then something behind that (let's just say you had one something behind that) would re-erasure-code it and merge it into the object. But you'd have to keep track that that's happened, so you can't read without doing that cleanup.
B
Well, actually, that one also poses a bit of a problem, because the problem is the reconstruct-on-read problem. If we don't turn the erasure coded object into a replicated object and then update in place, then you still have to read the erasure coded object and the replicated diffs to satisfy the read, which doesn't really help us.
C
The write latency on EC is always going to be much higher than on a replicated pool anyway, because you're touching so many more nodes, right? That's true. Like, you'd kind of want to do these writes replicated. It feels like the quote-unquote right solution to this is really a cache tier that does partial objects. It's super complicated, but that's the thing that actually sort of avoids all these problems.
B
Oh, that's a problem! So if you do a partial promote, you can't flush the whole object.
B
I mean, the two-phase commit one does have a lot of virtues. It's deterministic, it's strict, it does basically what RAID does, so people are used to the costs associated with that sort of... well, not all of it. In this case we only have to buffer unstable objects; only unstable extents, actually.
C
Which isn't so bad. And does it really have to be stripe aligned in that case, or can it be...?
E
So one thing that might help is the mirroring stuff: since we actually have a journal of writes, and it's replicating to a potentially different pool, we can actually do the writeback from the journal and do the read-modify-write at that point, so we're not making the extra latency of that read-modify-write cycle to an EC pool visible to the guest.
E
I mean, you have to have some local caching; it doesn't have to be an exclusive writer. It...
E
But that might be tolerable. We could certainly try it out, prototype it, but you know, with padding to the stripe size... we don't have enough data.
B
Okay, does anyone have any sort of input? I don't know if anyone's paying attention or has thought about actually using this in real life. Half the reason I wanted to have a session on this was to see if anyone had opinions on, or insight into, anything they want to use it for that these don't really address, or that one of these does address.
B
You know, if you could reference a previous journal entry in the journal, then at least we could avoid... no, it's not worth it. I was thinking that, because you're writing the same extent to two different objects, hopefully close together in time, if you knew that the previous extent was still in the journal, you could reference it by a unique id. But no, there's no way to... there's...
B
No, that would do it. So, oh, I see what you mean. Yes, that's true, but the clone-range one does kind of require an fsync.
B
I don't know that it's going to be done. I don't think it's a good idea to implement this unless we have someone willing to say: yeah, I am a driver for this thing, and I would consider it successful if it accomplished X with Y properties. Right, yep, exactly. Otherwise we're just implementing something which might or might not be useful, just for someone, maybe.