Ceph CDS Jewel, 3 Aug 2015

From YouTube: CDS Jewel -- Sloppy Reads

Description

http://tracker.ceph.com/projects/ceph/wiki/CDS_Jewel

A

Right now, it looks like we're on to sloppy reads, so I guess take this one away, Sam, and we'll make up the balance in the break here. Okay.

B

Yeah. There has been some interest lately in, well, as always, reducing latency on reads. Beyond the things we talked about before, there are some architectural choices that make it difficult to guarantee latency. For example, if we can't be sure that we have proof that we've seen every write that's been committed to the client, we have to block reads until OSDs come up that let us do that.

B

That would be the period where you're stuck in peering or inactive or down, and that's sort of a bummer. That's just a consequence of the consistency model we have. So there are use cases where it might be valid to not do that. For example, if you have a use case where every new object is created with a unique name and is never recreated and never modified, then if you find anything with that name, you can be sure that it's the right thing. There's no possibility

B

that you have a bad copy. So in a situation like that, it might be worth it to allow the client to read from a replica, or an OSD that happens to have a copy of the PG, that is not currently active for that PG. If the replica returns, you know, ENOENT, I don't have a copy, then the client can simply retry later on a different OSD.

B

But if it gets a success and was actually able to read the object, then it knows it's good to go. So I guess what the discussion should be about is: one, do any of you have use cases where this would be useful? Two, how do we want to make the interface, hypothetically, for something like this work? I think it would be prohibited on, say, a standard RBD pool, because there's no way to do that correctly. I can't think of a valid reason to perform this sort of read on an ordinary RBD pool.

B

So it seems to me that it might make sense to either make it a new pool type which just happens to use the existing code or a parameter on a pool that, when you set it, also disallows mutation of objects. Things like that.

B

That would be nice in that it would require the user to make a decision up front that a pool will have immutable objects and allow sloppy reads, and it would make it difficult to misunderstand or misuse it later.

C

Maybe if you're doing copy-on-write from a read-only disk, then it would be useful for RBD.

D

I'd hesitate to make a new pool type, because it's much harder to use with existing images or existing systems.

B

Okay, so you're arguing that when you freeze an image to be used as a parent, you wouldn't want to move it to another pool or anything like that. That would be tedious.

D

Right. If you want to enable this optimization on an existing pool, you'd like to be able to do that instead of having to create a new pool type. Or if you don't know about this yet and then want to use it later, it would be nice to not have to copy all your data over to a new pool to do that.

D

For any new feature that doesn't actually require something like a data encoding change or data movement, I would really strongly suggest not requiring a new pool type for it.

B

You definitely wouldn't, well, you probably wouldn't be able to do an object class operation in this situation. Would that make it difficult to use for the parent pool?

D

Would be annoying but not impossible.

D

And for the parent pool it actually doesn't matter. We're not using class operations there to do the read, just in the clone when we're actually writing out the data, so yeah, it'll be fine.

C

Could you give me an OSD flag, like, you know, that I'm willing to read dirty?

B

It would be a flag on the client op, for sure, yeah.

C

So if that's the case, then if the client was able to tell you, "I'm willing to take, you know, a potentially older, dirty copy of this object," you can make it work.
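
A per-op flag of this kind might look something like the sketch below. `OpFlag.SLOPPY_READ` and `osd_should_serve_read` are hypothetical names invented for illustration; no such flag is confirmed anywhere in the discussion.

```python
from enum import IntFlag


class OpFlag(IntFlag):
    """Hypothetical per-op flags; SLOPPY_READ is not an existing Ceph flag."""
    NONE = 0
    SLOPPY_READ = 1


def osd_should_serve_read(pg_is_active, has_local_copy, flags):
    """Decide whether this OSD answers a read itself.

    Normal reads are served only while the PG is active. A read carrying
    SLOPPY_READ may be served from whatever local copy exists.
    """
    if pg_is_active:
        return True
    return bool(flags & OpFlag.SLOPPY_READ) and has_local_copy


blocked = osd_should_serve_read(False, True, OpFlag.NONE)
served = osd_should_serve_read(False, True, OpFlag.SLOPPY_READ)
```

Putting the opt-in on each op, rather than on the pool, leaves the choice with the client that knows whether a stale copy is acceptable.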

B

Though the other danger is exposing uncommitted writes, so we'll have to think about that a little bit. Maybe just disallow reads past, what is it, last clean on disk or last complete on disk in the log? That would probably do it.
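
One way to picture that bound, as a sketch only: versions are simplified to plain integers here (an assumption made for brevity), and `StoredObject` and `may_expose` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    name: str
    version: int  # version of the last write applied to this local copy


def may_expose(obj, last_complete_ondisk):
    """True when the object's latest write is at or below the bound.

    Writes newer than the bound may not be committed, so a sloppy read
    must not return them.
    """
    return obj.version <= last_complete_ondisk


safe = may_expose(StoredObject("a", version=10), last_complete_ondisk=12)
unsafe = may_expose(StoredObject("b", version=15), last_complete_ondisk=12)
```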

B

Do you guys get a lot of complaints about the copy-up process taking too long? Isn't that dominated by the data movement?

D

Yeah, I mean, people don't generally complain too much. It's.

D

It mostly is dominated by the data movement rather than the read. The write latency is much higher because you're actually writing out the four megabytes to several journals on several OSDs, but in the general read case, we're not doing copy-up.

D

Greentech swallow reads: expert in this is a business them as a board. Neither.

B

And RBD is never going to recreate the same block in a different image, right?

D

Not with the same exact object name, no.

B

Actually, if we did allow class operations, you could add a guard that said: as long as the metadata on this object indicates that it's been frozen, then it must be the correct version; otherwise, return failure. But then you'd have to actually go through all the objects and freeze them. Do we.
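
The read-side guard could be sketched like this. It is plain Python standing in for an object class method, not the actual object class API; the `frozen` metadata key and the choice of `-EAGAIN` as the failure code are assumptions for illustration.

```python
import errno

FROZEN_KEY = "frozen"  # hypothetical metadata key, not an existing Ceph xattr


def guarded_read(obj):
    """Return the data only if the object's metadata marks it frozen.

    obj is a plain dict with "data" and "xattrs" entries standing in for
    an object on an OSD. An unfrozen object yields -EAGAIN so the caller
    can fall back to a normal read.
    """
    if obj["xattrs"].get(FROZEN_KEY) == b"1":
        return 0, obj["data"]
    return -errno.EAGAIN, None


frozen = {"data": b"block", "xattrs": {FROZEN_KEY: b"1"}}
live = {"data": b"block", "xattrs": {}}

ok = guarded_read(frozen)
rejected = guarded_read(live)
```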

D

We're reading from a snapshot, so it's naturally, by definition, frozen. Oh yeah.

D

Trivial for RBD and, you know, the parent.

B

You're right. Okay, so that's a good point. As long as you're reading from a snapshot, you can be guaranteed that it's... well, unless you reuse the name. Do you ever recreate objects? So your RBD images are a sequence number and an added block number, and the sequence number is never reused, right?

B

Hook up about the sorority, that's interesting.

D

And if you're reading from a snapshot, it doesn't matter whether the object has just been recreated; the snapshot either would exist or would not exist.

B

Excited you sloppy.

B

This will probably also be usable for RGW, yeah.

D

That may be the case where it's perhaps more useful, because the objects are immutable there and they're never overwritten directly.

B

In that case, would it make sense to add an operation that freezes an object? Snapshots would be implicitly frozen, of course, right?

D

It would have the extra cost, you know, later on. Yeah, that might make sense.

B

Except, hang on. You're reading from a snapshot, so you might be... no, you'd be reading from the head object.

D

For RGW? No.

B

No, no, for RBD. Most of the blocks would be read from the head object.

D

At the FileStore level, yes, but at the OSD level the RADOS request is still to read from the snapshot.

B

The RADOS request is to that, but the head object would be present. I'm trying to decide whether it's possible for you to be missing an I/O that changed the state of the object before the snap, because if you're reading from head, necessarily that object on that OSD hasn't seen that snapshot yet. So you don't actually know for sure that there wasn't another I/O that happened between the object you're looking at and the snapshot that was requested.

B

You only know that if you've actually created a clone.

D

Yeah, I mean, really it's taken care of at the RBD level, because it's essentially flushing out all the writes before you take a snapshot. Well.

B

That's the problem. That's why we can't tell. If you were halfway through flushing the writes and then things changed, the old replica would have an object at whatever version and would then receive a read at a particular snapshot that was in the future of its current snap context on that object. It wouldn't be able to tell the difference between that case and the case where it had fully flushed out all the writes.
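
The ambiguity can be made concrete with a toy model. `ReplicaObject` and `sloppy_snap_read` are hypothetical names, and snapshots and clones are reduced to integers and a dict, which is far simpler than the real snap context machinery. The point is only that two different histories leave identical local state.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaObject:
    head_data: bytes
    snap_seq: int  # newest snapshot id this copy has seen in a write
    clones: dict = field(default_factory=dict)  # snapshot id -> cloned data


def sloppy_snap_read(obj, snap_id):
    """Return (data, certain) for a read at snap_id on a possibly stale copy.

    certain is True only when a clone exists for snap_id. Otherwise the
    head is returned, and it may or may not reflect every write made
    before the snapshot was taken.
    """
    if snap_id in obj.clones:
        return obj.clones[snap_id], True
    return obj.head_data, False


# History 1: a write of v2 happened before snapshot 5 but never arrived here.
missed_a_write = ReplicaObject(head_data=b"v1", snap_seq=3)
# History 2: nothing was written between v1 and snapshot 5.
fully_flushed = ReplicaObject(head_data=b"v1", snap_seq=3)
indistinguishable = missed_a_write == fully_flushed

# A copy that already cloned for snapshot 5 can answer with certainty.
cloned = ReplicaObject(head_data=b"v3", snap_seq=5, clones={5: b"v2"})
guess = sloppy_snap_read(missed_a_write, snap_id=5)
exact = sloppy_snap_read(cloned, snap_id=5)
```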

B

Tricky. For us to be sure, you'd actually have to go through and seal them.

B

Would that be terrible, or just not great?

D

In general, you also don't want to seal them for RBD, because you could potentially have people taking snapshots and also continuing to write to the image. I suppose you could say you're sealing this particular snapshot, perhaps as part of taking that snapshot. That's also not great, because then you're basically making the snapshot O(n) in the size of the image, instead of.

B

No, that's what I don't... well, yeah.

D

Yeah, we definitely don't want to make snapshots greatly slower just for that.

B

Not all snapshots, just the ones you want to be able to clone from, but I take your point.

B

That'll have to be something to think about.

B

Let me write that down, because it hadn't occurred to me. Yeah. So then the.

B

Okay, does anyone have any other thoughts or use cases that you think would be worth capturing?

E

So I have a general question. This would work only for replicas, right? When you're replicating, I mean, not for erasure coding.

B

It could be extended to work with erasure coding. There's no inherent reason why you couldn't do this with an erasure-coded pool. What you'd do instead is, of the OSDs you know about, you would pick some subset such that it covers enough of the shards that you can do a reconstruction, and then you'd send your read messages, assuming you got back versions that were consistent across them. And you can get back a version number from the read.

B

Incidentally, if you choose to. As long as you get back version numbers that are consistent across all of the shards, then you know that you read the same version, whether it was a current version or not, from all of the copies, and then you would be able to perform a local reconstruction. But it's that last part.

B

That's the tricky part, and why we don't support replica reads on erasure-coded pools yet. We have a plugin system for the erasure coding libraries, and the monitors and the OSDs have the plugins locally, but clients don't necessarily. So it would be a choice to make that sort of thing available to the client, and we'd have to make some choices about how that would be configured. It might make sense to have RGW do it. It might not make sense to have librbd do it, if that makes sense.

B

Because librbd is packaged with QEMU, whereas RGW is more of a heavyweight configuration.
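
The client-side erasure-coded read described above can be sketched with a toy code. Everything here is an assumption for illustration: a 2+1 XOR parity scheme stands in for a real erasure code plugin, and `ShardReply` and `reconstruct` are hypothetical names. The sketch shows only the version-consistency check followed by local reconstruction.

```python
from dataclasses import dataclass

K = 2  # data shards needed; the toy code below is 2 data + 1 XOR parity


@dataclass(frozen=True)
class ShardReply:
    shard_id: int   # 0 and 1 are data shards, 2 is parity
    version: int    # object version reported alongside the read
    data: bytes


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def reconstruct(replies):
    """Rebuild the object from any K replies, or return None.

    Returns None when there are too few shards or when the shards report
    different versions, since mixing versions would decode garbage.
    """
    by_id = {r.shard_id: r for r in replies}
    if len(by_id) < K or len({r.version for r in by_id.values()}) != 1:
        return None
    if 0 in by_id and 1 in by_id:
        first, second = by_id[0].data, by_id[1].data
    elif 0 in by_id:
        first = by_id[0].data
        second = _xor(first, by_id[2].data)
    else:
        second = by_id[1].data
        first = _xor(second, by_id[2].data)
    return first + second


d0, d1 = b"slop", b"read"
parity = _xor(d0, d1)

same = reconstruct([ShardReply(0, 7, d0), ShardReply(2, 7, parity)])
mixed = reconstruct([ShardReply(0, 7, d0), ShardReply(2, 6, parity)])
```

Note that a consistent set of versions says only that the shards match each other, not that they are current, which is exactly the sloppy-read trade-off.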

B

Does that answer your question?

E

Yes, yep, thank you. And I have another question. Would it make sense to create a more generic consistency mechanism? I mean, to support multiple consistency levels, not just a few, and in a generic way. So you could say: okay, I want to have read-my-writes, or eventual consistency, or strong consistency, or monotonic reads, and things like that, instead of just.

B

So usually the closest people get to that is allowing you to specify the number, or what percentage, of the quorum group you need to write to. That doesn't generalize well, and it doesn't actually give you the properties that Cassandra, for example, claims it does. If you lose enough OSDs in a situation like that, you could still get an inconsistent read even though you've read from all of the quorum group. So I guess the short answer is I have yet to see a generic system that has that much expressiveness.
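
The quorum argument can be illustrated with a toy model. This is not a model of Cassandra or of any particular system; `quorums_overlap` and `read_latest` are hypothetical helpers, and a "lost" node is modeled as one that comes back empty at version 0.

```python
def quorums_overlap(n, r, w):
    """True when every read quorum of size r shares a node with every
    write quorum of size w among n replicas (pigeonhole: r + w > n)."""
    return r + w > n


def read_latest(replicas, read_set):
    """Return the highest version seen across the nodes in read_set."""
    return max(replicas[node] for node in read_set)


# Three replicas, W=2 and R=2. Version 2 was acknowledged by a and b only.
replicas = {"a": 2, "b": 2, "c": 1}
overlap = quorums_overlap(n=3, r=2, w=2)
before_loss = read_latest(replicas, ["b", "c"])

# Both a and b are lost and replaced by empty nodes (version 0).
replicas.update({"a": 0, "b": 0})
after_loss = read_latest(replicas, ["a", "b", "c"])
```

Before the loss, any two nodes include one that saw version 2. After it, even a read of all three nodes returns version 1, although the quorum sizes still satisfy the overlap rule.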

B

The idea with this one is that we can get this relatively simply. All we really need to do is disable some checks and change the semantics of it in RADOS. That's not so bad. The next step would be to create an actual AP pool type, which does something more like what Cassandra does. That would also be an exciting project, but much more work. You'd have to create an entirely new PG type.

B

You wouldn't be able to reuse the existing recovery mechanisms, because you wouldn't have a log. You wouldn't be able to reuse the existing peering mechanisms, because they wouldn't be applicable and they'd be doing things that you didn't want them to do anyway.

B

You could probably reuse the write messages, I guess, but that's not really much of a win. I guess what I'm saying is, if you come across such a method, please send it to the list. We would find it fascinating. Okay.

E

I mean, there's work that has been done by Microsoft Research where they have this general framework where you're able to specify, on a client session, the consistency level that you want to observe, and then the system provides that for you. But I understand your point. I guess what you're saying is that a lot of the consistency semantics are kind of hard-coded in many places in Ceph.

B

Also, well, for instance, with our pools it doesn't make sense for a replica to choose weaker write semantics. It's not possible, and that's because we have a read-after-write guarantee, a strongly consistent read-after-write guarantee. So it doesn't make sense for a client to do a write with anything less than that consistency, because they can't be sure that other clients haven't requested.

B

You know, read-after-write consistency. So I guess what I'm pointing out is that even in that regime, it's probably less generic than you think it is, although perhaps not. I haven't read that paper.

B

Do you have a link to it? Can you add it to the pad?

E

Mm-hmm, I'll add it.

D

Going back to sloppy reads in general, and the idea of a sealing method where we guarantee there are no more updates: that would make sense for RGW and perhaps other use cases where you're doing immutable objects.

D

Just for the general overwrite cases, where you do eventually overwrite at some point, it doesn't make as much sense.

B

We don't need special support for that either. All we need to be able to do is support an object class with a guard on an xattr. Yeah, it might make sense to add support for it simply so that there's a canonical way of doing it, and we could return a nice error if you try to perform an overwrite. But aside from that.
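
The write side of such a guard, sealing and then rejecting overwrites with a clear error, might look like the sketch below. It is plain Python rather than an object class; the `sealed` key, the `ToyObject` type, and the use of `-EROFS` as the error code are all assumptions chosen for illustration.

```python
import errno

SEALED_KEY = "sealed"  # hypothetical xattr name


class ToyObject:
    """In-memory stand-in for an object with data and xattrs."""

    def __init__(self, data=b""):
        self.data = data
        self.xattrs = {}


def seal(obj):
    """Mark the object immutable. Sealing twice is harmless."""
    obj.xattrs[SEALED_KEY] = b"1"
    return 0


def guarded_write(obj, data):
    """Overwrite the object unless it is sealed.

    Returns 0 on success, or -EROFS (an arbitrary choice of error code
    for this sketch) when the object has been sealed.
    """
    if obj.xattrs.get(SEALED_KEY) == b"1":
        return -errno.EROFS
    obj.data = data
    return 0


obj = ToyObject()
first = guarded_write(obj, b"final contents")
seal(obj)
second = guarded_write(obj, b"overwrite attempt")
```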

D

It's primarily a bit more safety.

D

If people do start using librados more directly and aren't aware of the issues, yeah.

D

Are there any other mechanisms that you can think of, aside from the client doing the sealing explicitly?

B

Well, they don't really need to do the sealing. It's just that you don't get useful results back otherwise. Okay, so the easy one is where you create an object, fully seal it, and then never modify it. If you're actively modifying an object, the only way this kind of read makes sense is if you genuinely don't care what version you get back. That would be odd. I can't offhand think of a reason to do that, but that's probably lack of imagination on my part.

B

So it's not so much that you need the sealing to write the code. That part doesn't matter. It's just that it seems like that would be a necessary piece for any sane human being using the interface to actually feel comfortable.

B

Although, I don't know, people seem pretty comfortable with MongoDB, so.

B

Maybe I'm wrong.

B

All right, is there anything else?

A

Lotta nothin yeah.
