Ceph CDS Hammer, 29 Oct 2014


From YouTube: CDS Hammer (Day 2) - OSD (Tiering): Reduce R/W Latencies on Cache Tier Miss

Description

http://goo.gl/U4b70r

29 October 2014
Ceph Developer Summit: Hammer
Day 2
OSD (Tiering): Reduce Read/Write Latencies on Cache Tier Miss
Zhiqiang Wang

A

Yeah, okay. For this one, we want to reduce the read/write latencies on a cache tier miss. In the current code path, when we do a read and the object is missing in the cache tier, we promote it, and after the promotion we replicate the object to all of the cache tier OSDs. The read request is queued and served only after all of the replications. Here we can make an improvement.

A

That is, we can serve the read request first, after the promotion but before the completion of the replications, so that we can reduce some of the latency of the read request.

A

That is for reads. For writes, when the object is missing in the cache tier and we promote it, we first replicate the promoted object to the other cache tier OSDs, then we update the object with the data from the client, and then we do another replication. Here we can avoid one replication.

A

That is, we do not need to do the replication right after the promotion. We hold the data, then we update the object with the data from the client, and then we do a replication. So we only need to do one replication, and we save the time of one replication. And for a full-object write, since we are going to overwrite the object anyway, I think there is no need to promote the object data.

A

We just need to promote the attributes and the omap. Since the attributes and the omap are very small, I think the promotion is very quick if we don't need to promote the object data.

A

Yes, this is the idea, and I also have some implementation ideas for this.

A

For the read: we handle the promotion in the process_copy_chunk and _write_copy_chunk functions, where the promoted data arrives, so we can do something there. In those functions we can copy the data we need for the read request into the outdata of the OSD op.

A

The outdata is part of the OSD op, so the data is already in the client request, and the replication after the promotion can be done at the same time. Then, when we unblock the read request after the promotion, the read request can go through while the promotion replication is done at the same time. In the current code,

A

the read request is blocked again, because it can't get the read/write lock. But in this new solution, since the data is already in the outdata of the OSD op, we do not need to block this read request again. So we don't need to get the lock in the do_op function, and the request can go ahead.

C

So I think the trick here is that there's going to be a small subset of operations that this will work for. It's only when the read operation can be completely satisfied by the results from one chunk, basically, or partially satisfied from that one chunk. So we can't

C

reuse the do_osd_ops function, because it's completely general and calls down to the FileStore or whatever. So it seems like we would need to write a new function, something like "try to satisfy the op from", what is it, the copy results data.

D

Or whatever it is that we have already.

D

So the set of reads for which this will work is exactly the same as the set of ops for which async reads are OK. The way async reads work is that we build up a map of the offsets we need. So if we keep that map of offsets, then as we receive chunks we can of course fill them in, and, right.

C

When it's not on, let

D

me kick it off. That said, it still only works for the first read. It does not work for any subsequent read. If you receive two reads in sequence, the first one will be satisfied this way. The second will be blocked, waiting for the thing to complete. I think you're actually better off proxying the read to the backend. The

B

one thing there is that, if you're

D

waiting for byte, you know, four megabytes minus one of a given object, you'd be much happier if you'd gone ahead and just proxied that read, or so.
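A minimal sketch of the bookkeeping D describes: keep a map of the offsets a read still needs, and fill entries in as promote chunks arrive. The class name, method names, and in-memory representation are hypothetical, and Python is used only for brevity; this is not the OSD's actual async read code.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PendingRead:
    """One queued read: the byte extents it still needs, keyed by offset."""

    extents: dict[int, int]                       # offset -> length still wanted
    filled: dict[int, bytes] = field(default_factory=dict)

    def on_chunk(self, chunk_off: int, data: bytes) -> None:
        """Fill every wanted extent that lies entirely inside this chunk."""
        chunk_end = chunk_off + len(data)
        for off, length in list(self.extents.items()):
            if chunk_off <= off and off + length <= chunk_end:
                start = off - chunk_off
                self.filled[off] = data[start:start + length]
                del self.extents[off]

    def complete(self) -> bool:
        """True once no extent is outstanding, so the read can be answered."""
        return not self.extents
```

As D points out, this only helps a read that was already queued when the chunk arrived; an extent that spans two chunks is not handled by this sketch and would have to wait for the full object.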

C

Either way, it doesn't matter. So if we turn this the other way around and say: if our approach is to proxy the read and then promote after that, what's the worst case? I guess that would be if we were to do a four-meg read that we proxied, and then immediately after that we do the promote, which is reading the same four megs over again.

C

So we send the data over the wire twice, yeah.

C

At least then it would all be cached, so we wouldn't do the I/O twice.

C

It would just be the network twice. Does that make sense? But yeah, in the general case, though, your reads aren't the full four megs; you're usually reading 4K or 128K or something like that. In that case you proxy the 128K read to the backend, and then you read the full four megs that surrounds it, and you'll have some minimal cache advantages, because all the file metadata is already cached and you can do the seek directly.

C

The read is going to be faster, because you won't read the four megs. But yeah, what Sam just said, so that read would always be fed.
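The worst case C walks through can be put into numbers. This is a small worked example, assuming a 4 MiB object and counting only payload bytes fetched from the base tier; the function names are made up for illustration.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB object, as in the discussion


def base_tier_bytes(read_len: int, object_size: int = OBJECT_SIZE) -> int:
    """Payload bytes fetched from the base tier for a proxied read plus a promote."""
    return read_len + object_size


def extra_fraction(read_len: int, object_size: int = OBJECT_SIZE) -> float:
    """Extra network traffic of proxy-then-promote, relative to the promote alone."""
    return read_len / object_size
```

With a full 4 MiB read the base tier sends the object twice (8 MiB in total), while a 128K read adds only 1/32 on top of the promote.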

B

Do we really, obviously,

C

want four-meg reads

B

being promoted into cache anyway, though? I mean, isn't this about, look, where the base tier is...

C

Question two, right.

D

Um, yeah, I mean, maybe. Like I said, I don't know. Well, maybe, because you're going to have a lot of write locality after a read. Actually, if what you're doing is promoting a piece of a file that you're doing random write I/O on, then you may be reading a big chunk of the file into the cache on the RBD client, and then you'll do in-place overwrites, and those will flush back out. So actually, yes, possibly. It depends on the workload.

D

I also was going to say something that I forgot. Never mind.

D

Anyway, the best-case scenario for a design like this would be where you go one step further and actually keep that whole buffer of data around, to be used to satisfy subsequent reads.

D

That's hard, though. That's a lot of data; it's not clear that that's a good use of memory. It's

C

on disk, at least, right? It's in a temporary object that we're staging, that's replicating. So we'd have to keep track of the...

D

Tracking the readability state, and that's kind of unpleasant, yeah.

C

It's also complex. I mean, it seems to me, again, that the first thing we try should be proxying the reads, and the second thing we try would be satisfying read operations out of any individual chunk that comes back during the promote, but

D

But probably that one's really hard, though, yeah.

C

I mean, it's a little bit less hard if you have the async read stuff, no?

D

But even so, you have to have a read that only reads out of that and subsequent chunks, and that you wouldn't have been better off proxying. Right, so both of those have to be true. It can't require any bytes that you've already pulled through, because then you'll have to wait for the full object anyway, and it can't be so far ahead in the future that you wouldn't have been better off just sending the small read back to the base pool. So yeah, it seems like a pretty narrow window.

A

So you are worrying about the async read. In this approach, to fill in the data of the read request from the promotion chunks, since we may promote this object in several chunks, we would go through the op and process the read request several times. For the async read, I think we can just... the blocked read request is held in the

A

We hold all of the blocked requests in a list, right? Then we can get those operations from that list and fill in the data, and for async reads also.

A

We already have a list to hold the requests from the client.

A

The challenge for synchronous

D

reads is that what you put in the outdata may depend on what you read. Object classes can perform arbitrary reads based on information they get back from a read, so you could read the first byte and use that to decide whether you're going to read the second megabyte or the third megabyte. No one does that, of course. So for EC pools, we have a thing where it detects when it's able to do an async read, and in that case we can do this as you go through the OSD ops.

D

As long as you don't try to do a synchronous read where you actually look at the results, it builds up a map of the offsets and lengths that it's going to want to read, and then, for an erasure-coded pool, of course,

D

it then goes off and fetches that from the OSDs. For this, it would instead wait until the relevant promotion chunk comes through. But even so, it would have to be the case that waiting for that promotion chunk to come through is a better strategy than just proxying it, because if you need to wait for two megabytes of chunks before you get to the bytes you actually need, then probably it was better to just go ahead and send it to the base pool. That way it happens really quickly, relatively speaking, yeah.

A

But we need to do a promotion of this object anyway, after the read request, after the redirected read.

D

No, no, you can actually do the promotion in parallel. In parallel, yeah, right.

C

It could be a totally independent decision, right? So we have a stream of reads; you're proxying all of them, or redirecting all of them,

C

whichever, and at some point you decide, maybe I should promote this object, and so you start the promotion, and that's a slow thing that happens in the background. But the read stream will still get proxied until the complete object is present, and then you start using the cached version. That's the simplest thing, and for a lot of cases it's also, I think, the best thing. I think the only cases where it isn't are

C

if the read is small, and if you know that you are already part way through the promotion, and the promotion chunk that is in flight already has the exact same data that you need, and the read fully falls within that range. Then it might make sense to wait for that next chunk to come and satisfy it from that chunk. But that's going to be, I think the promotion currently is in 512K or one-meg chunks, maybe? Do you remember, Sam, what the current chunk granularity is for promote?
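The policy C outlines (proxy by default, use the cache once the object is fully present, and wait only when an in-flight chunk covers the whole read) can be sketched as a single decision function. The enum, the function name, and the (offset, length) tuple for the in-flight chunk are hypothetical choices for illustration, not names from the Ceph code.

```python
from enum import Enum
from typing import Optional, Tuple


class Serve(Enum):
    FROM_CACHE = "from_cache"
    WAIT_FOR_CHUNK = "wait_for_chunk"
    PROXY = "proxy"


def choose_read_path(
    read_off: int,
    read_len: int,
    fully_promoted: bool,
    inflight_chunk: Optional[Tuple[int, int]] = None,  # (offset, length), assumed shape
) -> Serve:
    """Pick how to serve a read that arrives at the cache tier."""
    if fully_promoted:
        return Serve.FROM_CACHE
    if inflight_chunk is not None:
        chunk_off, chunk_len = inflight_chunk
        # Wait only if the read falls entirely inside the chunk in flight.
        if chunk_off <= read_off and read_off + read_len <= chunk_off + chunk_len:
            return Serve.WAIT_FOR_CHUNK
    return Serve.PROXY
```

The sketch ignores how far away the chunk is in time; as D notes, waiting only pays off when the chunk is about to arrive.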

D

I have... oh god, I hope it's not one megabyte. It's probably low. It's probably

C

reading, like, four one-meg reads, or whatever it is. I don't know, which we should fix. I don't know. Yeah, okay, but

C

Oh, good, good.

D

Okay, so it's one request, then. So, I mean, in the general case for RBD it's going to be one copy-get. So the only way you win is if the small read came in right before the copy-get chunk showed up. Otherwise, you're waiting for a four-megabyte read to perform, but I

C

think it's... if that copy-get is in flight, then you're always better off waiting for it, because... I'm sorry, you're right, because of the ordering, because you're not going to be able to pass it. You're not going

D

to get ahead of it, unless

C

you have a different priority or something, but probably, yeah. Okay.

D

Actually, that's true. So if we usually copy-get the whole object, then this is probably an optimization that's worth making a little later on.

C

But the first step would be to proxy the reads in the normal case, and then the second step would be: if we get a read, and we know it can be satisfied by the copy-get that's in progress, then we do this additional operation. And we probably want to implement the async infrastructure first so that we can reuse the same

C

the same strategy, yeah.

A

Okay, that is for reads. Then let's go to writes.

A

Okay, in the write code path, also in the function _write_copy_chunk, we can allocate an operation context to store all the data promoted from the base tier, and we do not initiate a repop to do the replication. After that, when the write request is unblocked, in the function do_op, since we already allocated this operation context for the write request, we do not need to allocate another one; we use the previous one. And then in the do_osd_ops function,

A

we update this op context with the data from the client, and then the replication is sent using this op context.

D

Certainly. So

C

Things need to happen here. We need to make sure that the promoted write essentially says, here's the new object, which includes the effects of the write. So it has to have a new object_info_t, a new version and mtime; all that stuff needs to be changed. And we need to combine the written data into the promoted data.

D

The other piece you may be missing is that that repop is also what does the local write.

A

Yeah, yeah. I mean, when I say repop, I mean doing the write both on the local and on the remote OSDs. Okay.

D

In that case, that will work, as long as the write does not... yeah, okay. It's exactly the same set of circumstances as under

D

the read here. As long as the only reads in the transaction are async, which there won't be any of, because it's a write op and that would be pointless, then you're fine.

C

It seems like this is a similar sort of thing, where we want to have a function that's like "try to do optimized merged write".

C

It will understand a very limited subset of transactions, like only op write and op writefull, and if there are any other ops in the transaction beyond, like, setxattr, it'll just say "I can't do it" and fall back to the general behavior. And in just that case it will basically take the local write and merge it into the promoted copy data.

C

I think the hard part is that it needs to do all of the same machinery that happens in finish_ctx and prepare_transaction, somewhere in there, where it sets the version, and it sets the mtime, and it sets the user version, and all that logic. That's kind of convoluted, because the object that we promote will be the modified version and not the original version.
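A rough sketch of the "try to do optimized merged write" idea: accept only a small set of ops, merge the client data into the promoted data, and return nothing to signal a fallback to the normal path. The op representation (dicts with "op", "offset", "data") and the function name are assumptions made for illustration; the version, mtime and user-version handling that C calls the hard part is deliberately left out.

```python
from typing import Optional

# Ops the sketch understands; anything else forces the fallback path.
MERGEABLE_OPS = {"write", "writefull", "setxattr"}


def try_optimized_merged_write(ops: list, promoted_data: bytes) -> Optional[bytes]:
    """Return the merged object data, or None if the transaction is not mergeable."""
    if any(op["op"] not in MERGEABLE_OPS for op in ops):
        return None
    data = bytearray(promoted_data)
    for op in ops:
        if op["op"] == "writefull":
            data = bytearray(op["data"])          # replaces the whole object
        elif op["op"] == "write":
            off = op["offset"]
            end = off + len(op["data"])
            if end > len(data):                   # zero-fill up to the new end
                data.extend(bytes(end - len(data)))
            data[off:end] = op["data"]
        # setxattr does not touch the data payload
    return bytes(data)
```

Returning None rather than raising keeps the caller's fallback explicit: the promote and the client write then proceed as two separate steps, as they do today.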

A

Yeah, doing this is complicated, because when we update the data from the client, we will also

A

have the mtime change, the version change, those things. Wait a minute.

D

Actually, are you worried about the extra op context, or are you worried about the fact that it blocks? Because I don't think it does block. That finish promote is not going to hold the read lock the whole time; it's going to drop the read lock as soon as it submits the repop. The write would then immediately execute, as long as it doesn't try to take a read lock. If we're going to take a read lock, we're screwed anyway, yeah.

A

The repop from the client write request can be done at the same time as the step to replicate the promoted object. But for this we need to do two replications, right?

D

I mean, sort of, but not really. It costs you an extra message, but no extra latency; they'd be pipelined.

A

Yeah, it is done in parallel, but, well,

D

Concurrently, actually, but yeah, pipelined. Parts of them overlap, yeah.

D

I'm not sure there's much win here, other than... actually, yeah, I don't know.

D

Me a Zonda.

A

It depends on how much, and it depends on writes to the whole...

D

No, no, I see the problem. Um, we block until the promote is done, don't we? All right, I'm not, if

D

we do, that's the fix. We don't need to wait for it.

A

We block until the copy-from is done, and then, yeah.

D

Okay, so that's what we should fix. Conceptually, once we've fired off the repop that completes the promote, we don't need to continue to block. We can release the queues and rely on the write lock. We have... no, that's a shared-exclusive lock. Well, it's a shared lock. Anything

A

Oh, I see, that is what we do currently. When we send the repop to the remote OSDs for the promoted object, we just unblock the client request. I think you're right.

C

Okay, so then I think it's really just a question of how big the client write is compared to the promote. So if the promote is four megs and the client write is 4K, then it doesn't matter that much, right? We're losing the cost of an extra message, which isn't that big, and we're sending 4K that we're about to overwrite, which is like, who cares. So I wonder if it...

C

But if we're writing the entire object, then we're writing four megs and then overwriting four megs, and that's obviously a case where it would make sense, but

D

every other case, you're

D

I think they all sit.

C

Same as the third, your third example, which is, yeah.

D

That's the only case we care about, I think.

C

Where you're overwriting a full object, and you shouldn't have bothered to promote it in the first place, basically. Right, so in the writefull case we want to promote the object info and the xattrs and the omap and not the data, and we should just write in the new data. But actually...

D

Yeah, also, just for the record, with the four-megabyte write followed by the four-kilobyte write, you're not going to do a four-kilobyte write to the actual disk. The journal will see both, but the file system won't. It'll be written to the page cache, and then later on it'll be flushed with both operations. So yeah, I'm not sure that saves you anything.

C

We talked about the last case, then. So for the full one, where the writefull ones are coming in, I think that's the one where, that's a more

A

Just to point out, this one might be easy to implement.

C

Mm-hmm. There's still the complexity that when you finally write the object, when you're done with the promote, you have to modify it to make sure the object_info_t reflects the net effect of both the promote and the client op. But that's doable; we just have to be careful. Actually, the real concern is

D

if the... we don't perform client writes for clients that aren't there. So if the client causes a promote and then isn't around, we have to know to actually apply that writefull anyway, which is a small change. But my biggest concern is: what things actually do a writefull that didn't just previously do a delete? So

C

Well, it isn't necessarily writefull; it's anything that happens to overwrite all of the data, right? So, how many things do writefull? RGW does, but it's always writing to fresh objects; it's never overwriting. So it doesn't help there. An RBD import probably would. But how often do you import across an existing image? I think actually you never do that, and

D

when you are, you don't want it to cache in that case anyway. Like, aggressive write, one.

C

But I think the only case is really when you have a VM that is doing streaming I/O inside the block device, inside the VM, and in that case you're contiguously writing the entire device. Is RBD

D

going to see a four-megabyte write, though? Isn't it just going to see a whole sequence of 64K, 128K, 192K, 64K, 120...

C

A bit, but the write-back cache is going to coalesce this into one four-meg write. It'll actually go to... yeah, okay, so that's the case where the OSD will see one big four-meg thing. It won't be marked as writefull, at least currently. We could probably fix it so it would. In fact, if we did fix it, that would help, because otherwise the OSD has to do a stat to find out; it needs the object info first.

C

If it knows it's a writefull, then it can do a promote that skips the data portion but gets everything else. Although actually that's almost always going to be the same thing, because you rarely have data and omap on the same object.

D

You absolutely have to coalesce the operations if you do that one. It is not optional, yeah.

C

Right, exactly. Okay, but the key thing, though, is the question of what if there's a snapshot. If this write would have resulted in a clone, then we

C

have to promote the whole thing, and we won't know that until we get the object info.

B

That's actually.

D

I would argue that's pretty common with RBD.

C

You very frequently have snapshots, and you'll need to

D

No, I mean specifically the case where it's sitting in the base pool and we're doing a streaming write. Probably that block hasn't been written in a while, and snapshots have happened. I think that's actually going to happen a lot.

C

Is it possible to... oh, I forget. Is it possible to have the head in the cache tier, but no clone?

D

Yes, it is, and that actually works fine. Hang on, you're right, that's a non-issue.

C

Well, we need to be clever, right? We have to be clever about it, right? Well,

D

you need to... we don't get to reuse make_writeable, but that's the extent of the cleverness. All of the relevant state can be represented just fine. We just can't use make_writeable, because there's no... yep, that's all.

C

So in that case, basically, the logic would be: if we get a writefull, then the copy-get would have a flag that says skip data.

D

Actually, you'd fill in the data when you start the promote, or whatever. You would optionally provide the data payload.

C

Oh, and I would just basically make the copy-get, but, and the

D

updated version or something. That's very hard to... no, well, it won't be hard, but it will be fiddly. You

B

Up to next, I don't.

D

have any of this information yet. You have to preemptively scan the transaction to find out whether this is even possible, because you don't have an object info yet, right?

C

You have to look at it and say: is this a writefull and setxattrs, and completely independent of any previous state of the object? There are no class ops or cmpxattr or anything. And if it passes that test, it's a simple enough operation, then the promote would have a flag that says skip the data, because I'm about to replace it anyway. And at the promote completion,

C

you would provide the payload that you just got, and it would have to construct the final object info that has the result of both the client write and the promote, then issue a single repop.
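The two steps C and D describe (pre-scan the transaction before any object info is available, then build the final object state at promote completion) might look roughly like this. Everything here is a hypothetical model: the ObjectInfo fields, the op dictionaries, and the "version + 1" bump stand in for the real object_info_t handling, and snapshot handling is omitted.

```python
from __future__ import annotations

from dataclasses import dataclass

# Ops that do not depend on any previous state of the object.
STATE_INDEPENDENT_OPS = {"writefull", "setxattr"}


@dataclass
class ObjectInfo:
    version: int
    size: int


def can_skip_data(ops: list) -> bool:
    """Pre-scan: true only for a writefull plus state-independent ops."""
    has_writefull = any(op["op"] == "writefull" for op in ops)
    return has_writefull and all(op["op"] in STATE_INDEPENDENT_OPS for op in ops)


def finish_skip_data_promote(
    promoted_info: ObjectInfo, promoted_xattrs: dict, ops: list
) -> tuple[ObjectInfo, dict, bytes]:
    """Combine the promoted metadata with the client ops into one final state."""
    payload = b""
    xattrs = dict(promoted_xattrs)
    for op in ops:
        if op["op"] == "writefull":
            payload = op["data"]
        elif op["op"] == "setxattr":
            xattrs[op["name"]] = op["value"]
    # Assumption: the merged result counts as one new version of the object.
    final_info = ObjectInfo(version=promoted_info.version + 1, size=len(payload))
    return final_info, xattrs, payload
```

The returned triple is what a single repop would carry in this model, so the promoted data never has to cross the wire.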

C

Handle the snapshot properly.

D

That part's not as hard, actually. But I guess to do this, I'd want evidence that this is actually going to fire a lot of the time. This is quite a bit of code to handle an edge case, so it would have to be a pretty common edge case, yeah.

C

So one thing that might be worth mentioning, and this is a tangent: Adam was working for us over the summer, and he wrote a trace capture tool for RBD. It is a bunch of LTTng tracepoints that will generate a trace file from a real RBD workload. The idea would be that you would run this on an actual cloud with the actual VMs, and you would get workload traces for everything that they're actually doing.

C

And then you could look at that and see, (a), how often we do complete writes in the first place, and then, (b), you could try to model how old those objects were, whether they're likely to be in the base tier and not in the cache tier, and whether this promote case would actually help. And if
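Question (a) above could be answered from a captured trace with a few lines of analysis. The record layout below (dicts with "op", "offset", "length") is invented for the example and is not the output format of the RBD tracing tool; a 4 MiB object size is assumed.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumed RBD object size


def full_object_write_ratio(trace: list, object_size: int = OBJECT_SIZE) -> float:
    """Fraction of write records that cover an entire object."""
    writes = [rec for rec in trace if rec["op"] == "write"]
    if not writes:
        return 0.0
    full = sum(
        1 for rec in writes
        if rec["offset"] == 0 and rec["length"] >= object_size
    )
    return full / len(writes)
```

Question (b), how old the overwritten objects are, would additionally need timestamps and object names in each record.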

D

I think I'd be satisfied with just seeing a lot of writefulls in them.

C

Personally, that they happen at all, yeah.

A

Well, do we have that tested at this time? No?

C

The data... so the tool

D

Labrum is generated.

C

Yes, so we just

A

Added a giant, I mean, the

C

tool, yeah. So we have the tool, but no

C

trace data. It needs to be deployed

C

in a virtual cloud with a real workload, and then we need to capture the traces. So hopefully you have these open clouds just sitting around where you can do that, but the

D

real thing that happens a lot, I mean, the actual, yeah. I

D

mean the bulk data copy. I mean, it would be fine for me if you ran, like, some distribution for a bit, or, I don't know, fio, using fio against RBD, and then did a bulk copy, like just a big zero operation. Or no, the zero operation is terrible, because it'll delete the objects. But piping from /dev/urandom or whatever.

D

If you observe a lot of writefulls there, because the whole stack is willing to actually do that coalescing, then that would probably be reasonable. In other words, it would greatly speed up the bulk copy case, which is something people do, and they get annoyed when it's slow. So if we had that, that would be a good place to start anyway. Well,

C

I mean, the second question is: how often does this writefull happen in a case where it's followed by writes that you actually want to have in the cache tier? It could just be that when we get a big write, a complete block, we just say: this is big, the object hasn't been touched in a while, I'm just going to pass it back to the base tier, because I don't want to pollute, I don't want to cause cache churn. Yes.

A

Yeah, and after that we can just redirect this request to the base tier.

D

We can't do it.

D

That's a thing we can

C

build, yes. And I think the key thing to remember is that for writes we have to proxy them; we can't redirect them. But for reads we can redirect or proxy, it doesn't matter. And I forget why that's true. I can't... for a particular client,

D

I think we have to do one or the other.

C

You can't do both, because you, yeah, I don't

D

think you can alternate. In other words, you

C

have to... right, okay, but

D

in general, for reads, you...

C

Okay, and writes always have to be proxied. But yeah, I think the first thing is that by just enabling the proxy, we're going to improve most of these situations.
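The rule C and D settle on can be written down as a tiny policy object: writes are always proxied, and reads use one fixed mode (proxy or redirect) rather than alternating. The class and its string constants are hypothetical, chosen only to make the rule concrete.

```python
class MissPolicy:
    """How the cache tier forwards an op on a miss (illustrative model only)."""

    READ_MODES = ("proxy", "redirect")

    def __init__(self, read_mode: str = "proxy") -> None:
        if read_mode not in self.READ_MODES:
            raise ValueError(f"unknown read mode: {read_mode}")
        # Fixed at construction: the discussion suggests not alternating.
        self.read_mode = read_mode

    def action(self, is_write: bool) -> str:
        """Writes must be proxied; reads follow the configured mode."""
        return "proxy" if is_write else self.read_mode
```

Fixing the read mode at construction mirrors D's remark that you have to do one or the other, not switch back and forth.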

B

So we're about 15 minutes over our session. We've pretty much eaten the break at this point, so, probably good. Sorry, I didn't want to interrupt; it sounded like it was productive, but at the same time I also don't want to throw us way behind.

C

Okay, cool. Okay, thank

C

you guys. Thank you, Zhiqiang. Yes, good data and good suggestions. I think I'll probably do what I did with the sessions yesterday: I'll try to go through and list the things that I think we probably want to do, but then feel free to disagree and tell me that I missed your suggestion. I'll try to send an email tomorrow.

C

Yeah? Okay, okay.