http://goo.gl/U4b70r
29 October 2014
Ceph Developer Summit: Hammer
Day 2
OSD (Tiering): Fine-Grained Promotion Unit
Zhiqiang Wang
A
I think so, but it was your blueprint, so if you want to give us a little bit of an overview of what you were thinking, then we can go from there.

B
Sure.
C
Hey, one simple comment here: that's okay, I saw it come up previously, right after last week's weekly performance meeting.
B
Okay, this is some tests we did before on cache tiering performance. We did four cases: the first three are with cache tiering and the last one is without cache tiering, but with the SSDs used for the journal. The performance of all four cases is shown in the table in this chart. As you can see, without cache tiering the write performance is about 1000 IOPS, the read is about 1500 IOPS, and for the random read it is...
B
It is 1200 IOPS. Okay, the difference between the with-cache-tiering cases, case one, case two and case three, is this: in case one we set the dirty ratio and the full ratio of the cache tiering so that we can hold all of the data in the cache tier, so we don't need to do flush, we don't need to do evict, and there is also no promotion. In case two there is flush work, but there is no promotion, and in case three we have both.
B
We have all of the eviction, flush and promotion in case three, and as you can see from the results, the case one and case two results are good compared with the without-cache-tiering result, especially for the read. But in case three, in which we have promotions, flushes and evictions, the result is very poor. Compared with the without-cache-tiering result, for the write it is just about thirty to thirty-five percent, and for the read it is about sixty percent. So the result for case three is very poor.
F
In addition to the tests in case three that you did, it would be interesting as well to have a Zipf distribution with fio, which would let you look at what kind of performance you see when you have a combination of hot and cold data. My guess is it will still be bad, just based on the tests I've done, but maybe it won't be quite as bad, actually.
B
Okay, and that was the IOPS data we just looked at. You can also look at some of the latency data, as shown in the picture, for these four cases. The first one is without cache tiering, and the latency is around 200 milliseconds. Then for case one and case two the read response time is very good compared with the without-cache-tiering result, but for case three the latency is huge: it's above 1000 milliseconds. I think this is not acceptable. Yes.
E
So this is the random read case. When we promote due to a read, it does the promotion first: it reads the object from the backing pool, then it writes it to the cache pool, and then it does the read. This patch will instead forward the read to the backing pool and then promote, so the first read, at least, on that object won't block at all; only a second read to that same object that comes right after it would. Yeah.
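The ordering difference just described can be compared with a toy model. This is purely illustrative, not Ceph code, and the latency constants are invented round numbers:

```python
# Toy model of a cache-tier read miss (illustration only, not Ceph code).
# The latency constants are made-up values chosen just for comparison.

BACKING_READ_MS = 10.0   # read the object from the backing pool
CACHE_WRITE_MS = 2.0     # write the promoted copy into the cache pool
CACHE_READ_MS = 0.5      # serve the read from the cache pool

def promote_then_read():
    """Current behavior: the client read blocks behind the whole promotion."""
    return BACKING_READ_MS + CACHE_WRITE_MS + CACHE_READ_MS

def forward_then_promote():
    """Patched behavior: forward the read to the backing pool immediately,
    promoting off the critical path, so the first read sees only the
    backing-pool latency."""
    return BACKING_READ_MS
```

Under this model the first read on a cold object pays only the backing-pool latency; the promotion cost only shows up for a second read issued before the promote completes.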
E
It's actually different than that: once you decide to promote, it'll both promote and forward, instead of promoting and then retrying the operation. So I think in a random read case it'll probably help a lot, because you're usually going to touch that object once and then you're going to touch other objects. In other workloads where you have two reads to the same object, it will only help with the first one; the second one will be slower, but it'll at least pipeline them a little bit.
B
Yes, we did some latency breakdown data for the promotion. We divided the whole time spent in the cache OSD into the promotion read, the promotion copy and the promotion replication, and the total time is the cache tier OSD latency, as shown in the table. We can see, for random read, when queue depth is one...
B
The promotion time is above ninety percent of the total time, and for queue depth 16 it is about 15 percent, right. The ratio is even higher for the write: at queue depth 1 the promotion time is above ninety percent, ninety-two percent, and at queue depth 16 it is above 76 percent for the promote. So you can see that most of the time is spent on the promotion.
G
Well, okay. So, well, so that's true with this design. I think all of these slides are basically just saying cache misses are more expensive than they should be with this design, right? So that's true. The random distribution is actually a good way to measure the overhead of a cache miss.
E
So yeah, so promotions. Yes, promotions are bad, but I think we should be careful about focusing just on promotion without also having some effort to look at what the actual impact on a more realistic workload is. Unfortunately, though, what "more realistic" means has been hard to figure out, because we don't have good models for what a typical, I don't know, cloud VM workload looks like, and what the actual distribution is. How skewed is it?
E
Is it a good model, and how skewed is it? Because, I mean, the key variables are whatever the Zipf parameters are, that is, basically how skewed the distribution is toward the hot data, and then the size of the cache relative to the base, or the total data set. Those are the two key variables, and there are obviously going to be some combinations that are good and bad, and again, ideally you'd make a little graph...
E
...that shows where it's going to be good and where it's not going to be good. Just being able to do that experiment would be helpful. That's one direction: just to understand it. But then obviously the other direction is: yes, when you do promote, you want it to go as fast as possible, and it's expensive right now. So I think we sort of need to pay attention to both of those. Does that make sense?
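The two key variables named here, the skew of the Zipf distribution and the cache-to-data-set ratio, can be explored with a small model. This is a hypothetical sketch for building the kind of good/bad-region graph suggested above, not tied to any Ceph tooling:

```python
def zipf_hit_rate(n_objects, skew, cache_fraction):
    """Fraction of requests served from cache, assuming the cache pins the
    hottest objects and object popularity follows a Zipf(skew) distribution.
    skew=0 is a uniform workload; larger skew means hotter hot data."""
    weights = [1.0 / (rank ** skew) for rank in range(1, n_objects + 1)]
    cached = int(n_objects * cache_fraction)
    return sum(weights[:cached]) / sum(weights)
```

For example, with a cache holding 20% of 1000 objects, a uniform workload (skew 0) hits the cache exactly 20% of the time, while a skew-1 Zipf workload hits it closer to 80%.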
B
Okay, I put the link in the etherpad. We also tested a feature where the promotion doesn't happen on the first read but on the second read. This is a small feature we did before, and the pictures show that with this feature the IOPS improves from about 200 to 250 IOPS to around 1800 IOPS, the latency improves about two to three times, and the read improves about sixty percent.
E
Yeah, okay, so I think this is a good example, because if you really have a random workload, then you actually never want to promote, ever. If your workload actually was random, then you would basically want to promote just a random subset, however big your cache is. Say your cache is twenty percent of your total size: you just take a random twenty percent and promote it, and then never change the contents of your cache.
E
That would be the winning strategy, and so you would actually not want to promote on the second read; you would want to promote on, like, the 100th read or something. So again, yes, promote on second read is a really good idea and it helps, but that workload isn't the right one to show the situation where it's really going to help, and how much it really helps in the real situation is actually...
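The point that no promotion policy beats a pinned random subset under a truly random workload can be checked with a toy simulation. This is illustrative only; the simple LRU cache here is not how the cache tier actually chooses evictions:

```python
import random

def uniform_hit_rate(n_objects, cache_size, n_requests, seed=0):
    """Simulate eager promote-on-first-read with LRU eviction under a
    uniform random workload. The hit rate converges to
    cache_size / n_objects, i.e. no better than pinning a random subset
    and never promoting again."""
    rng = random.Random(seed)
    cache = {}  # object id -> tick of last use (simple LRU)
    hits = 0
    for tick in range(n_requests):
        obj = rng.randrange(n_objects)
        if obj in cache:
            hits += 1
        elif len(cache) >= cache_size:
            # evict the least recently used object before promoting
            del cache[min(cache, key=cache.get)]
        cache[obj] = tick
    return hits / n_requests
```

With 100 objects, a 20-object cache and 20,000 uniform requests, the hit rate comes out near 0.2, the cache fraction, no matter how eagerly the policy promotes; every promotion cost is pure overhead.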
G
So I'm wondering whether the whole thing here is to see how many operations we can serve while the promote is going on. We can serve reads in the general case while a promote is happening, right, by redirecting to the back end. If we want to make sure of ordering, all we have to do is proxy it, right? The write, that is. Yeah, right.
E
Reads, we can also... I mean, so there are two things: we can forward reads or we can proxy reads. So far we've only been forwarding, but just being able to proxy reads in general will probably also have a sort of across-the-board improvement in latency, although it'll cost some CPU on the cache tier OSDs, right, because of the base pool ops.
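The forward-versus-proxy distinction can be sketched like this; `Pool` and the handler names are hypothetical stand-ins for illustration, not Ceph interfaces:

```python
class Pool:
    """Toy stand-in for a RADOS pool (illustration only, not Ceph code)."""
    def __init__(self, name, objects):
        self.name = name
        self.objects = objects

    def read(self, obj):
        return self.objects[obj]

def handle_read_forward(backing_pool, obj):
    # Forward/redirect: tell the client to retry against the backing pool,
    # costing the client a second round trip.
    return ("redirect", backing_pool.name)

def handle_read_proxy(backing_pool, obj):
    # Proxy: the cache-tier OSD fetches the data itself and replies
    # directly, saving the client hop at the cost of cache-tier CPU.
    return ("data", backing_pool.read(obj))
```

Forwarding costs the client a second round trip to the backing pool; proxying keeps a single client round trip but spends cache-tier OSD CPU doing the backing-pool read on the client's behalf.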
G
Writes are a little trickier. I think you're getting to that later in this talk, but we would need some kind of partial promote where we keep deltas of the object. That's...
E
All right, I'm making a couple of notes in the pad, but I think so: step one would be the wip-promote-forward branch; step two, or just a refinement of that, is to always forward when a promote is in progress. Actually, that should probably just be a todo item in there. And then a third step, which would be more complicated, would be to proxy the reads instead of redirecting the reads, but that'll be a little bit more code. I won't do that now, but yeah, it does look complicated.
B
Okay, the first idea I'm going to propose is to use a fine-grained promotion unit. As we know, in the current setup the default object size is four megabytes, and when we do promotions we promote the whole object. I think four megabytes is too big for promotion, thus I propose to use a smaller promotion unit: we can divide the object into 4K units, and we call this unit a page.
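One way to picture the proposed bookkeeping: a 4 MB object split into 1024 pages of 4 KB, with a per-object bitmap of which pages have been promoted. This is a hypothetical sketch of the idea, not the proposed implementation:

```python
OBJECT_SIZE = 4 * 1024 * 1024        # default RADOS object size: 4 MB
PAGE_SIZE = 4 * 1024                 # proposed promotion unit: 4 KB
N_PAGES = OBJECT_SIZE // PAGE_SIZE   # 1024 pages per object

class PartialObject:
    """Hypothetical bookkeeping for a partially promoted object: a bitmap
    records which 4 KB pages are already present in the cache tier."""
    def __init__(self):
        self.present = [False] * N_PAGES

    def promote_range(self, offset, length):
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            self.present[page] = True

    def can_serve(self, offset, length):
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        return all(self.present[p] for p in range(first, last + 1))
```

A read can be served from the cache tier only if every page it touches is present; otherwise it has to go to (or be proxied from) the backing pool.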
G
I'm also wondering how much of it you get for free just by proxying the read. Keep in mind, even if you're promoting a 4K page instead of a 4 megabyte object, on a 4K read you're still having the read wait behind the promote and commit, which is a network round trip and a disk operation. So you kind of still have to proxy the read; that's not optional.
E
I think the crummy part is the read latency with promote, and we can do these much simpler things that I think will help a lot: basically proxying the reads and forwarding while we're promoting. So I think there are simpler things we can do first that help with random reads a lot. But the random writes are the hard ones, because we can't, in a simple way, either forward them while the promote is in progress or do them locally, and I think...
G
It's just hard. This idea: if there's no read in the transaction, then once you have the object info, you have a license to perform any writes you want, as long as you don't actually serve any reads. So you need to buffer all of the writes you haven't applied yet, and then once you... what do you do, do you forward?
E
I think the thing that worries me is the way that promotion works right now in a cache tier: during the promote we're writing to a temporary object, and it's going to be a series of reads, and only when we're done do we move it into place and actually complete the promotion. Whereas in this case, in order to persist the client write, we have to have something that's permanently persisted that has enough data to overlay onto the object.
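The overlay requirement can be illustrated with a minimal sketch: persisted client-write deltas are applied on top of whatever the promote later reads back from the backing pool. This is a hypothetical illustration, not Ceph code:

```python
def overlay(base, deltas):
    """Apply persisted write deltas, each an (offset, bytes) pair in
    arrival order, on top of the object data read from the backing pool."""
    data = bytearray(base)
    for offset, chunk in deltas:
        data[offset:offset + len(chunk)] = chunk
    return bytes(data)
```

As long as the deltas are durably persisted before each write is acknowledged, the promote can complete later and replay them to reconstruct the up-to-date object.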
E
I think probably any of those operations that are non-trivial partial writes, we would just block until the promotion completes, at least until we decide that that particular category of operation is so important and common that we should add the complexity to deal with it. And once you block one of those ops, then you have to block all ops, at least from the same client. Yeah, all the same writers, yeah; the client ordering, whatever it is, will still remain, right.
F
So right now, if you're doing a cache tier, you might use like 3x replication, right? So you've got write amplification when promoting for a read. Is there any reason that you can't do something like separating out read promotions versus writes, so that read promotions go to some kind of pool that's not replicated?
G
You'd have to move it over... I mean, you could do that if you happen to have that information a priori. Right now, all you have to do is configure your cache pool to be one replica, and...
G
Actually, I'm not so sure, because, well, I mean, how many people... well, okay. So RBD will pretty much never use a cache pool for read caching; there'd be no point, since it's got page caches all over the place. So RBD will always be a write caching situation, whereas radosgw may genuinely be a read caching situation a lot of the time, and for a read caching situation that might make sense. Yeah.
E
Okay, so, sorry, coming back to your actual proposal: I think the general feeling is that page-granularity write caching is possible, and there are situations where it would help, but it's really hard. This is my take. And there are a whole bunch of other things that we can do that are going to help a lot of the same cases, and that we should do first, before we do the really hard thing.
G
Of this, the interaction with, ah, the interaction with snapshots comes to mind as the most difficult. For one thing, when you receive a read on an object that's in this partial state, let's say you're reading the first megabyte of the object, and you've written random 4K chunks throughout that one-megabyte piece. When you perform the read, the primary OSD will then have to perform the same read on the blocks that have been promoted, then overlay...
G
Object classes can't do asynchronous reads, so that's a thing, but we can get around that by just not allowing it. Snapshots are tricky because you might receive a write with an updated snap ID while you're in this state, so you'll have a snapshot of a partial object. Then, when you finish the promote... well, let's say later, when you go to write them out, you have to write out diffs, or if you chose to promote the whole object, you'd be promoting into an otherwise immutable snapshot. Oh yeah!
G
So if you only had parts of the clone written in, because that was the state it was in when you got the new snapshot, then when you receive reads on other pieces of that clone that you can't satisfy, you either have to keep going to the base pool, or, if you choose to promote the whole object because it's really hot for some reason, you'd have to modify the clone where it's sitting in the cache tier, which doesn't sound hard, but it surprisingly is. It's all doable, though.
H
For example, when we have a promote for a whole object and the client op is a 4K write, we can issue the promote, and once the 4K touched by the client side has been promoted, we can reply to the client for the 4K data first, and then fill in the 4M of data in the background.
E
We either block and wait for a full promote, or do another partial promote, or whatever, and then we just have to... then the only thing you have to implement is the complexity of the partial object and the promote that sort of merges the two together. That seems to me like kind of the simplest thing, and then you can always add more on to that to handle more complicated types of operations. But I think, even with that sort of minimal strategy...
E
...I think these are the things we should do first, because they're way, way, way easier, and I think they're going to help a lot. So that's changing the read behavior to forward during the promote, which will completely avoid the latency cost for promotions on read; I think that'll basically go away, right. And then also we can have the ability to proxy writes, so we have the choice not to promote on a single 4K write.