From YouTube: CDS Jewel -- Cache Tiering
A: Fine, hey. I'm trying to browse my way through the blueprints really quick, but, oh right, you were on the same blueprint as Allen, also the whole SanDisk group, yeah. Do you want to kick yours off first? I think yours fell top to bottom; you're the first one. Yes, then, what was next? It was yours and the Intel one, then the SanDisk one, so we'll run through them in that order, and then we can beat each other up on cache tiering as we go along.
C: My understanding is that this pretty much allows them to measure the cache-miss cost associated with reading directly from the cache tier or from the base tier, rather, for reading proxied from the base tier. And what they found was that, for replicated pools at least, on random reads the performance was actually pretty good in that case, because it turns out that a lot of the poor performance we were seeing was because of extraneous promotions.
C: So that would suggest that what we really want to do is focus on when we want to do promotes. So what they're suggesting is remembering more explicitly the most recent operations, devoting some memory to remembering the most recent N operations, so that we can focus our efforts on promoting objects that really have been accessed a lot of times recently. The hit sets have limited granularity, since their main purpose is to detect cold objects, but they don't do a great job at differentiating among the hotness of hot objects. So they suggest we add a queue, an MRU queue, and trigger asynchronous promotions from the head of that list. We basically want to throttle the number of promotions such that we're devoting a certain percentage of the cache tier's write throughput to promoting from the base tier.
C: So it occurred to me that one way we could do this is we could tune the promotion threshold, or non-promotion threshold, based on how many promotions we've done recently. So if we've used a lot of I/O recently on promotions, we could become more selective, and vice versa.
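
A minimal sketch in Python of the two mechanisms being discussed: an MRU window of recent operations to identify genuinely hot objects, plus a promotion budget expressed as a fraction of the observed cache-tier write throughput. All names (PromoteThrottle, RECENT_N, budget_ratio) are illustrative assumptions, not Ceph's actual implementation:

```python
from collections import OrderedDict

RECENT_N = 1024          # size of the recent-operations window (illustrative)
MIN_RECENT_HITS = 3      # hits within the window before we consider promoting

class PromoteThrottle:
    def __init__(self, budget_ratio=0.25):
        self.recent = OrderedDict()       # object id -> hit count, MRU at the end
        self.budget_ratio = budget_ratio  # share of cache-tier write bandwidth
        self.promoted_bytes = 0
        self.cache_write_bytes = 0

    def note_op(self, obj):
        self.recent[obj] = self.recent.pop(obj, 0) + 1   # re-insert at MRU end
        if len(self.recent) > RECENT_N:
            self.recent.popitem(last=False)              # forget the oldest entry

    def record_cache_write(self, nbytes):
        self.cache_write_bytes += nbytes

    def record_promote(self, nbytes):
        self.promoted_bytes += nbytes

    def should_promote(self, obj):
        hot = self.recent.get(obj, 0) >= MIN_RECENT_HITS
        # Become more selective once promotions have consumed their share of
        # the observed cache-tier write throughput, and vice versa.
        within_budget = (self.promoted_bytes
                         <= self.budget_ratio * max(self.cache_write_bytes, 1))
        return hot and within_budget
```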
C: Another thing that came through is that, with this approach, the sequential I/O from the erasure-coded tier, or if you have an erasure-coded tier as a base tier, was much, much worse, which is not terribly surprising, since a lot of reads had to go through the EC base tier before we were willing to promote. So that further suggests that, even in that case, we want to do a good job of detecting obviously sequential I/O.
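
A minimal sketch of the kind of sequential-read detection suggested here: remember where the previous read on each object ended and treat a read that continues from there as sequential, skipping promotion for it. The structure is a hypothetical illustration, not Ceph's readahead logic:

```python
class SeqDetector:
    def __init__(self):
        self.last_end = {}   # object id -> end offset of the previous read

    def is_sequential(self, obj, offset, length):
        # Sequential if this read starts exactly where the last one ended.
        seq = self.last_end.get(obj) == offset
        self.last_end[obj] = offset + length
        return seq

det = SeqDetector()
det.is_sequential("o1", 0, 4096)      # False: first read of the object
det.is_sequential("o1", 4096, 4096)   # True: continues the previous read
```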
F: Okay, some time ago we did some testing on the cache tiering using the fio Zipf distributions. We have 4-megabyte objects and the total data size is 400 gigabytes, and for the Zipf distribution fio has a tool to estimate how much of the data will be hot, a tool called genzipf, and for this tool we used the parameter 1.2. Then we can get a line.
F: Ninety percent of the data will be hit in... no, no, I'm sorry: over ten percent of the data will be hit over eighty percent of the time. So from this point of view, if we have the cache size set accordingly... because our data size is 400 gigabytes, if we set the cache size to be 40 gigabytes, then we will have good performance from the cache tiering. And in our testing, we set the cache size to be 80 gigabytes.
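
For reference, a rough Python stand-in for the estimate described above (fio's genzipf tool does this properly; this is only an illustrative approximation): for a Zipf(theta) access pattern over N objects, compute what fraction of accesses lands on the hottest X% of the data:

```python
def zipf_hot_fraction(n_objects, theta, hot_pct):
    # Zipf: probability of accessing the i-th hottest object ~ 1 / i**theta.
    weights = [1.0 / (i ** theta) for i in range(1, n_objects + 1)]
    total = sum(weights)
    hot = sum(weights[: max(1, int(n_objects * hot_pct))])
    return hot / total

# 400 GB of 4 MB objects = 102400 objects; theta = 1.2 as in the test above.
print(zipf_hot_fraction(102400, 1.2, 0.10))  # fraction of hits on hottest 10%
```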
F: In the current implementation, we do the eviction based on the age of the object. That is, we have several hit sets for each PG to cover the time, and for those objects which are in a hit set, we use the hit set's timestamp to calculate the object's age. But there is a problem with this: since a hit set will cover a lot of objects, all of the objects in this hit set will get the same...
F
Get
the
same
time.
I
mean
the
same
age,
yep
same
age
and
then
I
when
we
use
this
age
to
calculate
the
to
calculate
which
one
should
be
built,
then
I,
we
didn't
get
get
a
good
result
from
this
because
I
we
have
many
objects
in
the
hills
and
he
said
then
we
have
same
age,
and
then
we
have
the
same
I
this
that
this
object
will
be
averted
at
the
same
time,
I
mean
and
the
photos
object
which
are
not
in
the
headset
I.
F: Currently we use the mtime as the age. But as we know, the mtime is the modify time; it doesn't change when we do read operations. When we do read operations on an object, we do not change the mtime, so this means that read operations don't have an effect on the age.
F: And also, when we do evictions, we calculate something like this: we use the power-of-two histogram to locate the object age. We can get an upper bound and a lower bound, and then we compare the upper bound with something called the evict effort to decide if we should evict this object. And this evict effort is a global value for a PG.
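
A small sketch of the eviction test as described: ages bucketed into a power-of-two histogram give only an upper and a lower bound, and the upper bound is compared against a PG-wide effort value. Names such as evict_effort_age follow the description above rather than Ceph's exact code:

```python
import math
import time

def age_bounds(last_access, now=None):
    # Bucket the age into a power-of-two histogram bin: the bin index only
    # tells us the age lies somewhere between 2**b and 2**(b+1) seconds.
    age = max(1.0, (now or time.time()) - last_access)
    b = int(math.log2(age))
    return 2 ** b, 2 ** (b + 1)   # lower bound, upper bound (seconds)

def should_evict(last_access, evict_effort_age):
    # Compare the bin's upper bound against a single PG-wide age threshold.
    lower, upper = age_bounds(last_access)
    return upper >= evict_effort_age
```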
F: So from these calculations, some object which is accessed later than some other object, the later object, may be evicted before the earlier object. I don't think this is quite reasonable. I think both of these issues lead to the suboptimal performance of the cache tiering.
F: So the first idea is to improve the way we calculate the age. Okay, when we estimate the age, we are actually also using the recency; we have a config value called recency, and we use the recency to make the decisions.
F: But to keep this in memory... it's not good to keep it in memory, because we may need a lot of memory to keep all that info. But persisting it together with every update is also not good, because when we do read operations, we aren't going to persist anything to the disk. So maybe persisting the atime into the...
F: Into the key-value store, so maybe that's a good idea. If we have this atime, then we can use this atime to calculate the age, that is, the recency of the object, and then we can make more accurate decisions to evict the object.
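
A minimal sketch of the idea, with hypothetical interfaces: buffer recent atimes in memory and flush them to the key-value store in batches, so plain reads don't each pay a synchronous persistence write:

```python
import time

class DictKV:
    # Trivial stand-in for a key-value store with a put(key, value) interface.
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value

class AtimeBuffer:
    def __init__(self, kv_store, flush_every=1024):
        self.kv = kv_store
        self.pending = {}             # object id -> last access time
        self.flush_every = flush_every

    def on_read(self, obj):
        # Reads only touch memory; persistence happens in batches.
        self.pending[obj] = time.time()
        if len(self.pending) >= self.flush_every:
            self.flush()

    def flush(self):
        for obj, atime in self.pending.items():
            self.kv.put(f"atime/{obj}", str(atime))
        self.pending.clear()
```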
F: The second idea is to record something called the reuse distance. The reuse distance is defined as: we access the object, then later we access some other objects, and after some time we access the first object again; we call the reuse distance the number of accesses between these two accesses of the object.
F: The idea here is that we have many hit sets; we can set the number of hit sets per PG, and each hit set for a PG covers some time, or covers a number of objects. And if we have enough hit sets, then we can calculate the reuse distance. Say the object is accessed in hit set one and hit set six...
F: If we can find the object only once in all of the hit sets, that means that the reuse distance of this object is infinite. If we can't find it in any of the hit sets, it is infinite also. So we can use this to calculate the reuse distance, and then we can also use the power-of-two histogram to make the eviction decision.
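
A sketch of deriving a reuse distance from the per-PG hit sets as described: with hit sets ordered newest-first, the reuse distance can be approximated by the number of hit sets separating the two most recent appearances of the object, and treated as infinite when the object appears in at most one of them. The structures are simplified assumptions:

```python
def reuse_distance(obj, hit_sets):
    # hit_sets: ordered newest-first; each is a set of object ids.
    appearances = [i for i, hs in enumerate(hit_sets) if obj in hs]
    if len(appearances) < 2:
        return float("inf")   # seen once or never: treat as infinite
    # Distance between the two most recent appearances, in hit sets.
    return appearances[1] - appearances[0]

# Example: four hit sets, newest first; "a" appears in sets 0 and 2.
hit_sets = [{"a", "b"}, {"c"}, {"a"}, {"d"}]
print(reuse_distance("a", hit_sets))   # -> 2
print(reuse_distance("d", hit_sets))   # -> inf (seen only once)
```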
F
F
There
are
some
of
the
some
of
the
annual
list,
which
are
impractical
use.
What
is
called
a
RC,
that
is,
this
one
is,
is
used
in
the
IBM
high
in
a
storage
system,
and
there
are
some
others
called
some,
such
as
AI
is
and
they're
also
using
some
doing
some
systems.
I
think
we
can
also
make
use
of
this
these
algorithms
to
make
a
better
eviction
division
I
in
the
linux
kernel.
They
also
used
to
some.
They
also
use
the
tool
list
to
check
the
the
informations
and
then
make
the
eviction
tradition.
B: Do you have a good feel for how much of the problem is caused by poor handling of the information that's there (you talked about using atime, maybe having deeper hit sets with the different distances) versus how much of the problem is being caused by individual bits in the hit set essentially merging the behavior of multiple objects, because of the resolution?
F
For
the
first
two
ideas,
I
think
it's
it
would
be
not
that
complicated
the
the
overhead
would
be
not
that
too
much
I
for
each
operations
and,
let's
say
further
a
time
I
wanted
to
avoid
operations.
We
need
to
update
the
a
time
all
right.
This
is
a
case
since
we
need
to
to
write
to
persistent
the
object
in
for
two
entities
and
a
ladder
we
put
if
we
put
Adam
in
two
key
value
stores:
I,
don't
think
that
is
too
much
overhead.
F: All right, yeah, for reads, right, yeah, sorry: for reads we do need to pay. There is some additional overhead, since we need to persist the atime into the key-value store. That depends, but because this information is much smaller, I think the time is affordable. It's doable.
C: We would need to issue an actual write op, so that would be difficult to batch. Another approach would be to write them to a write-aside atime buffer, maybe of the most recent N transactions, but I wonder if that's really different from using something like an LRU variant, or (I briefly started reading the LIRS paper) whatever that is combined with a hit count and an atime, and periodically snapshotting that object.
C: ...to the replicas. I'm skeptical about the value of keeping it only in memory, because I worry about, after an interval change, having drastically different cache behavior, which would be bad. So it would be nice if, at least periodically, we were able to snapshot our current caching, or current recency, information.
F: By the way, I read something on the internet: the latest kernel also has two LRU lists, one to hold those pages which are accessed once recently, and another to hold those pages which are accessed more than once recently. And they still have some ideas to improve this two-LRU-list design; they are also considering using other algorithms like these.
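
A simplified sketch of that two-list scheme (modeled loosely on the Linux kernel's inactive/active page lists, not the kernel code itself): the first access puts a page on the inactive list, and a second access while it is still there promotes it to the active list, so one-shot scans cannot flush the genuinely hot pages:

```python
from collections import OrderedDict

class TwoListLRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.inactive = OrderedDict()   # pages accessed once recently
        self.active = OrderedDict()     # pages re-accessed while inactive

    def access(self, page):
        if page in self.active:
            self.active.move_to_end(page)   # refresh position on active list
        elif page in self.inactive:
            del self.inactive[page]
            self.active[page] = True        # second recent access: promote
        else:
            self.inactive[page] = True      # first access: inactive list
        self._balance()

    def _balance(self):
        # Evict from the inactive list first; hot pages on the active
        # list survive a one-shot sequential scan.
        while len(self.inactive) + len(self.active) > self.capacity:
            victim = self.inactive or self.active
            victim.popitem(last=False)      # drop the oldest entry
```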
D: The primary goal would be to support eviction and promotion scheduling for objects, and also to support certain objects bypassing the cache tiers if required. This would coexist with Yehuda's bucket-expiration work that he proposed in the previous blueprints. What we would basically be coming up with is a new policy engine which ties all these together. Okay, the policy engine is what would manage the rules. RGW would talk to the policy engine and figure out...
D: ...what is the policy to be enforced, and these policies would be set on the objects through xattrs; we would call this stamping, or tagging with policies. We could also come up with different loadable policies, and the RADOS classes would have support for that too, but the default policy engine that we would come up with should be sufficient to support most of the workloads. Any questions, please, please feel free to interrupt me.
D: The work in the tiering agent would be to understand these new extended attributes of objects, and to make sure that the tiering agent would be configurable to invoke a crawl on the cache tier to pass over all the objects which require these rules to be enforced. Also, the tiering agent would be enhanced to evict an object not just to the base tier, but to named tiers, in case there's more than one layer of tiering at the end.
D: If the user would want eviction not to the base tier but to a named tier, we would support that too, identifying the pools either by names or through the pool IDs. In the future, promotion of objects should also be possible, for the simple reason that there might be workloads where you would need certain objects to be present in the cache tier to improve the performance, so we would try to tackle that too in the next installments. The rules basically would be set for objects, buckets, pools, and a global one. The object...
D: If you want to specify a particular rule for an object, it would be through HTTP headers in RGW when you do a PUT. If the headers don't specify rules, then the rules set on a bucket, or a pool, or a global one, in this hierarchy, would come into effect, and the policy engine tags or stamps the objects. Now, again, pools are identified by name or ID, and if lookup of a particular pool name or ID fails, then currently we would resort to erroring out the request. In the future...
D: ...you could also define a default pool to which the object would go in case the pool name is not specified. Right, we would identify rules based on pools; there is something like, you know, the pool name, followed by the rule and the duration, which would say what's the maximum duration an object can live in.
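
A hypothetical illustration of the kind of rule and lookup hierarchy being proposed (object, then bucket, then pool, then global), with a target pool name and a maximum duration in the cache tier. The rule format and all names here are invented for illustration, not a committed design:

```python
RULES = {
    "object": {},   # would be populated from HTTP headers on PUT
    "bucket": {"photos": {"pool": "ssd-cache", "max_duration_s": 3600}},
    "pool":   {"rgw-data": {"pool": "base", "max_duration_s": 86400}},
    "global": {"pool": "base", "max_duration_s": 604800},
}

def resolve_rule(obj=None, bucket=None, pool=None):
    # Walk the hierarchy from most to least specific; the global rule is
    # the fallback when nothing narrower matches.
    for scope, key in (("object", obj), ("bucket", bucket), ("pool", pool)):
        rule = RULES[scope].get(key) if key else None
        if rule:
            return rule
    return RULES["global"]

print(resolve_rule(bucket="photos"))   # bucket rule wins over pool/global
```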
D: So that's kind of what we are proposing, where we would give the user much more fine-grained control: making sure that the objects exist in the cache layer for a certain particular duration, and after that you would move it to a particular pool, or, you know, even mark it for deferred deletion. That's our overall idea. Any questions, suggestions?
B: No, I think you did a pretty good job of explaining that. You know, the goal is to let people administratively control where the objects go. I think the two principal things that people would want to do would be to send a subset of their objects directly to the base tier, skipping the cache tier (you could see that being done for large objects, or objects with certain filename patterns; that was in the proposal), and the other thing would be to administratively expire objects out of the tier.
C: Yeah, so, well, that's another thing. I would like it if this stuff were also queryable around scrub, because it may be that we want to trigger other operations as well. Although, come to think of it, I guess object expiration specifically would be a scrub-level concept rather than a cache-tiering process, but only because you might well do object expiration on a pool that does not have a base tier and therefore no cache-tiering agent scanning in the background. That's not a big divergence.
C: Yeah, I'm not sure, but to the important part: there's no reason in principle why the OSD couldn't simply remove it, or the PG could initiate an asynchronous operation which the OSD does on its behalf that simply runs whatever RGW would normally do, that is, a registered RGW class operation that's capable of making object recalls. I don't think there's an inherent problem with having the OSD do that.
C: ...enough to get around. Oh, so, you're right, none of that exists yet. It's not really a blocker that the base tier, or that the cache tier, is always full. It's just a matter of: we set up an operation where the base tier notifies the cache tier that it wants it to pull in this object, and the cache tier does it. The other piece, that part, would be done in scrub.
C: Yes, there is, because checking the metadata involves talking to the replicas. That's why the cache tiering agent is separate from scrub. Which is not to say that you wouldn't necessarily also want to scrub, but it means that if you're just trying to apply policy decisions to the objects, you don't need to talk to the replicas, so you can just have the primary scan. So with a 3x replica pool, you do one-third the work, yeah.
C: Yeah, okay, there are a few other differences. Scrub goes to some effort to ensure consistency of the result, so it locks the extent of objects that it's scrubbing at any particular point, which you definitely don't want to do for a tiering agent, or for a policy agent, come to think of it. So that's an argument for not using scrub, yeah.
A: Sounds like a fair bit of quiet there, so, excellent. Did you guys want to run through Narendra's blueprint at all? Is there anyone here that would like to represent that, or chat through the cache-tiering efficiency of read-miss operations, or should we just let him catch up with Sage on the back end?
A: I guess that concludes it. So thank you, everybody, for another great Ceph Developer Summit. Obviously we'll continue these discussions on the lists and in IRC, and then obviously tickets and pull requests as normal. So if you have any questions about the summit, let me know; otherwise, the videos should be posted sometime next week, and we look forward to seeing you for the K release summit in a few months. Thanks, everybody.