From YouTube: Ceph Crimson/SeaStore Meeting 2023-03-08
Description
Join us weekly for the Ceph Crimson/Seastore meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contrib...
What is Ceph: https://ceph.io/en/discover/
A: Okay, all right, let's see. For me this week: I was on PTO last week, and I'm going to be working on classic scrub, and then Crimson scrub, for the next several weeks. Yingxin, how's it going?
C: I have some updates from a colleague; he hasn't joined today. The fine-grained cache is available for review, and he is also working on the max allocation size; there was a bug blocking him, and it's not clear yet what is going on with the max allocation size feature. My own work this week is mostly about reviewing the LBA tree optimization: I will take a look at the code proposal this week. I also plan to allocate more effort to the multiple messenger work.
E: Hey, sorry, I do not have any major updates.
E: Yeah, so what I was saying is: for the last couple of weeks, all the Crimson images that I am getting from Sam have this particular issue where, when we are trying to configure the cluster and add the nodes, it throws an exception. I've already raised a ticket for this and spoke with Nedson. He informed me that this is not an underlying Crimson issue, but rather an issue with the Reef build itself, and we are not yet sure.
A: Weird. Yeah, he also mentioned something about it; it must be a problem there.
A: Okay, Kevin?
G: Last week I was working on the system modifications, and all the modifications are done. I ran the rados bench write/read/rebuild tests for the SeaStore system install, along with the corresponding unit tests, and it all works, so please review. Okay, that's all.
A: Okay, Rocky?
B: Sorry. Okay, this week I was mainly modifying the LBA pointer, following Yingxin's suggestion; that's all. I also assisted my colleague Johnson in writing the design document for the non-volatile data cache.
A: Does anyone else have anything else to discuss before we continue on to talk about the hot data non-volatile cache?
A: All right: do you want to put a link to the document in the chat?
B: Okay. Okay, sorry.
A: No worries.
B: The purpose of this document is to describe a mechanism that we propose for implementing the hot data cache function in SeaStore, in which we cache data that are frequently accessed, or were frequently read in the past. That's the purpose of this document.
B: As for the design principles: first, I think that to cache data in the hot tier, we need to utilize the locality of the application's access pattern on the data. In Ceph, the upper-layer applications' logical address space is partitioned into objects at the RADOS level, and those objects are hashed onto a set of PGs, which are scattered across the cluster.
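The hashing step just described can be sketched with a toy model. This is not Ceph's actual placement code (which hashes object names with ceph_str_hash, maps them through stable_mod, and then through CRUSH); the hash function and the pg_num value here are stand-ins purely to illustrate the scattering effect:

```python
import zlib

PG_COUNT = 128  # an assumed pg_num, purely for illustration

def object_to_pg(object_name: str) -> int:
    """Toy stand-in for RADOS placement: hash an object name onto a PG.
    The point is only that consecutive object names land pseudo-randomly
    on different PGs, so cross-object locality is lost."""
    return zlib.crc32(object_name.encode()) % PG_COUNT
```

Mapping a run of consecutive object names through `object_to_pg` shows them landing on many different PGs, which is exactly why locality across objects does not survive placement.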
B: So the locality of access across RADOS objects is somewhat lost. In SeaStore, the locality we can see is on the logical address space of extents within the same onode, so we think we need to construct this cache based on the logical address space within the same object. That means the cache lines can't cross the boundaries of RADOS objects.
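The boundary rule just stated can be illustrated with a minimal sketch. This is hypothetical code, not SeaStore; the 64 KiB line size is an invented placeholder. Because every cache-line key carries the object id and an offset aligned within that object, a line can never span two RADOS objects:

```python
CACHE_LINE = 64 * 1024  # assumed cache-line size, for illustration only

def cache_lines(object_id, offset, length):
    """Return the cache-line keys covering [offset, offset + length) of a
    single object. Keys are (object_id, aligned_offset), so lines are
    always contained within one RADOS object."""
    start = (offset // CACHE_LINE) * CACHE_LINE
    end = offset + length
    keys = []
    while start < end:
        keys.append((object_id, start))
        start += CACHE_LINE
    return keys
```

A read that touches two objects would be issued as two calls, one per object, which is how the "no line crosses an object boundary" invariant is kept.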
B: That's the main principle. In SeaStore, we think there are two types of data. One is the metadata of SeaStore itself: it contains the LBA extents, the backref extents, and the onode extents. The other is data from upper-layer applications, like object data and omap data. We think the access patterns of these two types of data are quite different, because data accesses are issued by upper-layer applications and should show obvious locality.
B: The metadata may not be the same. The reason is that the metadata are all B-trees, and the keys of these B-trees are basically either physical addresses, or logical addresses created based on the hash of RADOS objects. So we think that, although extents within the same RADOS object will have contiguous logical addresses, the extents of different objects won't.
B: An RBD image can have several, or tens of, terabytes, while RADOS objects are relatively small: I think basically 32 megabytes at the largest. So one RBD image can have hundreds of thousands of RADOS objects, and if only, say, five percent of its space is hot, that hot space will still contain tens of thousands of objects, which means tens of thousands of LBA leaf nodes will be accessed equally frequently.
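The "tens of thousands" figure can be checked directly. The 10 TiB image size below is an assumed example; 32 MiB is the maximum object size mentioned above:

```python
TiB = 2 ** 40
MiB = 2 ** 20

image_size = 10 * TiB      # an assumed example RBD image size
object_size = 32 * MiB     # the largest RADOS object size mentioned
hot_fraction = 0.05        # "only, say, five percent of its space is hot"

total_objects = image_size // object_size        # objects backing the image
hot_objects = int(total_objects * hot_fraction)  # objects in the hot 5%
```

This gives 327,680 objects in total and 16,384 hot ones, so a comparable number of LBA leaf nodes would all be similarly hot.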
B: This is the main difference in the access patterns of metadata and data. Based on this assumption, we think that perhaps we should cache and evict data based on the data's heat: load data from the cold tier to the hot tier, and evict it from the hot tier to the cold tier, according to how hot it is. Meanwhile, we can rely on the current tiering machinery to evict metadata, and we think the metadata doesn't have to be loaded back to the hot tier.
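The asymmetric policy described here, heat-driven movement for data but eviction-only tiering for metadata, might be sketched like this. This is illustrative logic, not SeaStore code, and the thresholds are invented:

```python
PROMOTE_HEAT = 4  # assumed threshold: accesses before promotion to hot tier
EVICT_HEAT = 1    # assumed threshold: at or below this, demote to cold tier

def next_tier(kind, tier, heat):
    """Decide which tier an extent should occupy. Data extents move in both
    directions based on heat; metadata is never promoted by reads and is
    left to the existing tiering/cleaning machinery."""
    if kind == "metadata":
        return tier
    if tier == "cold" and heat >= PROMOTE_HEAT:
        return "hot"   # promote frequently read data from the cold tier
    if tier == "hot" and heat <= EVICT_HEAT:
        return "cold"  # evict data that has cooled down
    return tier
```

The point of the sketch is the asymmetry: only the `kind == "data"` path ever returns a promotion.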
B: I think the metadata can come back to the hot tier on rewrite or mutation. Those are the main principles of our design. I don't know if there are any questions about this.
A: Oh, I'm getting a huge echo from you. It might just be that your microphone is picking up your speakers. But anyway: it's still the case that LBA leaves that are close together in the LBA tree are more likely to be accessed together, because in general they will belong to the same data, portions of the same objects. So I actually think the LBA tree is going to have just as much locality as the data blocks it references, most of the time.
A: I'm not super fond of having a separate system for data blocks. I'm okay with having a separate policy, but we already have a tiering system, and, I don't know if Yingxin wants to jump in here and disagree, but I would be more inclined to find a way to phrase this in terms of the existing tiering design.
C: Let me share my thoughts. The current tiering design, with the cold tier and hot tier, is based on generations, and the goal of generations is to minimize writes, the overall write amplification during cleaning; in other words, to make cleaning as efficient as possible. It is not designed to accelerate access to frequently accessed or frequently read extents in the hot tier, because I thought that if we extend the volatile memory cache enough, it can cover this case.
C: So I think Xuehan's proposal is to also add this responsibility to the hot tier, so that we can promote the frequently accessed extents to the hot tier.
A: There would be this extra in-memory structure for speeding up eviction from the hot tier, but that may not even be necessary; you can pursue these two things independently.
A: Look into this different eviction strategy. I'm a bit skeptical of it, because I think it's going to use quite a bit of memory, and reconstructing it on startup strikes me as mildly expensive, but I'm prepared to be convinced once you have more numbers; it's possible that once you've implemented it...
B: I have one question. The eviction of data is supposed to be based on the data's heat, right? We probably want to avoid evicting hot data to the cold tier, so...
A: So that's going to absorb some of the locality; it would be interesting to measure how much the other piece is. There are strategies other than an explicit linked list of addresses: you can use time-decaying Bloom filters to do a sort of lossy estimate of whether extent ranges have been recently read, and there are other tricks like that. You lose them on startup, of course, but that's not necessarily a big deal.
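A time-decaying Bloom filter of the kind mentioned can be built from two plain Bloom filters that rotate on a timer, so membership fades after one to two epochs. This is an illustrative sketch only; the filter size, hash count, and hash choice are arbitrary, and a real implementation would tune all three against the false-positive rate:

```python
import hashlib

class DecayingBloom:
    """Two rotating Bloom filters: inserts go to 'current', queries check
    both 'current' and 'previous'. After two rotate() calls an entry is
    forgotten, giving a coarse, lossy notion of 'recently read'."""

    def __init__(self, bits=1 << 16, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.current = bytearray(bits // 8)
        self.previous = bytearray(bits // 8)

    def _positions(self, key):
        # Derive `hashes` independent bit positions from one digest.
        digest = hashlib.blake2b(key.encode(), digest_size=16).digest()
        for i in range(self.hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "little") % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.current[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        def hit(filt):
            return all(filt[p // 8] & (1 << (p % 8)) for p in self._positions(key))
        return hit(self.current) or hit(self.previous)

    def rotate(self):
        """Called on a timer tick: discard the oldest epoch."""
        self.previous, self.current = self.current, bytearray(self.bits // 8)
```

Keying entries by extent range (for example `"objid:offset"`) gives the lossy "was this range read recently?" estimate at a small, fixed memory cost, with the caveat from the discussion that the state is lost on startup.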
A: Anyway, what I'm pointing out is that there's more than one way to do the eviction part and the heat-estimation part, using more or less memory to get more or less accuracy, and I'd encourage you to think about that when you're doing the design and try to make it pluggable, so that we can change it later.
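Making the eviction and heat-estimation part pluggable, as suggested, could look like a narrow policy interface that the cache calls into. The names here are hypothetical; an exact LRU is shown as one possible plug-in, trading more memory for exact ordering, while something like the Bloom-filter estimate could implement the same interface with less memory and less accuracy:

```python
from abc import ABC, abstractmethod
from collections import OrderedDict

class EvictionPolicy(ABC):
    """Minimal interface the hot-tier cache would program against."""

    @abstractmethod
    def touch(self, key):
        """Record an access to `key`."""

    @abstractmethod
    def victim(self):
        """Return the best candidate to evict, or None if empty."""

class LRUPolicy(EvictionPolicy):
    """Exact LRU: one OrderedDict entry per cached extent."""

    def __init__(self):
        self.order = OrderedDict()

    def touch(self, key):
        self.order.pop(key, None)
        self.order[key] = True  # re-insert at the most-recent end

    def victim(self):
        return next(iter(self.order), None)  # least-recently used
```

Because callers only see `EvictionPolicy`, the policy can be swapped later without touching the cache itself, which is the "make it pluggable" point from the discussion.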
B: I'm sorry, I didn't hear the last sentence.
A: Think about the system it'll be running in. Any RBD block device that isn't sitting behind fio will be in a virtual machine, probably running Linux, and that Linux will have a page cache as its own read cache. So you may actually find that, in real life, it's improbable that you get successive reads on the same block, because the first read is going to load it into the virtual machine's cache.
B: Yes, but the scenario that we are considering is one where we put the Ceph processes and the application processes all on the same machine, and that will make memory very expensive: a single machine may have several hundred gigabytes of memory, but a lot of applications will have to share it. So the memory that the page cache can use could be very limited.
B: No, I'm not saying SeaStore's memory. I'm saying that, since the memory that can be utilized by the page cache, or any kind of cache, is very limited, we will need the hot tier to serve as a second-level cache. Is that right?