From YouTube: Crimson/SeaStore Meeting 2022-07-13
Description
Join us weekly for the Ceph Crimson/SeaStore meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A: All right, let's get started, I think everyone's here. Let's see, for me this week it's been some reviews. The second multi-core PR is very nearly ready; I should have it out tomorrow, and then I'll start working on the last one, which is actually the multi-core one. How's it going?
B: Yeah, hi everyone. So last week I have been looking at the review comments. Thanks for the review comments, Leo and Sam. I have addressed most of those comments, and I'll push the changes with that, and on top of that I was able to, you know...
B: Like I mentioned last week, I was able to get the OSD to boot on the ZNS null_blk device, and then I was looking at how to run IOs on top of that. I ran rados bench write workloads on it, and I was able to do the writes on the ZNS drive.
B: I was able to see the writes going to the drive and everything, but I have hit another issue. The device I'm keeping is a memory-backed device of about 16 GB, and after I run the writes for about two minutes on that 16 GB device, I think the GC kicks in. Once the GC starts, I see that the actual IO from rados bench kind of stops, in the sense that rados bench's IOs are not happening, but the GC starts doing all these reads and writes: it starts reading from the filled-up zones and writing to newer zones, and that seems to go on kind of forever; it's not exiting from GC.
B: So a couple of questions. Is it expected that when GC starts... because I see in the code that we block on GC. When GC is happening, do we stop the IOs? That's one thing.
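For illustration, a minimal sketch of the kind of gating being asked about, in Python rather than the actual SeaStore cleaner code: client writes wait whenever free zones drop below a reserve while the cleaner reclaims space. The class name, the reserve value, and the zone accounting here are assumptions, not the real implementation.

    import asyncio

    # Sketch only: not SeaStore's cleaner, just the "block IO while GC runs" pattern.
    class CleanerGate:
        def __init__(self, total_zones: int, reserve: int = 8):
            self.free_zones = total_zones
            self.reserve = reserve                  # assumed free-zone reserve
            self._space_freed = asyncio.Event()
            self._space_freed.set()

        async def wait_for_write(self, zones_needed: int = 1) -> None:
            # Client write path: wait here while reclaim is still making space.
            while self.free_zones - zones_needed < self.reserve:
                self._space_freed.clear()
                await self._space_freed.wait()
            self.free_zones -= zones_needed

        def on_zone_reclaimed(self) -> None:
            # Cleaner path: wake any blocked writers each time a zone is freed.
            self.free_zones += 1
            self._space_freed.set()

If the cleaner never actually frees a zone, writers stall indefinitely, which matches the symptom described above.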
A:
B: Yeah, that part I get, but what I'm saying is I have a small device, with each zone 64 MB and 256 zones, and when I start rados bench it starts writing and it goes up until about 190 zones; it completes writes to 190 zones. I can see that with, you know, certain tools that check the zone information.
B: So with those tools I can see that that number of zones are full, and once it reaches about 190 zones, rados bench stops reporting the current MBps; it goes to zero. Until then, I can see that it is writing at a certain speed, and then it goes to zero. I still have around 50 to 60 zones free, and then GC... I can see the debug log messages of GC happening.
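For context, the test device described above (16 GB, memory-backed, 64 MB zones, so 256 zones) can be recreated through the kernel's null_blk configfs interface; the sketch below assumes null_blk is loaded with nr_devices=0, that the size and zone_size attributes are given in MB, and the device name zns0 is arbitrary.

    from pathlib import Path

    # Sketch: a 16 GB, memory-backed, zoned null_blk device with 64 MB zones.
    dev = Path("/sys/kernel/config/nullb/zns0")   # arbitrary name; needs root
    dev.mkdir(exist_ok=True)

    settings = {
        "size": "16384",        # total size in MB (16 GB / 64 MB = 256 zones)
        "memory_backed": "1",   # RAM-backed, as in the experiment above
        "zoned": "1",           # expose it as a zoned (ZNS-style) block device
        "zone_size": "64",      # zone size in MB
        "power": "1",           # written last: instantiates /dev/nullb*
    }
    for attr, value in settings.items():
        (dev / attr).write_text(value)

Zone fill state can then be watched with a zone reporting tool such as blkzone report while rados bench runs.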
A:
B: Yeah, so I saw those parameters; I forgot the exact numbers, but I saw that you can tune the GC parameters. Currently I'm debugging that, so that's where I am; I'll ask if I need more help.
C: Sorry, I think for the ZNS PR there are some configurations that are discovered by the ZNS segment manager, but that information hasn't been propagated to the cleaner, the segment cleaner. Maybe there's a gap there, possibly.
A:
C:
A:
B: Yeah, yeah, so I did try to do the same experiment with a normal block device, I mean the same null_blk device but a regular one instead of a zoned device. I created a normal block device of 16 GB and then I kind of saw the same behavior: after a certain number of writes, the writes stopped, and from the logs I could see that it has written...
B:
A:
C:
B: No, yeah, I kind of expected that. I wanted it to, you know, fail with the out-of-space error, right? I wanted to see that. So that's why I was trying to.
A: It's not enough. An OSD that's out of space doesn't mean the cluster is out of space, right? It's much, much more complicated than that. With normal classic OSDs, they report usage statistics to the monitors, and once those get full enough the monitors will declare the pool full and client writes will start to fail. But some of that machinery, I don't think, is in place in Crimson. You are probably not going to see any no-space error.
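As a rough sketch of the flow just described, purely for illustration and not Ceph code: OSDs report usage to the monitors, the monitors flag the pool once usage crosses a full ratio, and client writes are then rejected. The names and the 0.95 threshold are assumptions.

    from dataclasses import dataclass

    FULL_RATIO = 0.95  # assumed full threshold, for illustration only

    @dataclass
    class OsdUsageReport:
        osd_id: int
        bytes_used: int
        bytes_total: int

    class MonitorModel:
        """Toy model of the monitor-side bookkeeping described above."""
        def __init__(self) -> None:
            self.pool_full = False

        def handle_usage(self, report: OsdUsageReport) -> None:
            # One sufficiently full OSD is enough to flag the pool in this toy model.
            if report.bytes_used / report.bytes_total >= FULL_RATIO:
                self.pool_full = True

        def admit_client_write(self) -> bool:
            # Once the pool is flagged full, new client writes are refused (ENOSPC).
            return not self.pool_full

The point above is that Crimson does not yet wire this up, so the test stalls rather than failing with a no-space error.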
A:
B: Right, yeah, right. I'm looking into the GC code now, trying to debug that. Yeah, that's all I had, thanks.
A: All right, agent, how's it going?
C: The first is the split of object data blocks, and that is merged, and there are follow-up works to do. I will try to further improve the conflict detection during reclaim. I also simplified the random block manager circular journal and reviewed the async cleaner trimming, and I think we probably can split the cleaner implementation for the random block manager and the segment manager. I will do a second review this week, and before HDD I think it might be good to reach consensus on the overall architecture.
A: Okay, cool. Should we...
C: I'm working on the list-objects backend work, still modifying the PR according to the comments, and I still need to debug the enumerate-objects part; there is still a bug there and I need to get back to it. So that's all.
C: Okay, Israel.
D: Oh, last week I ran into two parts of a dereference bug when I was developing the physical btree optimization. I've separated out a PR for the first part, and the second part is there, and the code to fix it is... I haven't submitted the PR for the second part, but I think it's getting closer. I'll also put up the PR for that within the next one day or two. After that, I will be back to implementing the optimization for the physical btree.
A: All right. I thought we could discuss the hard disk stuff a little bit, but does anyone have anything else they want to talk about other than that?
A: So, Shayhan, I thought I'd ask you: what is your goal for using the random block manager for HDDs? I had assumed that the whole point was to do writeback of dirty extents.
D: Actually, I think, or we think, that if the NVMe devices are large enough to hold all the data, then we do not have to write all the data back to the HDD; the data may still be accessed, even though not frequently. I think if there is available space on the NVMe devices, then why don't we put the cold data there as well?
D: We think that it's only when there is not enough space on the NVMe devices to hold all the data that we have to put the cold data onto HDDs. I don't know if this is reasonable.
A: Well, that certainly makes sense. I have two questions, though. One: I really want to support clusters that only have hard disks, so whatever design choices we make here have to make sense in that context.
D: Actually, what we were thinking is that for data overwrites that are 4K-aligned, we do not use mutation; we just use this extent-specific function. We treat the overwritten data as new extents, and when writing those new extents back to HDDs...
D: Oh, we have to put them together with the old, larger one, so the old one is not split forever on the hard disk; it's just split temporarily while there is overwritten data on the NVMe devices.
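A rough sketch of that scheme, illustrative only and not the SeaStore implementation: a 4K-aligned overwrite of an HDD-resident extent is kept as a new extent on the fast tier, and writeback merges the pieces back into the old extent so the split is only temporary. All names here are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Extent:
        obj_off: int      # logical offset within the object
        length: int
        tier: str         # "hdd" or "nvme"
        data: bytes

    def overwrite_4k_aligned(old: Extent, off: int, data: bytes) -> list[Extent]:
        """Overwrite part of an HDD extent: keep the old extent, add a hot piece on NVMe."""
        assert off % 4096 == 0 and len(data) % 4096 == 0
        hot = Extent(obj_off=off, length=len(data), tier="nvme", data=data)
        return [old, hot]                      # the old extent is only logically shadowed

    def write_back(old: Extent, hot_pieces: list[Extent]) -> Extent:
        """On writeback, merge the hot pieces into the old extent so it is whole again on the HDD."""
        buf = bytearray(old.data)
        for piece in hot_pieces:
            start = piece.obj_off - old.obj_off
            buf[start:start + piece.length] = piece.data
        return Extent(obj_off=old.obj_off, length=old.length, tier="hdd", data=bytes(buf))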
A: So, okay, there are two pieces here. The first is, even if you are performing... so I think part of what you were just saying is that when we get a mutation to an extent located on a hard disk, we split the extent and we write the newly split extent to the faster tier, since it's hot. That's likely a good use of time, right? Am I understanding that correctly? So I agree with that; there will be scenarios where that's the right heuristic. When we perform writeback, though... sorry.
A: I have a couple of reasons for that supposition. If you don't do it that way, then you're going to end up with very, very sparse random free space on the hard disk. This is a worst-case scenario: not only will you have a hard time finding free space to do large, contiguous writes, you will also be unable to do large contiguous reads, because your data will be heavily fragmented.
A: The advantage of something like a garbage collection system, or any log-structured file system in this context, is that the garbage collector will allow you to do large, sequential writes, which lets you get full bandwidth out of the hard disk, which you wouldn't be able to get otherwise, and during garbage collection, if we add the right heuristics, we can defragment the relevant extents.
A:
D:
A:
D:
A: And my instinct on all of this is to do as little HDD-specific code as possible, just because we don't want any device-specific code that we can avoid. One advantage: hard disks share essentially the same access mechanisms as any normal SATA SSD, so in that sense they're not special. The only area I think we need special heuristics for is allocation.
A: So if we choose the strategy I outlined above in this thread, we end up with an RBM implementation that can tolerate hard disks, and we can implement better allocation strategies for hard disks as we measure, while also retaining the ability to use a block segment manager for them. So we can test out both strategies and evolve from there, and I think that's the direction we should go.
C:
A: The counterargument is that for hard disks, doing an out-of-line write, or doing a non-sequential write, cuts the device throughput by a huge factor, like a hundred thousand; it's a ton, right? So you're actually willing to pay a lot of write amplification on a hard disk to avoid random writes.
C:
A: Well, except that if the NVMe, if the faster tier fills up, then you become bound by the write speed of the slow tier, right?
C: Yeah, but usually the writing is very sequential. It collects a lot of data and writes it together as a transaction to the cold tier, so it will exploit most of the bandwidth of the hard disk.
A:
D: Yeah, but I think that will mean that when we're writing data back to HDDs and each data piece is relatively small, we might end up making logically continuous data scattered on the HDD devices, and when there are sequential reads for logically continuous data, that kind of sequential read will be turned into random reads to the HDD devices, and I think that will be a problem.
D: If we want to support, say, databases using LSM trees or other big-data applications, do you think that's reasonable?
A: That's true, but the RBM device will make that worse, not better. The best solution to that problem is to modify the garbage collector to perform large, contiguous, logical writes. So, concretely, that means when we're garbage collecting an extent that is part of an object, we notice that it's an extent that's part of an object, and instead of performing a 4K write back to the cool tier, we check to see whether it makes sense to also write the adjacent four megabytes back, wherever they happen to be located.
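A small sketch of that heuristic, again illustrative rather than actual cleaner code: when relocating an extent that belongs to an object, gather the logically adjacent extents of the same object, up to an assumed 4 MB window, and write them back to the cold tier as one contiguous write. The helper name and the window size are assumptions.

    # Illustrative heuristic only (hypothetical names): coalesce neighbouring extents
    # of the same object when the cleaner relocates one of them to the cold tier.
    COALESCE_WINDOW = 4 * 1024 * 1024   # assumed 4 MB defrag window

    def plan_writeback(victim, object_extents):
        """Return the extents to rewrite contiguously instead of just the 4K victim.

        victim: (obj_off, length) of the extent being garbage collected
        object_extents: sorted list of (obj_off, length) extents of the same object
        """
        lo = victim[0] - COALESCE_WINDOW // 2
        hi = victim[0] + victim[1] + COALESCE_WINDOW // 2
        batch = [e for e in object_extents if e[0] >= lo and e[0] + e[1] <= hi]
        # Write the whole batch as one sequential write on the cold tier, so the
        # object's logically adjacent data stays physically adjacent as well.
        return sorted(set(batch + [victim]))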
A: I think for different use cases it may actually make sense to allow overwrites on a hard disk, so I think we probably want to experiment with both.
B: Oh, one quick question: there are these SMR hard disks, if you guys are aware of them, and some support is already there in BlueStore and the like. So if at all Crimson OSD is going to support HDDs as a slower tier of storage, is it a possibility to add SMR to that as well, because it mostly works like a ZNS drive?
A: Yeah, I would say we would just create an SMR implementation of SegmentManager, easy, right? I don't know if it'll work well, but it should be straightforward, in that, yeah, you're right, they behave very much like ZNS devices. Concretely, though, I haven't seen very much from the SMR hard disk concept in the last couple of years; it seems like it didn't really catch on. Am I wrong about that?