From YouTube: CDS Reef: Crimson
Description
The Ceph Developer Summit for Reef is a series of planning meetings around the next release and some community planning.
Schedule: https://ceph.io/en/news/blog/2022/ceph-developer-summit-reef/
A
All right, let's get started. So, welcome to the Crimson CDS session for 2022. I thought I'd start by giving a very brief overview of where we are with Quincy, and a few thoughts on what our priorities should be for Reef.
A
A major focus of the last year has been on stability and deployment work, thanks to Radek. Crimson now has support in both Rook and cephadm for deployment, and we have an initial stab at a crimson-rados teuthology suite.
A
It's even got a limited selection of the thrashing, failure-injection, and all those good tests that we've come to know and love with teuthology, and we've done a lot of work on fixing the bugs uncovered by that initial teuthology testing, in particular lots of refinements to the way op ordering works and improvements to the watch/notify APIs.
A
Moving on to SeaStore, which is where a lot of development happened in the last year: we've got a new metrics framework and a set of metrics, so we can better understand how SeaStore is behaving. This is going to be really important going forward, as SeaStore becomes more feature-complete and we start focusing more on performance. So we've got metrics for the cache, for the transaction conflict rate, and a bunch of stuff in the LBA tree to help us understand the LBA allocation behavior, etc.
A
It's already proved very useful in tracking down transaction conflict and allocation problems due to incorrect hinting. A lot more work was done on SeaStore internals: there were rewrites to simplify and improve the LBA tree generally speaking, support was added for placing extents in non-journal segments, as well as internal refactoring to support multiple devices.
A
The
conflict
detection
machinery
got
rewritten
to
use
interruptable
future.
We
added
initial
dns
support,
there's
a
zns
segment
manager
that
has
been
tested
on
an
actual
dns
device.
So
that's
good
and
we
have
some
generally
put
a
bunch
of
work
into
going
down
the
list
of
radius
api
test
coverage
that
we
didn't
that
we
didn't
have
yet
on
the
performance
side
c
store
saw
some
improvements
to
lba
hinting
and
journal
coalescing,
both
of
which
improved
performance.
A
So this brings us to our top-line summary for Crimson in Quincy, which is that Crimson OSD in Quincy should support testing of RBD workloads, without snapshots, on a Crimson OSD in a single-reactor configuration, with the BlueStore, CyanStore, or SeaStore backends, deployed with Rook or cephadm.
A
So, moving on to Reef: the biggest things I see in Reef for Crimson as a whole are multi-core, snapshots, and scrub. Scrub is particularly important because we won't really be able to be confident in any other components without scrub working. Snapshots are a fairly core feature, which Matan is going to be working on; that should help us fill out the RBD use case. And multi-core is going to be critical for just being able to make use of modern hardware on a basic level.
A
Single core was useful for a prototype, but that needs to be addressed for Reef. For SeaStore, there's a ton that we'll be working on, but I think it mostly falls into a couple of big buckets: further improvements to multi-device and tiering, particularly with respect to garbage collection; support for fast NVMe devices via the random block manager; and, once again, multi-core. I think we'll achieve this initially by simply running multiple SeaStore instances against static disk partitions. This will give us a performance baseline that we can use to evaluate later changes.
A
So
that
brings
us
to
what
I
think
is
the
first
sort
of
discussion
point
I
wanted
to
get
into
today.
I
know
jinx
and
you
had
added
something
to
the
schedule
for
to
discuss
gc
strategy
so
we'll
get
to
that
next,
but
before
we
get
to
there,
I've
been
working
more
on
crimson
in
the
last
two
months,
rather
than
c
store,
and
we
need.
I
I
think
it's
important
that
we
regularize
the
way
logging
works,
so
that
it
behaves
the
same
as
it
does
in
classic.
A
So if anyone wants to volunteer to go and do that auditing, that would be excellent. And finally, vstart and teuthology log to different files in Crimson than they do in classic: we log to osd.0.stdout instead of host.log, and there are a few other differences, and I think we want to eliminate those differences as well.
A
These
seem
like
minor
things,
but
as
the
wider
community
starts
to
interact
with
crimson,
we're
really
going
to
want
to
make
sure
that
everyone's
on
the
same
page
with
debugging,
especially
when
the
differences
from
classic
aren't
actually
that
important.
The
other
piece
is
that
now
that
we're
going
to
have
a
release
where
crimson
kind
of
kind
of
works
we're
going
to
want
to
make
sure
that
it
continues
working,
which
means,
I
think
we
need
to
select
a
toothology
test
to
start
with.
A
Sure, that's my favorite debug mechanism ever. And not only should we support debug_ms=1 and turn that on in Crimson, it should have the same format: every character should be identical, so that any grepping is as close as possible to classic, unless there's a good reason to change it.
B
Okay, I see. Because, you know, there are some fundamental differences between the logging infrastructure in classic and in Seastar, even the number of verbosity levels: in classic we have at least 20, even more, while in Seastar there are far fewer.
A
So I think there's a pretty simple solution to that: debug_ms, or debug-whatever, equal to 1 gets us error (it'll have a special case for the messenger), equal to 10 or 20 would map to debug, and 30 would be trace. That's the convention we've typically used in the OSD anyway.
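As a rough illustration of the convention just described, here is a minimal sketch of mapping a classic debug verbosity onto Seastar's log-level names. The function name and the treatment of intermediate levels are assumptions for illustration, not Crimson's actual code:

```python
def to_seastar_level(classic_level: int) -> str:
    """Map a classic debug_<subsys> verbosity to a Seastar log level name,
    following the convention above: 1 -> error, 10 or 20 -> debug, 30 -> trace."""
    if classic_level >= 30:
        return "trace"
    if classic_level >= 10:
        return "debug"
    if classic_level >= 2:
        return "info"   # assumption: the talk doesn't pin down levels 2..9
    return "error"
```

Under this sketch, debug_ms=1 would emit only errors, while debug_ms=30 would enable full trace output.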
D
Hey Sam, so I have one question about the Crimson design: do we have any consideration to cover the QLC case? I know from the early design we can see the consideration for ZNS, but how about QLC, which may be similar but still has some differences compared to ZNS?
A
You're saying QLC as in, yes, more dense than TLC? My strategy for that is that a device is either fast enough to be treated as a random block manager or slow enough to be treated as a segment manager.
A
So
there
will,
it
will
be.
It
will
fall
into
the
same
bucket
as
zns,
but
we
can
have
different
back
ends
so
that
it
speaks
a
more
appropriate
command
set,
but
we'll
still
write
to
it
in
a
sequential
way.
Does
that
make
sense.
D
Yeah: if QLC is written sequentially, performance is good, but for random writes the performance is very bad, yeah.
D
Yeah, similar, yeah, thanks. So another question: if some customer wants to try out Crimson, as of today, what is our best suggestion?
D
You know, if some customers want to do some PoC on Crimson to understand the performance difference, how can they approach that? Some months ago we did see some performance benchmarks shared by Mark Nelson from the Red Hat team, but with some kind of hack to get that result. So the question is actually:
D
If some customers want to have an early try on Crimson, to compare what they can get with the latest stability and the latest code base, what is your suggestion?
A
Well, like I said, Quincy should support testing RBD on Crimson. It won't be particularly stable; that will improve over the course of this year. I would suggest that, if you have customers that want to try it, you do some testing yourself, help us work through whatever bugs show up in that environment, and then give them the instructions.
A
Does that make sense? (Yeah, yeah, got it. Thank you.) It should work mostly long enough to do benchmarking, probably. Hopefully it'll be more stable as we go through this year, which touches on a different topic I forgot to bring up, actually. Is that everything on that, though? Yeah.
D
Yeah, so, you know, I asked this before. We did get, you know, seven messages from customers: as the industry is moving to QLC instead of TLC, we did see more interest from the field in how Crimson can actually help the transition from TLC to QLC, I mean.
A
For what it's worth, I expect SeaStore's maturity to lag Crimson as a whole, so that'll be a little bit more difficult. But again, it should work now; it just won't be particularly stable, and it'll be changing rapidly this year, so any feedback you can get us during that time will be really helpful. Speaking of feedback during this year, there was another topic I meant to bring up: for classic OSD testing, the most recent release is usually fine for people, but for Crimson we really don't want people testing arbitrary builds.
D
Yeah, this sounds good, sounds good. If we can have some LKG, some good image with some instructions, then we can have more interactions with the customers who are willing to move from BlueStore to the Crimson stack.
A
Yeah, that would be good. You could also help us write up instructions and fix bugs too; that's another place new contributors could help. Anyway, that's something I'm going to try to push for with the wider community. I don't know if Greg and Joshua are here, and I don't know if you guys have an opinion, but I feel like we're not the only component that would benefit from this.
D
Yeah, and as you just shared, for Reef, for this release, our target is actually to get the RBD workloads working with Crimson, right?
A
All right, any other comments about this part? If not, I'm inclined to hand it off to Yingxin.
C
So it might be better to group extents of the same character and similar ages together, because they tend to, or we assume they will, have a similar probability of becoming dead. And from the paper, there is a cost-benefit garbage collection policy that will be more effective if we can achieve the goal of placing similar extents into the same segment for reclaim.
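The cost-benefit policy referred to here is the one from the log-structured file system literature: a segment is ranked by the free space a cleaning pass would recover, weighted by the age of its data, divided by the cost of reading the segment and rewriting its live fraction. A minimal sketch, with an illustrative Segment shape rather than SeaStore's actual types:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    utilization: float  # fraction u of still-live data in the segment, 0.0..1.0
    age: float          # age of the youngest live data in the segment

def cost_benefit(seg: Segment) -> float:
    """Classic LFS cost-benefit score: (free space gained * age) / cost.

    Reading the segment and rewriting the live fraction u costs 1 + u;
    cleaning frees 1 - u of the segment's space.
    """
    u = seg.utilization
    return (1.0 - u) * seg.age / (1.0 + u)

def pick_victim(segments: list[Segment]) -> Segment:
    # Clean the segment with the highest benefit-to-cost ratio.
    return max(segments, key=cost_benefit)
```

Under this scoring, an old, mostly-dead segment is cleaned before a young, half-live one, which is why grouping similar-aged extents into the same segment helps the policy.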
C
The first part is that we can already split hot and cold extents into different groups of segments; it is just a configuration and we can enable it, so that hot and cold extents will be placed into different segments. The second part is that I think we can further introduce generations to segments, so that a generation number of n means that the extents contained in that segment have been reclaimed n times, which in turn means that a single segment stores extents with similar ages.
C
Because
that's
a
similar
right,
I
think
it
will
imply
that
it
has
a
similar
age.
So
that's
that's.
This
is
a
picture
that
this,
for
example,
that
if
we
have
three
generations
and
the
two
extend
characters,
so
we
can.
We
will
have
six
groups
of
segments
so
so
so
that
each
group
will
contain
the
specific
character
of
of
either
hot
or
cold
or
the
generation
0
or
102,
so
that
each
segment
will
have
similar
characters
to
be
rewritten
or
to
to
be
reclaimed.
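The six-group layout described above (two temperature classes times three generations) can be sketched as follows. The names are illustrative, not SeaStore's actual code; the promotion rule on reclaim is the one just described:

```python
from enum import Enum

class Temp(Enum):
    HOT = 0
    COLD = 1

MAX_GENERATION = 2  # three generations: 0, 1, 2

def segment_group(temp: Temp, generation: int) -> tuple:
    """Key identifying which group of segments an extent is written to."""
    return (temp, min(generation, MAX_GENERATION))

def on_reclaim(temp: Temp, generation: int) -> tuple:
    """When a live extent survives a GC pass, it moves to the next generation,
    capped at the oldest one."""
    return segment_group(temp, generation + 1)

# Two temperatures x three generations -> six distinct segment groups.
groups = {segment_group(t, g) for t in Temp for g in range(MAX_GENERATION + 1)}
```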
A
Yeah, I think that's where I'm getting confused. I interpreted this as: we begin with a heuristic prediction as to whether we expect an extent to be frequently modified or not, and then we adjust later if we're wrong, but...
A
I guess I'm not sure of the advantage of splitting it like this. It seems more like, if we predict that metadata segments will be more frequently rewritten, they'll naturally end up grouped nearer generation zero.
A
They'll be cold; they'll be infrequently mutated.
C
I think the current problem is that when we rewrite extents, we tend to mix extents from segments of all different ages together. So if we only write extents of generation one into generation two, the generation-two segments will have a similar age. That's the goal: to isolate the extents of different ages.
A
It eliminates this strict division between extent types, which gives the GC more freedom. I'm not sure, but I wanted to put that out as a suggestion, basically.
A
That could be written into a pluggable garbage collection strategy, that's true.
A
Cool
we
may
want
to
retain
the
time
stamps
for
a
little
while,
actually,
I
think
I
think
we
should
retain
at
least
a
compile-time
flag
to
enable
them,
because
after
we've
run
a
benchmark,
it
would
be
nice
to
be
able
to
dump
a
a
histogram
of
the
ages
of
the
extents
in
all
of
the
generations.
That
will
be
one
way
we
can
evaluate
the
behavior
of
this
algorithm.
C
So the further idea is that, if we have the generations, then we can have multiple tiers of devices.
C
The concept of tiers is that each tier has a group of one or more devices of the same type, so it has the same backend implementation, and a higher tier generally has a larger disk size and lower cost, but lower performance.
C
So the generations will help us place different data into different tiers, because a larger generation number also means that the data has become more cold and doesn't require frequent access, so it can be placed into a higher tier, and each tier has a different character of access patterns.
C
So we can use different types of devices: we can give tier zero the more performant and more expensive devices, and configure the higher tiers to be larger and...
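A purely illustrative way to read the tiering idea (the tier names and boundaries here are assumptions, not SeaStore's actual configuration) is a map from an extent's generation to a device tier, with colder generations landing on larger, cheaper devices:

```python
# Hypothetical sketch: route segment generations to device tiers.
# Tier 0 is fast and expensive; higher tiers are larger and cheaper.
TIERS = ["fast-nvme", "sata-ssd", "qlc-ssd"]  # assumed three-tier example

def tier_for_generation(generation: int, boundaries=(1, 3)) -> str:
    """Generations below boundaries[0] stay on tier 0, those below
    boundaries[1] go to tier 1, and anything colder falls to the last tier."""
    for tier, bound in enumerate(boundaries):
        if generation < bound:
            return TIERS[tier]
    return TIERS[-1]
```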
A
Yeah, I think this is absolutely, or generally speaking, the way to go. I think this looks great. Okay.