From YouTube: Ceph Developer Monthly 2021-11-03
A: All right, I've added the agenda in the chat. I'm guessing it's a holiday in a lot of places, so we won't get too many people, but I guess we can kick it off. Hello and welcome, everyone, to CDM for the month of November. We have three topics on the agenda, and I'm already sharing the screen, I guess, so we can just kick it off with the Crimson update, if that's okay.
B: Yep, sorry about that. All right — this will be short; it's just to catch you up on some of the more recent Crimson developments. To get right into it: the main theme for Crimson as a whole has been stability.

B: Radek has been working tirelessly on bug squashing in the crimson-rados teuthology suite, so we're up to a limited selection of the thrashing and failure-injection tests that we've all come to know and love in teuthology. That has driven a lot of stabilization work, including some changes to the way the watch/notify API works in Crimson, as well as refinements to the op-ordering logic and an absolute slew of other fixes.
B: For SeaStore, a lot of the recent work from Intel has been adding metrics. Yingxin and Chunmei have helpfully added an absolute ton of counters and histograms for things like transaction conflict rate and allocation information.

B: All of this has proved instrumental in fighting some performance issues. One turned out to be that the allocation hints weren't working for omap, so everything was being allocated starting at address zero, which made every single allocation a linear search over every address allocated so far.
B: It was quite quick to track down thanks to the counters, so that's going well for SeaStore. On the details: there's been a lot of work on bug fixes and stability, as usual. The LBA allocation pathway got basically rewritten; my first stab at it had a couple of logic problems that were really annoying to fix, so it's cleaner now, and it's got an iterator-based interface that makes it easier to reason about.
B: We've implemented LBA hinting throughout all the major components, so that in general, when we allocate a new address in LBA space, our guess at a free address will tend to be correct. That means we can usually do the allocation immediately, without having to skip over already-allocated addresses, which dramatically reduces the conflict rate, since it spreads update transactions over the LBA space and groups them by PG.
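(A hypothetical sketch of what hint-driven allocation buys — not SeaStore's actual data structures: the allocator starts its free-space search at a caller-supplied hint, so a per-PG hint that is usually right makes allocation effectively immediate, while a hint stuck at zero, the omap bug mentioned earlier, degenerates into walking every extent allocated so far.)

```cpp
#include <cstdint>
#include <map>
#include <optional>

// Toy first-fit allocator over LBA space; `allocated` maps start -> length.
struct lba_allocator_t {
  std::map<uint64_t, uint64_t> allocated;

  std::optional<uint64_t> alloc(uint64_t hint, uint64_t len, uint64_t lba_end) {
    uint64_t candidate = hint;
    auto next = allocated.lower_bound(candidate);
    // If the extent before the candidate overlaps it, start after that extent.
    if (next != allocated.begin()) {
      auto prev = std::prev(next);
      if (prev->first + prev->second > candidate)
        candidate = prev->first + prev->second;
    }
    // Skip forward over allocated extents until a gap of `len` opens up.
    // With a good hint this loop rarely runs; with hint == 0 it ends up
    // scanning everything allocated so far.
    while (next != allocated.end() && next->first < candidate + len) {
      candidate = next->first + next->second;
      ++next;
    }
    if (candidate + len > lba_end)
      return std::nullopt;                     // out of LBA space
    allocated.emplace(candidate, len);
    return candidate;
  }
};
```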
B: Let's see — we recently merged the extent placement manager from Xuehan, which was an initial step toward supporting tiering within SeaStore. This moves the job of allocating extents into a new extent placement manager component, which knows about things other than the journal. Immediately, that means we can do allocations to segments that aren't the journal segment, and Xuehan is using that capability to implement an age-based scheme from Sprite LFS — a bit like... I'm forgetting the name.

B: A bit like that famous flash file system whose name I can't remember. Anyway, an aging system somewhat like that. But it will also be instrumental in allowing us to write extents to devices that aren't the journal device.
B: So this will allow us to do garbage collection from the hot device back to, for instance, a ZNS device. I'll skip to the bottom part: the other piece of this is multi-device support, which means that the internal addressing in SeaStore has gained some bits that allow it to designate up to 16 different devices, so that a single SeaStore install can have multiple devices. The first PR for that is merged, so that bit of the data structure exists.
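(Sixteen devices only needs four bits of the internal address. The sketch below is a hypothetical encoding to illustrate the idea; SeaStore's actual paddr layout differs.)

```cpp
#include <cstdint>

// Hypothetical multi-device physical address: the top 4 bits select one of up
// to 16 devices, the remaining bits are the offset within that device.
struct device_addr_t {
  static constexpr unsigned device_bits = 4;            // 2^4 = 16 devices
  static constexpr unsigned offset_bits = 64 - device_bits;
  uint64_t raw;

  static device_addr_t make(uint8_t device_id, uint64_t offset) {
    return {(uint64_t(device_id) << offset_bits) |
            (offset & ((uint64_t(1) << offset_bits) - 1))};
  }
  uint8_t  device() const { return raw >> offset_bits; }
  uint64_t offset() const { return raw & ((uint64_t(1) << offset_bits) - 1); }
};
```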
B: There will be follow-on patches adding actual proper tiering heuristics, although I expect that will be rather a long journey. We also reworked the way conflict detection works. All the internal paths use interruptible futures now, so rather than having to iterate over the read set at commit time and check whether all the extents are still valid, when we invalidate an extent we have an intrusive linked list of all transactions currently referencing it, and we immediately mark all of those transactions conflicted.
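(A minimal illustration of that shape using Boost.Intrusive; the types and names here are made up, and Crimson's real transaction and extent code is considerably more involved.)

```cpp
#include <boost/intrusive/list.hpp>

namespace bi = boost::intrusive;

struct Transaction;

// One transaction's membership in an extent's list of readers.
struct read_set_item_t : bi::list_base_hook<> {
  Transaction* t = nullptr;
};

struct Transaction {
  bool conflicted = false;
  // In the real code, each extent read adds an item owned by the transaction.
};

struct CachedExtent {
  // Transactions currently holding this extent in their read set.
  bi::list<read_set_item_t> readers;

  // On invalidation, conflict every reader immediately -- no commit-time scan
  // of each transaction's whole read set is needed.
  void invalidate() {
    for (auto& item : readers) {
      item.t->conflicted = true;
    }
    readers.clear();
  }
};
```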
C: Can I ask a really quick question about the conflicts?

B: Absolutely.

C: This is something I think you've explained to me before, but I've forgotten — can you remind us why the conflict detection is necessary?
B: There's no locking at all right now — that's the win. The win is that there's no pessimistic locking, so in theory this permits concurrency in cases where we otherwise wouldn't be able to permit it. If we had to lock on the way down, we would have to be really careful; we'd need one of those schemes where you read-lock on the way down, figure out what you need to do, and then take write locks back down. That's possible, but complicated.
B: We can fall back to that scheme if we need to, and likely we will. But the fact that we allow retries in the first place means we can use a locking scheme with conflict detection and back-off — in other words, a scheme where we permit the possibility of a deadlock but do deadlock detection, instead of needing to do deadlock avoidance.
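(The retry loop that makes optimistic conflict detection safe is conceptually just a restart. Crimson drives this through interruptible futures rather than the plain hypothetical loop sketched here.)

```cpp
struct Transaction {
  bool conflicted = false;
  void reset() { conflicted = false; }   // drop read set, start over
};

// Hypothetical sketch: run the transaction body, and if another transaction
// invalidated something we read, discard the work and retry.  Because retries
// are always possible, ordering problems become restarts rather than hangs.
template <typename Body>
void run_transaction(Transaction& t, Body body) {
  while (true) {
    body(t);             // read/mutate extents; may set t.conflicted
    if (!t.conflicted)
      return;            // the attempt went through cleanly
    t.reset();           // conflicted: throw it away and try again
  }
}
```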
C: So the reads are independent — there's no read set needed, I guess, in these transactions, because there's already isolation? Is that the idea?

B: Maybe I'm misunderstanding the question, but — yeah. In the event of a read we don't, strictly speaking, need to remember the read set, and we don't, strictly speaking, need to validate it, since most reads are point reads. But if you did want to do a transactional read of multiple points in the tree and get a consistent answer, this does support that as well.

B: So far I've been trying to get as much as I can out of one really simple mechanism. We can add faster degenerate cases later as needed, but fewer code paths means less review.
B: Let's see — this is the last page. Okay, so tiering, like I said, is a next piece. I think Xuehan and the folks at Micron — no, Samsung, yes — are also working on this, from two different angles. Xuehan is looking at it for tiering back to hard disks; the Samsung folks are looking at it as a way to use faster true-random-access storage, either as a faster tier or as the sole tier, depending.

B: I think the faster-tier case is the vastly more likely one as ZNS becomes more common, because with some of those devices we really will need a faster tier, and one of these fast random-access devices will be good for that — that is, things with byte-level updates, like Optane or, in the limiting case, persistent memory.
B: Another major piece is ZNS enablement; Joseph is working on this. We're actually getting pretty close — he's got most of the interfaces implemented. Mostly, I think, he's making sure he understands what all of the relevant ioctls actually do. So, getting close. SeaStar itself already had the code —

B: A Samsung engineer already contributed the code to SeaStar itself to enable the relevant ioctls, so passing the I/O commands through is possible; it's just a matter of wiring it up to the segment manager.

B: And finally, the last piece is adding a teuthology test for Crimson with SeaStore. Chunmei has actually picked this up, and she's made a lot of progress.
C: Has cephadm support landed for this, since that's what we've been recommending?
B: The thinking so far has been that it's just different containers, and the binary in the container is just called ceph-osd. But I agree with you — I think it would be better if there were a well-known scheme where every package had both, so we had a classic-osd, a crimson-osd, and a ceph-osd that is whatever the current default is.
A: Awesome. So I guess the next usability piece is cephadm, and maybe this is an opportunity to add teuthology tests that are directly based on cephadm as well, in the crimson-rados suite.

A: And I was just thinking about one more aspect. In terms of the OSD ops piece, we've got recovery and backfill figured out, we've got peering figured out, there's basically scrubbing, which we've already started working on, and the snapshots you just talked about. Those are basically what we're looking at, right?
C: My brain is still stuck on the container thing. It seems like it would be simpler if we changed the way the makefile is structured, so that when you do a normal build it builds both all the regular stuff and the Crimson stuff in one pass. Then, when it gets to the packaging stage, it can — I don't know what we want to do for packages, whether there's going to be a separate crimson package or not — but it could do it all at once and then produce a container image at the end, as opposed to having a different define that you flip at the front end and a totally different job to do it.
A: Okay. The other thing — I was talking to, I think, Mark about this — the perf CI that we had for Crimson PRs, it's still functional, right? I haven't looked at it recently.

A: Okay, okay, yeah. The reason I'm asking is that I wanted to revamp that a little bit, and not just for Crimson but for most PRs. We're probably going to discuss that at the performance meeting tomorrow; we have a person who has some experience building performance automation tools and so on, so maybe we can discuss it with them.
D: Oh, sure. I didn't prepare much — sorry, David — and I've not gotten as far as I'd hoped to when we scheduled this, because some things came up. So instead of some big progress update it's mostly just a pile of questions and some goals. I'm trying to work out how we want to do multi-core Crimson. Let's see — the obvious thing is that we have a messenger, we can bring in messages, and we can send those messages to PGs, which can be on different cores, and the PGs have work to do.

D: I've talked about this with Sam, and at some point it would be very nice if we could just tell Crimson that we want to use six cores, or however many cores, and have the clients direct messages to the right core — which is just listening on different ports or something — and those go to the PGs that live on their cores, and those go to object store state that lives on their cores.
D: But that requires clients that know what's going on, and changing the client side, and changing the OSD map structures and inflating them somehow — and we're always going to have old clients, presumably, who don't know about that, so we can't rely on it.

D: So the first thing I want to do is get it so that we can just have a front-end messenger that parcels work out to PGs that live on different cores, and then those PGs can funnel down to the appropriate object store, which initially will again just be back on one core.

D: So it probably won't actually be much faster to begin with, but it will at least demonstrate that the cross-core communication functionality works — and I mean, these are the same sorts of crossbars that we have in the classic OSD.
D: But they look a little different. When you submit a client request in Crimson, the operation actually sort of starts itself on the OSD. So I'm exploring how we can handle the memory allocation better, and whether it's possible to decode the operation only enough to know which core it needs to go to, and then have that core decode it into its local memory, so we don't have as much cross-core memory traffic happening.

D: Luckily, it looks like allocation isn't going to be a problem. One of my concerns was that if we have operations crossing cores all the time, it'll fragment the allocator pretty badly, but ScyllaDB does not seem to be concerned about that: they just generate — function pointers, sorry, foreign pointers — for literally every message that comes in before they direct it down.
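(For readers unfamiliar with the Seastar idiom being referenced: a minimal sketch of handing a decoded message to the core that owns its PG might look like the following. The Message type and shard_for_pg mapping are hypothetical; Crimson's real dispatch path is more elaborate.)

```cpp
#include <memory>
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>
#include <seastar/core/sharded.hh>   // seastar::foreign_ptr / make_foreign

struct Message {
  unsigned pg_id;   // plus the decoded client request (hypothetical)
};

unsigned shard_for_pg(const Message& m) {
  return m.pg_id % seastar::smp::count;   // hypothetical placement policy
}

seastar::future<> dispatch(std::unique_ptr<Message> m) {
  unsigned target = shard_for_pg(*m);
  // Wrap the message so it is freed back on the shard that allocated it,
  // then run the handler on the shard that owns the PG.
  auto fm = seastar::make_foreign(std::move(m));
  return seastar::smp::submit_to(target,
      [fm = std::move(fm)]() mutable {
        // handle *fm on the PG's home shard
        return seastar::make_ready_future<>();
      });
}
```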
D: So that seems like it'll be okay. But then we spread the PGs across cores, and once you're within a PG, that's logically pretty simple — it's just about having the right interfaces.

B: It gets more complicated when you want to be able to support split and merge and dynamic assignment of PGs, which we do. I have some thoughts about that, but it's pretty complicated. Basically, all of the data structures need to be ordered in a way that allows us to grab a subtree corresponding to a PG and stick it on another core, which has a bunch of knock-on properties.
B: I'm not sure that's the interesting metric, actually. I have this sort of model for flash devices in my head, which is that they have however much parallelism they have and take up however much bandwidth they take up. So in an ideal world, if it takes four cores to saturate a device and you have 32 cores, you just have eight devices, and whether you run eight OSDs or a smaller number of OSDs will depend on what the most efficient way to set up SeaStore is.
B: The problem is the metadata and allocation structures, as well as garbage collection. You ideally want each of these cores to be able to do local garbage collection, and that's the hard part. The I/Os themselves aren't really a problem, but accessing an onode tree that's held in common by all cores is much harder, because it means it's being mutated by multiple concurrent journal segments, which means ordering is complicated.

B: You can extract arbitrary subtrees and paste them into a different core's tree. The only hard part is that some of the data will exist in segments that are ostensibly owned by another core, but that's not a huge deal — you just track which segments are owned by whom. So at a hand-waving level I think it's very solvable; it's just work. In the meantime, we can do the static thing.
C: I mean, it makes sense to me; that's good. I guess the part that makes me nervous is the interim static mapping thing — just making sure that it's going to be a step in the right direction and not a cul-de-sac.
B: Oh, I mean, that was just going to instantiate them — like, when you mkfs the OSD, you tell it you want five cores, it creates five SeaStores, and it writes down on the first one which PG gets mapped to which one. That's as far as I was going to take that one. The allocator part, where you need to be able to allocate big chunks of the device to the different ones — that's step zero of supporting it for real, so I don't think that would be wasted work either.

A: Okay. Greg, did you have anything else, or...?
E: Sure — hey, everyone. So yeah, the first topic is the re-opt-in flow. We're trying to solve an issue with the current design: right now, when users opt into telemetry, they basically opt into a version of telemetry, which is currently three, and of course they need to explicitly agree to send the data.

E: And if we want to introduce collections of new data, we need to bump that version, which means the user needs to re-opt-in — and if they don't, they stop sharing the reports with us, which is not ideal.

E: We had all sorts of ideas about how to handle it, but eventually the design that made the most sense to me was to have a collection, or metrics, data structure on the module, which gets written to the database once the user opts in.
E: Only if the channel is turned on — that part stays the same. This new design basically allows us to backport changes easily and to introduce new collections to existing channels, and we can show users a preview of the report they are already opted into and a preview of the next version, with a telemetry show and a telemetry show-next, because we will have whatever is already opted into in the database and whatever exists in the module. So yeah.
E: We discussed this at the previous CDM and since then in our telemetry huddles, so I just wanted to give an update on that.

A: Yeah. I think we all agree the idea is a good one, given that we want to add more information to channels that people have already opted into. So, for a user — you said the current telemetry version is three. What happens when Quincy comes out? Will they continue to use three, and will three have some extra data they'll be sending? Is there a step where they seamlessly do this right after upgrading?
E: Yes. So there's one step after upgrading: if the user is already opted in to three, we will keep that as the legacy collection and continue sending those metrics, and in case they re-opt in, we will sync with whatever is available in the module at that time — the diff right now being the perf channel.

E: So we will just add that in case they re-opt in. If the user is opted in to version two, we will not update those metrics in their database, because they need to re-opt into three regardless. So that's the design, yeah. Okay.

E: Yes — I tried to implement the previous idea and it was too cumbersome, so I feel this is a lot more elegant. So, absolutely.

E: Cool. Okay, so I guess we can move to the next topic, unless anybody has questions.
E: Yeah, okay. So, I'm glad Greg is on the call, because you had some concerns about the ideas for collections that we had. I just wanted to say that I updated the CDM doc with an etherpad that Laura prepared for today; she compiled a lot of notes we had from all sorts of etherpads and Google docs, etc., which is awesome — so thanks, Laura.

E: So I was curious, Greg, if you can tell us what exactly your concerns are about the collections. Specifically, we mentioned the OSD epochs, and in the next version that we're going to have for Quincy we will be collecting a lot of performance metrics.

E: I can share this etherpad; it has an example telemetry report.

E: And that's because when we talked with Mark and Adam, they said it made more sense to have better resolution on these metrics.

E: And of course, the idea here is to have these changes run on the LRC, review those metrics, and see if it makes sense to collect metrics both separately and aggregated at the same time. But I'm very curious about your concerns about the collections, so if you can share your thoughts, Greg, that'd be great, please.
D: Oh — I remember mentioning that it sounded more intrusive than what we currently collect, but I don't remember if I had any specific things. Part of it is just that the more we collect, the more carefully we need to justify it to people and make sure that it can't be de-anon— how do you say the word — that you can't de-anonymize it.

D: And there's one little thing here that I remember having. Mostly what we've collected so far is about, you know, the versions of things and the parameters of stuff that we mostly set by default, right? The sort of thing most likely to be in there is okay to share.

D: So we like to talk about the number of pools and things — I guess that's what we collect right now — whereas once we start collecting the epoch generation, or the rate at which epochs are generated, then we're getting into knowing how quickly a person's cluster changes and how healthy it probably is.

D: And there are all kinds of reasons that that is incredibly useful information for us as upstream developers to have — we can tell what state people's clusters are usually in and what sort of situations we should be targeting. But it also means that it's a lot. It's probably like —

D: If anyone ever figures out who owns a particular cluster, then they have a lot more information about what that person's deployment looks like, and if we're collecting performance statistics at a per-PG level, then anyone who runs a proprietary RGW cluster is definitely not going to share that data with us, because they're not going to want people to be able to see what they're supporting for their external users or their customers or whatever.
C: What worries me is that when you look at the perf dump, it looks like there's just a pg dump included in there — everything and the kitchen sink is sort of dumped in just because it's convenient. I think we should flip it around a little bit, so that instead of saying "here's all this data, is there anything we don't want?", we think of it the opposite way and ask what it is, specifically, that we want to learn, and what the clearest way to include that information is. So, for example, if we care about how healthy the cluster is, don't say "that's why we want the epochs, and we can infer it from that"; instead, have a section that explicitly describes the information we want to know about how healthy the cluster is.

C: We should have it so that for every given section or field, you should be able to write a sentence or a paragraph that says what this is, why we include it, and why it shouldn't worry you that it's included — something like that, you know what I mean? Or why it should worry you, but why we'd really appreciate it if you would let us have it anyway.
C: Yeah, I mean, I feel like it'd be easy to go back and write that for all the existing stuff that's in telemetry, right? And if you then go to, say, the OSD perf histograms — this is a section that was specifically constructed for the purpose here — you could write something that says: here we want to see what kind of latencies users are experiencing and how big their I/Os are. You can write that, and you go: great, that's it.

C: I can understand why developers would want to know that, and I can see that this information is what that is, right? But then if you go further down in the dump, you see the pg dumps, and you're like... Or maybe the perf counters are a little bit tricky, because you have to think about what they are actually measuring, so those might be a little bit harder to justify.

C: But if you go down further, you just see the pg dump, and it's every field in those structures. Most of it isn't individually useful, but there are also things like timestamps with nanosecond precision — there's no reason why we need to know that, right? And so it should be —
A: I think that's where we're getting to with trying this out: this code is not going to go out without us trying it and then creating the summary that we really want. That's the whole point of installing it on the LRC — this is just the raw dump. We also don't know what we want yet, to be very honest, so first we look at realistic data, we figure out what that good summary looks like, and we start with some subset of it.

A: Yeah — correct me if I'm wrong, but I think this is what we agreed to: the idea of, first, let's just throw everything in and see which of it we can actually make sense of; then we as developers decide, okay, this is the summary that we want, and then we roll it out to users. And we need to be very cautious about, you know —
A: As I keep saying, we don't want to collect anything that we're not going to use or that we cannot make sense of later. So we need to answer that question right away, before we even ship it: okay, we are going to use this, in these ways or in this one particular way — what can it help us answer or infer? I guess those questions will be answered. But going back to Yuri's concern about security and what's safe and not safe —

A: I think if we do this due diligence of looking at each and every field, we are being very, very cautious, in my opinion.
E: Yes, that's the purpose behind our best effort with a run on the LRC — to see what we can infer from everything collected. Of course, we cannot foresee the future and think of every metric that will eventually become useful, but maybe it's better to err on the safe side and collect less, and then —
A: Yeah, and to be honest, there's no perfect answer for this; it has to be incremental. We learn from our own mistakes, and we learn what we did not collect that we need to collect next time — it's an iteration, in my mind. So as long as we're starting with something and we're doing due diligence on what we want to get in the first go, I think we should be good.

E: That's good. And yeah, Greg, I'll be very happy if eventually you review those metrics as well, if you feel like it.
C: One other thought: maybe another way to look at this is from the perspective of the operator who is trying to decide whether to turn this on, and so goes and looks at the telemetry dump.

C: Then it's very easy to imagine that there's information hidden in there, either deliberately or accidentally. Whereas if it's structured in a way that picks out exactly the high-level meaning we're trying to derive from it, and the fields are named appropriately — and it would be awesome if it could actually be commented; you can't really comment JSON, I guess — then it would be —
E: Would you do that at what level — at the collection level, or at the very basic key-value level?

B: I don't know — I guess it depends how much work it is. But just thinking off the top of my head, we could make some documents in doc/telemetry with an entry for each channel and a JSON description of what would be in it, and then we can update that as we update the versions. It actually has a side effect —
B: It just occurred to me that, depending on how much work this is, you could do something like the following: if the code internally, when it's projecting the information it's collecting onto the schema it's going to emit to the telemetry platform, kept versioned variants of that projection, then if someone has opted into version three and hasn't updated to version four, we could continue sending them the version-three form.
E: Yeah, yes — we will send version three, and basically after version three we will not have a version four; it's like a virtual version four. Whatever we add to that collection data structure on the module will be that version. So yeah — but then, of course, we will add the explanation for these metrics as well.

E: Yeah, there's a description field for each of these, but it's only at a high level, per collection.
C: I was thinking it would be cool if the prose that describes what we're collecting and why could be embedded in the code itself as comments, annotated in such a way that it specifies where in the eventual JSON structure it shows up, so that later, when you generate the JSON, you could generate an annotated version of it that includes those same comments. There's probably a lot of wiring together to make that work.
E: Yes, yes — I did think of that, at the collection level.

E: So, for example, if we now add the perf channel, we can add the initial perf collection, which right now is multiple keys on the root of the report — like stats per PG, and OSD histograms, and so on. That basically is the initial collection for the perf channel, and if the user opts into that, we have a general description for it.

E: But that's not great resolution — it's not very detailed. So we can break them down into each one of these collections, have a description for each one of them, and then —
E: When we show the diff, we can show the descriptions for the differences between the versions, and if you want to show the actual metrics we can print the actual report. But you're saying that it should be a report which is annotated, if I understand correctly?
C: I guess the best case would be that the raw JSON was structured in such a way that you could intuitively see, by the naming of the elements, what it is — but I have a feeling that's asking a lot. It'd be nice if JSON supported comments, but it does not, although maybe there could be a variation of it that adds comments — though then we'd be inventing stuff. And then, separately, it'd be nice if somewhere we actually tried to write down why we're collecting each thing and gave some justification. Maybe the source code is the most reasonable place to put that.
E: Yes, that was my concern as well — that people could be cynical when they see the huge amount of perf counters we're collecting, and might be too paranoid to share it, even though it's all anonymized and we're doing it to improve the quality of the product. But —

E: Yeah, yeah — I'm thinking.
A: Yeah — the more I think about this, the more I think that the sooner we try to install the first version of this on the LRC, the more answers we'll start to get, and better direction, I feel. So let's do that, maybe as soon as possible. I already saw that Laura has a backport ready, so we can figure out whenever she's ready to go ahead with the install.
E: Yeah, I think there are maybe a couple more PRs — maybe one or two, probably about RocksDB counters — but yeah, I guess we can start with something, absolutely, and we can do further upgrades.
A: Right, that's okay, yeah. The only thing — just talking about the specifics — we will not be shipping the code in a Pacific point release; we're clear about that, right? We're just backporting it, installing it on the LRC, and then rolling it back, or likely just letting it stay on the LRC or something like that, just for experimentation purposes.

E: That's a good question. Initially we said that we're not going to backport the perf channel.
A: Yeah — and I don't see a reason for us to backport the perf channel into any point release; it should go into a major release, which is Quincy, which is fine. So I think the easiest way to go about this is to just pick a time to merge it — or, you know, merge it in Pacific, install it, then maybe revert it and install the new version. That can be one way to go about it.
E: Yes. One of the ideas behind switching to the new re-opt-in design was that we wanted to make sure we can backport collection of new metrics as well, because otherwise we'll have outliers in the data — for example, if we collect a new metric...

E: ...just in Quincy and we want to compare it to other versions, it won't be there, because we never backported it. So it's either: as long as we remember that it doesn't exist in older clusters, there are no outliers in that sense, or we decide to eventually backport new metrics. But of course we can decide on that later.
E: So, Laura had some questions about collections of new metrics — some we've talked about already — but I don't know how long the other topic will take, and it's late, so maybe we can come back to them, whatever works.

E: Yeah. One question was about how we detect availability in the cluster — there's that Trello card that was created for it.

E: And she had a couple of questions. First, how do we aggregate such metrics over time, and do we need raw values, or can we just report the information at a higher level?
A: Yeah, I think this goes back to Sage's earlier point about not capturing every PG's state, but just seeing what the distribution is — and then the next level is what the distribution is over time. So if we create a summary of the total number of PGs and how many PGs are active+clean over what amount of time, that will give us a rough idea of what the availability of the entire cluster is.

A: I think we need to go into the specifics of how that aggregation would work — how we want to do it, over how much time, etc. That logic obviously doesn't exist yet, but I think it's useful not just for telemetry; it might even belong in something like a cluster report. At the moment, we don't have a good picture of what the availability has been over time.
E: Okay, so yeah — and that would take some more infrastructure work.
A: Yeah — and just to add to this, I think you're already aware that Prashant has been doing some aggregation of even the slow ops. Historically we've been logging every slow op that we've ever had in the cluster, with all the problems associated with that.

A: What he's been trying to do is summarize the slow ops over a period of time, so that we don't log every one, but we have a distribution of what kinds of slow ops there are in the cluster, and even in which host, which rack, which OSD. That kind of summary will go into the cluster log. So this kind of ties in with that, I feel. We can discuss it in a separate meeting and see how we can make progress on it.
E: Yeah, so I think we can go on and discuss the other issues offline, if that's fine.
A: All right, so I think the final topic is notifications about critical issues. The motivation is that, over the years, we've had these critical bugs — as in data-corruption issues and things that can bring down your entire cluster — that we try to let people know about by means of mailing lists or docs, etc. But there was a good proposal that came up in the Ceph leadership call about having a health warning that could alert users if they are exposed to any kind of critical bug.
A: We'd have this kind of JSON with a mapping of versions to critical issues, and then a health warning would be generated if a user is running a particular affected version. That's the general idea. But I saw another comment from Ernesto, who was talking about a manager module called feedback.
A: From his comment, what I'm understanding is basically that it's going to use Redmine, and whatever we flag in Redmine is going to raise a health error or health warning or whatever, based on what criticality the Redmine issue has. I guess we'd have to have a really severe kind of level for that — currently we have, what, "urgent"? A lot of bugs are urgent but probably don't matter to users.

A: We'd have to have something specific in Redmine if we want to use this model.
B: Yeah, this shouldn't be an automatic thing based on a bug simply having a particular severity set on it. Probably the lead of the component should need to mark it as flaggable, but the rest of it I like. We've had occasional BlueStore problems where it's very important that you know this is a thing before you try to upgrade. I'm a little worried about disallowing installation of marked versions, though, because it'll break automation; the severity of the bug may not be higher than the danger of being unable to introduce a replacement OSD.
C: Yeah, this reminds me of that feature we recently added to cephadm where you can do "ceph orch upgrade ls", and it will list all the available versions in the container registry that you could upgrade to. The current implementation looks at the tags in the registry and tells you what versions or tags you can use, but the original idea for it was actually to look at this releases.yaml file, which we already have in the tree.
C: Something that almost combines the two: the cluster could reach out and look at this releases.yaml file, and this could be where we would mark, for example, that a release is toxic, or that it has a particular label or flag on it that says it has a data-corruption bug, or whatever annotations we decide make sense. Then, from the cluster, if you list available versions, it'll show that information: these are the available versions you can upgrade to, and these are the flags on them — which could, for example, say don't install this one. It could even say what the end of life is, although for a point release of a stable series maybe that wouldn't make sense, but it could have when it was released. I'm not quite sure what other information we'd put in there, but —
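(Mechanically, the check being proposed is small. The sketch below is hypothetical — neither the releases.yaml schema nor any health-check wording has been agreed — and only illustrates flagging a running version that carries a critical annotation.)

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical flags parsed from a releases.yaml-style file, keyed by version.
struct release_info_t {
  std::vector<std::string> flags;   // e.g. {"data-corruption", "do-not-install"}
};

using release_db_t = std::map<std::string, release_info_t>;

// Emit a warning if the running version carries any critical flag.
void check_running_version(const release_db_t& db, const std::string& running) {
  auto it = db.find(running);
  if (it == db.end())
    return;                                    // nothing known about this version
  for (const auto& flag : it->second.flags) {
    std::cout << "HEALTH_WARN: version " << running
              << " is flagged: " << flag << "\n";
  }
}
```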
C: For example, there was a similar idea we had a while back where, if your cluster has seen certain crashes that are referenced in the tracker and linked as fixed in a particular release, it could tell you that you've experienced a crash that is fixed in a subsequent release, and maybe you should upgrade.

C: That file would also have all the other metadata about releases that we can gather, so there's one central place for it, and by default the cluster can go pull it from git or wherever. If you're on a private network, you can mirror it locally, or you could configure some other location to pull it from.
A: Okay, okay — I think we all agree it's a good idea. We just need to write it down somewhere; maybe I'll update the tracker with what we discussed and get somebody to work on it.
A: Any more thoughts on this? Yaarit, does this tie into the telemetry story at all? Because a lot of these issues may not have crashes, or not one particular crash — that's one of the problems, that we can't always collate all the crashes and say, okay, it's this bug. But in some cases we can. For example, for the first data-corruption bug there were multiple symptoms.

A: One of them was a crash, which I at least found in telemetry, which was good, and there was only one cluster experiencing a lot of those, so I guess the impact wasn't too bad — at least for that crash. But yeah.
E: Yeah, so my thought on that was that we have the crash signature on the client side, which is like the v1 version of the signature, and —

E: — in order to understand that it's the same crash that was reported through telemetry, or by anyone in the tracker, it might be problematic, because it might be a totally different signature; in telemetry, on the server side, we just recalculate a v2 signature.
E: That way we can group all of those v1 variants together.
E: So unless we have the raw data to calculate the second signature, we might miss that it's actually the same issue, and that's a problem that I see here. But in case it's the exact same signature — the v1 signature — we will be able to say, yes, for sure it's the same crash.

E: Once we have all the changes for the second signature on the server side, we will update the client side as well, and within a couple of major releases we should have the same signatures on both the client and the server side.
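(The gist of the v1/v2 distinction is that a crash signature is a hash over the stack frames, and a newer signature version can normalize the frames differently before hashing; without the raw frames, one cannot be converted into the other. The sketch below is a hypothetical illustration, not Ceph's actual signature algorithm.)

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical: v1 hashes the frames as reported; v2 strips the "+0x..."
// offsets first, so the same crash from differently-built binaries collapses
// to one signature.  Recomputing v2 requires the raw frames.
std::string normalize(const std::string& frame) {
  auto pos = frame.find('+');
  return pos == std::string::npos ? frame : frame.substr(0, pos);
}

size_t signature_v1(const std::vector<std::string>& frames) {
  std::string joined;
  for (const auto& f : frames) joined += f + '\n';
  return std::hash<std::string>{}(joined);
}

size_t signature_v2(const std::vector<std::string>& frames) {
  std::string joined;
  for (const auto& f : frames) joined += normalize(f) + '\n';
  return std::hash<std::string>{}(joined);
}
```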
A: All right, I think we are done with all the topics that were listed. Anybody have anything else?