From YouTube: CDS Reef: RADOS
Description
The Ceph Developer Summit for Reef is a series of planning meetings around the next release and some community planning.
Schedule: https://ceph.io/en/news/blog/2022/ceph-developer-summit-reef/
A
Hello and welcome, everyone, to CDS RADOS for the Reef release. Today we'll be talking about features and things that we want to do for the next release that is coming up, which is the Reef release.
A
There are topics that are already there in the etherpad. We will try to follow that order and just ensure that we are not taking too much time on any one topic, so that there is fairness across them. To kick it off, I think we are going to be starting with the telemetry topics, so I guess Yaarit and Laura, do you want to kick it off?
B
Sure, thanks Neha. So first, just a quick announcement that I'll be giving a walkthrough of our telemetry crashes, probably in a couple of weeks, so I encourage everyone to join. I put a link to the public dashboards on the etherpad. I'm not aware, I don't know, if everyone is aware of our public dashboards.
B
This is where we present aggregated data from our telemetry, so I encourage everyone to check it out. So yeah, let's kick it off with the list of subjects that we have. Is everybody...?
B
Perfect, yeah. This is for the general session, so we'll be looking at the one for telemetry. Sure.
B
Yeah, so the first topic that we want to cover is new metrics collection. Right now we collect data in five different channels. We have the basic channel, where we collect general information about the deployment; we have a device channel, where we collect health metrics, mostly SMART metrics; and then we have a crash channel, where we collect all the crash dumps that happened in the cluster.
B
Then there is data that is needed in order to identify Rook deployments. Apparently we do not have this data collected yet, and we started a discussion offline about it. But I just wanted to make sure: does anyone have any ideas about what else needs to be collected, specifically for Rook and, generally speaking, for other components as well?
B
I'll just say that for the Rook part, Blaine already said that they will probably need to add some flag indicating that the cluster was deployed via Rook, and this way we can fetch this information and collect it in telemetry. And Radek also suggested one configuration option that might indicate it is a Rook cluster.
B
This is on the etherpad under the first topic; it says ms_learn_addr_from_peer. Thanks, yeah.
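As a side note, here is a minimal sketch of how such a "deployed by Rook" signal could be probed from outside the telemetry module. The ms_learn_addr_from_peer option is only the heuristic mentioned above, and the explicit "deployed_by" key is invented for illustration; it does not exist today:

    #!/usr/bin/env python3
    # Rough sketch only: guess whether a cluster is Rook-managed from the
    # signals discussed above. Both signals are heuristics, and the
    # "deployed_by" option is hypothetical.
    import json
    import subprocess

    def ceph(*args):
        """Run a ceph CLI command and return stdout as text."""
        return subprocess.check_output(("ceph",) + args, text=True).strip()

    def looks_like_rook():
        # Heuristic from the discussion: Rook tends to disable this messenger option.
        learn_addr = ceph("config", "get", "mon", "ms_learn_addr_from_peer")
        # Hypothetical explicit flag an orchestrator could set one day.
        try:
            deployed_by = ceph("config", "get", "mgr", "mgr/telemetry/deployed_by")
        except subprocess.CalledProcessError:
            deployed_by = ""
        return deployed_by == "rook" or learn_addr == "false"

    if __name__ == "__main__":
        print(json.dumps({"rook_deployment_guess": looks_like_rook()}))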
B
So this is not super urgent and I don't want to take too much time from other topics, but if anyone has any new metrics, new data that we're not collecting in telemetry that you would like to see collected, please let us know. Just take into consideration that it cannot be anything that is user-defined; we're not collecting any sensitive information like hostnames, full names, or anything that can identify the user.
C
Well, maybe in the long run it would be worth considering introducing some kind of extra option that would be set by the deployment machinery to indicate who was responsible for deploying such a cluster, whether it was cephadm, whether it was Rook. You know, because ms_learn_addr_from_peer is far from being an explicit solution; it's a very implicit one. And if we really need such information, maybe we should, in the long term, not now, have something better.
D
And I think we can also talk to the Rook community about what they are exposing from their side, right? I mean, something we can figure out from upstream. We can showcase this, that this is what we have in telemetry, and in the community there is a Rook community meeting; we can present there and ask for their feedback, like whether there is any other way we can, you know, get adoption and publicize it more in the Rook community.
B
Right, sure, sure, this would be great. Thanks, I appreciate it. So if no one else has ideas (we're just really short on time), I think we can move on to the next topic. And please, if anyone has any ideas, please feel free to update the etherpad or just bring them to us offline. Laura, do you want to go to the next topic?
E
Hi everyone. So the next topic about telemetry is about collecting a data availability score and including this score in the basic channel.
E
A while ago Sage authored a Trello card, which is linked at the top there, about collecting or tracking data availability over time, and essentially we have this information in the ceph -s command, which provides a snapshot of availability.
E
But what we are interested in doing is collecting this information over time and tracking it over time, and telemetry can be used to do that, since we collect telemetry reports once every 24 hours. So we were thinking that it would be a good tool to collect this information, and the newly added perf channel (performance channel) already collects a lot of information that's available in pg dump, so we have access to the states that certain PGs were in, and when they were last active, when they were last peering, timestamps like that.
E
So there are several questions to answer with this information. Question one would be: how often, and in what quantity, has data been unavailable in the Ceph cluster? We were thinking that one way we could answer this question is by looking at the last-active timestamps on a cluster's PGs, tracking that, and sending it to the telemetry public dashboards, and we have several ideas of how to do that.
E
But that's one way we could decide data availability. Another question we're answering is: how does the frequency of an individual state relate to the cluster as a whole? In the telemetry report we have access to the states that PGs are in, and certain states are really indicative of unavailable data, such as anything that's not active, or anything that's incomplete or creating.
E
So we would be interested in collecting data about, you know, what states PGs are in daily, whether that indicates unavailability, and tracking this over time. The main concern we have with doing this in telemetry is that collecting this and calculating an unavailability score would only happen once every 24 hours, because that's the default configuration for telemetry reports; they're sent to the dashboards once every 24 hours. So this would really provide only a snapshot of data availability. But maybe that's enough.
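To make the idea concrete, here is a minimal sketch of the kind of daily calculation being described, assuming the JSON layout of ceph pg dump (a list of per-PG entries with a state string); the scoring rule itself is a placeholder, not an agreed design:

    #!/usr/bin/env python3
    # Illustrative only: a naive "availability score" derived from pg dump,
    # counting PGs whose state suggests unavailable data.
    import json
    import subprocess

    def pg_stats():
        out = subprocess.check_output(
            ["ceph", "pg", "dump", "pgs_brief", "--format", "json"], text=True)
        dump = json.loads(out)
        # Some releases wrap the list, others return it bare.
        return dump.get("pg_stats", dump) if isinstance(dump, dict) else dump

    def availability_score():
        pgs = pg_stats()
        if not pgs:
            return 1.0
        bad = sum(1 for pg in pgs
                  if "active" not in pg["state"]
                  or "incomplete" in pg["state"]
                  or "creating" in pg["state"])
        return 1.0 - bad / len(pgs)

    if __name__ == "__main__":
        print(f"share of PGs currently active: {availability_score():.4f}")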
E
Maybe that would still tell us something. And the main thing is also, in addition to tracking this in telemetry, we would also want this data availability score to be available to developers on their clusters, so we would consider maybe adding it to the ceph -s command or some kind of existing command; but essentially we would want developers to have access to this score.
E
But if you have any comments about the frequency with which this data should be collected, or if there are any data points that we haven't considered, you're welcome to add those comments on the etherpad, or just any questions you might have about what we're looking to collect here. That's it for that topic; I'll pause for any kind of comments before moving on to the next one. Oh, and Yaarit, if you wanted to add anything too.
B
Yeah, you covered everything. Just, yeah, that nuance about whether calculating it from a daily snapshot would be enough, or whether we really need better resolution for this data availability score.
E
Okay, if there aren't any questions: again, if you remember something that you wanted to ask and we've moved past it, just please write any comments you have; there's a comment section at the bottom of this etherpad, or, you know, wherever you end up putting it on the schedule. The last topic I'll cover is... oh.
F
When you say a snapshot, are you saying that you would only look at the data every 24 hours to calculate scores? Are you saying we would look at it every 24 hours, and whatever the data availability happened to be, that is all we would have visibility into? Because if we're going to miss every 15-minute outage, then yeah, that seems not very useful for what we're looking to understand about the clusters, and we definitely need more resolution than "it was down so long that we happened to catch it". Yeah.
F
That's good for an urgency of updates, but yeah, I don't know exactly what our options are in the manager or whatever, but this probably needs to be a push model rather than a polling model: like, noticing certain events happen, taking a notification from certain events happening in the cluster, and then being like, okay, something was gone for this long.
B
Yes, that's a very good point. It could be totally separated from telemetry, and telemetry can just use whatever reports that tool can provide. So yeah, the user can use that orthogonally to telemetry.
E
So, you know, if we're thinking about calculating the score in the telemetry report, we don't have to do it that way. We can just have the telemetry report take whatever score has already been calculated on the cluster side, which maybe has more resolution, you know, has been collecting it every minute or something, and the telemetry report would include that score rather than calculating it right before the report is sent.
E
Okay, I'll go to the next one, just since we're short on time. The final topic for telemetry is identifying OSD performance outliers in the manager.
E
There's a Trello card, which is linked here, for Reef, expressing interest in collecting or identifying OSD performance outliers on the manager side, and Neha had a thought that the perf channel in telemetry is already collecting a lot of performance information about each OSD, so it would be a good idea to use that information to identify outliers. This is a fairly new topic, so we haven't really had any discussions on what it means to identify outliers, but yeah.
A
I think it's written... I think this is more like, you know, just a thought that now that we have all this information, we'd better use it for, you know, something that we've been thinking we would do. But yeah, this needs to be thought through and discussed in, you know, the telemetry huddle or something.
E
Yeah, I just wrote down some information about what the perf channel is collecting per OSD: we're collecting perf counters, histograms, mempools and heap stats, so those data points might help us identify outliers, and of course more conversation needs to happen about it. But it was just an idea for Reef, and that's really all I have to say. Were there any final questions? I don't want to take away from the next topic, but if there are any comments, you know, again, just add them wherever.
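For a flavor of what "identifying outliers" over those per-OSD data points could mean, here is a toy, self-contained sketch using a median-absolute-deviation rule; the input dict and threshold are placeholders for whatever a mgr-side implementation would actually use:

    # Toy outlier detection over a per-OSD metric (e.g. one latency perf counter).
    from statistics import median

    def perf_outliers(per_osd_metric, threshold=3.5):
        """Return OSD ids whose metric deviates strongly from the cluster median
        (modified z-score based on the median absolute deviation)."""
        values = list(per_osd_metric.values())
        med = median(values)
        mad = median(abs(v - med) for v in values) or 1e-9
        return [osd for osd, v in per_osd_metric.items()
                if 0.6745 * abs(v - med) / mad > threshold]

    # Example: osd.7 is much slower than its peers.
    latencies_ms = {0: 4.1, 1: 3.9, 2: 4.4, 3: 4.0, 4: 3.8, 5: 4.2, 6: 4.1, 7: 55.0}
    print(perf_outliers(latencies_ms))   # -> [7]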
C
Actually, two of them. Is providing such a guarantee a relevant thing today? And an even more general question: do we really, honestly, have any users of the C++ API of the userspace RADOS API? The C API, for sure, is widely used, but I'm doubtful when it comes to the C++ parts. So maybe we could actually remove some of the mental burden put on us just for the sake of providing a thing that is not so widely used and necessary.
I
It was my understanding that the C++ API was internal, and that sort of started with conversations which we had with Jason Dillaman three or four years ago, and then we created... and I guess I have to ask: I'm also looking at your GitHub, or your tracker issue on this. Is this...?
C
Well, my understanding is that we have only... as you said, my understanding is we only have internal clients, and we don't need to worry about external ones. But, that said, I would really love to confirm before dropping the guarantee.
A
Okay, I think it doesn't make sense to discuss this any more here. I think following up on the mailing list would be a good idea, and then go ahead with it.
C
Yes, I linked the tracker in the chat. Basically, in the buffer header there was a patch that introduced a specific construct, and because of that we violated the C++11 guarantee we had enforced. We have one test, in teuthology, very correctly, that tries to build an example program using C++11.
C
Got it. Anyway, my understanding is that there are no objections here to stripping down this guarantee, and we can move forward by basically sending an email to the dev list.
A
Cool, all right. Thanks, Radek. Moving on: block devices with compression. Josh, do we know if the relevant person is here, or...?
H
Yeah, I think someone who brought this up...
D
Yeah, hi Neha. Martin is going to present that; Martin is online.
K
Hi, yes. So I don't have any slides or anything for that, sorry about that. This is about a patch that we propose in order to support block devices that have internal compression capability.
K
IBM has developed, and actually even sells, one of these devices. It's an NVMe device, and its usage model is very similar to the VDO device that is already supported within Ceph. It provides a very large logical block space, which is backed by a smaller physical space, and it provides information about the state of the physical block space via a sideband interface, which is basically some special NVMe commands.
K
This is very similar to what the VDO device does. The VDO device is basically an inline compression module that is part of, well, can be inserted into, the Linux kernel, and that uses a standard NVMe device, or any block device, at the back end.
K
There is already support inside Ceph for that, but this support is buried inside the block driver and it's referenced by the kernel device. What we're proposing is to encapsulate this interface between the kernel device and the block driver into a plug-in API, and this patch basically contains this plug-in system. The plug-in system itself is based on the erasure code plug-in system that already exists in there, so the structure is basically identical, and it defines an interface that allows querying the physical block state of the compression device.
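To visualize the shape of what is being proposed, purely as an illustration (the actual patch is C++, modeled on the erasure-code plugin loader), here is a sketch with all names invented, of an interface that would let the OSD query the physical state of a self-compressing device:

    from abc import ABC, abstractmethod

    class CompressionDevicePlugin(ABC):
        """Illustrative only: the sideband interface a plugin would wrap so the
        OSD can ask a compressing device about its real physical usage."""

        @abstractmethod
        def physical_capacity_bytes(self) -> int:
            """Physical space backing the (much larger) logical address space."""

        @abstractmethod
        def physical_used_bytes(self) -> int:
            """Physical space currently consumed after inline compression."""

        def physical_utilization(self) -> float:
            return self.physical_used_bytes() / self.physical_capacity_bytes()

    class VDOLikePlugin(CompressionDevicePlugin):
        """Stand-in backend; a real plugin would issue vendor NVMe commands or
        read VDO statistics instead of returning constants."""
        def physical_capacity_bytes(self) -> int:
            return 4 * 1024**4        # 4 TiB physical
        def physical_used_bytes(self) -> int:
            return 1 * 1024**4        # 1 TiB used after compression

    dev = VDOLikePlugin()
    print(f"physical utilization: {dev.physical_utilization():.0%}")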
K
This basically allows adding multiple different plugins for different hardware later on. So the patch for this is, yeah, provided in this...
K
...in my public repo; it actually consists of three commits, sorry about this.
K
The last commit that you just showed is one that basically supports some additional things that we encountered when running tests with it. For example, running out of the build tree needed some patches, and there was one dependency required that would force the block device driver, which actually instantiates or references the plugin system, to use common, because the plugin system is actually pushed into the common library. But essentially the two main commits of this... I will...
K
Okay, so the main two commits are: one that implements the entire plugin system and wraps the VDO driver into a plug-in (so basically I took the code for the VDO support and put it into a plug-in so that it's used via that), and there's actually a second commit that I separated out, which adds a keep-caps bit to the state of the OSD.
K
We need this for our particular plug-in in order to do certain NVMe pass-through calls, ioctl pass-through calls, that require certain kernel capabilities, and when switching from the root user to the ceph user you usually lose them. This patch basically maintains them, and for capability-aware plug-ins you can activate the capabilities that you need.
K
So you're basically on the wrong side of this merge; I have to rebase my patch to make it more clear, sorry about that.
A
No worries, no worries. I think this sounds exciting, and it's worth discussing at a CDM, where, you know, maybe you can more formally prepare and share it with the broader group. What do you think? The CDM is essentially a once-a-month session where we have a developer meeting, so maybe we can choose whichever one works for you.
F
Oh yeah, yeah. I just want... so I guess I know that there's some integration with VDO that already exists. Have you looked at how well it works for your block device? Because I never got the impression the VDO integration was really fully baked.
F
It was sort of a technology that people wanted to put together, but you know, VDO trades off CPU usage for lower storage utilization, and that's sort of the wrong optimization for Ceph. Having it in the device is a lot more interesting, but I'm just wondering what happens when we start running low on space.
K
Yeah, so again, as you said, in this case you don't trade off CPU cycles, because it's the device that actually does the inline compression, and it's doing it at full line rate, at full bandwidth; there's no impact whatsoever. And yes, I agree, the integration does not seem complete.
K
You at least get the feedback to the upper layers of Ceph, so you will get warning messages that your cluster is running full if it really runs out of physical space, and you can take measures at that point, similar to when your non-compressed device basically runs low. But yes, the displays are not that nice, because the ratios do not consider the compression one hundred percent; you basically get a certain starting capacity, which is your physical capacity.
A
Let's move on to the next topic, which is QoS. I believe we've got Aishwarya and Sridhar; however you want to drive this topic.
G
Yeah, so if you could just open that etherpad and I'll just walk you through it. Yeah, that is being done, yeah. Can you guys see it? Yes, I can see it. Yeah, so I thought I'd just give some background on the current work that has been done for the client-versus-client QoS. This has been in the works for quite some time in the past, and folks like Eric, Sam and others have been working on this. So I basically based the current changes that I have made on the original PR that the folks had worked on (I highlighted it there, 20235, that's the one), and adapted it to the current implementation of the mClock scheduler.
G
So the PR that I'm working on is currently still a work in progress, and I have provided a link to that as well.
G
There are still quite a few things to work on there, and I'll just present the current state and the next steps that we can take. So, the current state of this: I'll just enumerate the changes that have been made as part of this new PR. The first item talks about the service tracker.
G
Basically, we incorporated the dmClock service tracker in the Objecter code. The service tracker essentially tracks the response from the dmClock server and then proceeds to calculate the request parameters, like delta and rho, which are essential for the rest of the dmClock algorithm to work. So the service tracker object essentially does this work for us, and, based on the implementation of the original PR, I have added the changes, which essentially follow the same thing that the original PR intended.
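As a heavily simplified illustration of what such a client-side service tracker does (the real logic lives in the dmclock library and the Objecter; this toy only shows the delta/rho bookkeeping idea):

    # Toy dmClock-style client tracking: count responses received since the
    # last request to a given server and report (delta, rho) with the next one.
    from collections import defaultdict

    class ServiceTracker:
        def __init__(self):
            self.total = 0         # responses received from all servers
            self.reserved = 0      # responses served in the reservation phase
            self.seen = defaultdict(lambda: (0, 0))   # per-server snapshot at last send

        def note_response(self, served_by_reservation: bool):
            self.total += 1
            if served_by_reservation:
                self.reserved += 1

        def request_params(self, server_id):
            """(delta, rho) to attach to the next request sent to server_id."""
            last_total, last_reserved = self.seen[server_id]
            delta, rho = self.total - last_total, self.reserved - last_reserved
            self.seen[server_id] = (self.total, self.reserved)
            return delta, rho

    t = ServiceTracker()
    t.note_response(served_by_reservation=True)
    t.note_response(served_by_reservation=False)
    print(t.request_params("osd.3"))   # -> (2, 1)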
G
The other change is the messaging changes to allow, you know, clients to send the request parameters like delta and rho, and then, on the response path, receive the QoS response, like the cost, in the interface that the service tracker uses to calculate the request parameters for the subsequent requests from the clients.
G
The other bit that I have integrated, looking into the original PR, is the QoS profile manager, which essentially encapsulates the client QoS profile and the service tracker. This is essentially used by the Objecter in librados, and it basically acts as a conduit to pass the QoS request parameters and then receive the QoS response.
G
The other bit that I have added, which is not there in the original PR, is the client registry. This is again based on the implementation in the main dmClock code. The client registry essentially tracks the map of the external clients and the QoS parameters associated with those external clients, and it also introduces cleanup logic to remove stale clients from this registry.
G
So, for example, if a client becomes silent for quite some time, it doesn't make sense for us to keep that entry around for so long. The cleanup logic essentially uses an age-based, idle-time-based cleanup mechanism and cleans up this registry periodically. And this PR also implements quite a lot of unit tests to test all the above functionality that I have mentioned. Yeah, so this is the current state of the PR.
G
It's in line with whatever folks had thought of earlier, and feedback would be highly valuable. And then there are a few things about the QoS profile manager that I need to really look at. So essentially, what happens now is that the profile manager is implemented like a static, process-wide kind of variable, and I saw a few changes that were made by Sam to make it per-Objecter. So that's something that I'm trying to take a look at; those are things I need to look at and discuss, and then proceed to make changes there.
G
Apart from that, yeah, based on the feedback we can get into testing with the actual librados client software after incorporating the new APIs, so that's the ultimate goal. I was not sure about the support for other types of clients, like RGW and CephFS; maybe that could be taken up at a later point, but right now this PR essentially focuses on librados.
G
So that's the current state of this client QoS work.
M
Certainly that sounds very impressive, and I'm very appreciative that you're picking up the ball here and going with it, because yeah, it was a lot of work, and at some point RGW needed my attention, so I left it, and Sam pushed it forward quite a bit. But thank you for doing all of this.
G
Yeah, sure, a lot of interesting work. So I really hope that you guys take a look at this new PR and then give me feedback so that I can take this forward, yeah.
G
If there are no other questions, we can go ahead with the next topic, which is handling high-priority operations. This kind of came up on a Trello board just...
A
One second, before we go into that: maybe it's worth giving, like, a two-minute overview of what currently exists in Quincy and why this handling of high-priority ops is required. I don't know, whoever wants to do that, maybe just give a quick summary of what exists at the moment, for folks who are not aware.
G
So, you could say we essentially added the mClock capability only on the OSDs, to provide QoS for operations like client ops, recovery operations, and other background operations like scrub, snap trim and PG deletions. The testing for this is currently in progress on large clusters; on a small scale, of course, we have been able to test a few of these, and the fine-tuning of the profiles for the background operations is still in progress.
G
I think we have pretty much narrowed down the things that we want for the config profiles, but largely it looks good for the client ops and the recovery operations, and a bit of fine-tuning is necessary for the background operations. So that's where we are currently with the changes in the OSD. So this came up...
G
Yeah, so this came up because we saw an issue where, in some cases, PGs in a premerge state were stuck in, you know, backfill_wait state for a long time, and there was a need... we failed to handle such PGs stuck in this state at a higher priority. So essentially the requirement is for the...
G
So this is still in the conceptualizing stage, but some solutions that I thought of were: one, which involves using a queue which has a slightly lower priority than the current immediate queue that we use in the mClock code. Now, the immediate queue is not handled by the mClock scheduler; rather, it's used for very high-priority operations, like replying to replication operations and things like that.
G
That's one solution that I was thinking of. The other one was to introduce some kind of logic where, after looking into the high-priority field, we could manipulate the cost of such items, basically lower the cost, so that, once it gets put into the mClock queue, mClock is able to dequeue these kinds of operations sooner.
A
Yeah, I don't think we need any other, third solution; one of these should work. I think we discussed solution one, I believe, in the QoS call, right? Yeah. I think either should be fine; let's probably, you know, look at the feasibility in the code and see which one we can get merged faster, because this kind of completes the picture of background QoS, and then, you know, we can completely focus on client versus client, which I believe is going to be the tougher nut.
A
Cool. Any questions, anything else? Yeah, somebody had a question.
A
All right, cool. So if there's nothing else on this one, the other piece of QoS I wanted to just quickly mention is the CoDel PR, which is actively being worked on by folks from UC Santa Cruz, and there have already been CDM sessions on it. I'm not going to go into too much detail, but we are pretty close to getting this merged, and the basic idea is admission control at the BlueStore layer. Sam, anything else you want to add to this?
N
Maybe I will jump in for a second. For some time I was very picky about this PR, I did not want to merge it, but recently I understood that, while its algorithm might be a bit deficient, it still provides valuable infrastructure to actually improve upon the rate control algorithm. So I will quickly review this, and because it was already in a very good shape technically, from a technical point of view, long ago, I think we'll merge it rather sooner than later.
L
Cool. This is Sam; I think it's pretty well isolated from the rest of BlueStore, so I think it's pretty safe to do, and we'll want to make further improvements to it, so merging it will make it easier for people to do performance testing.
A
Yep, I agree. All right, for folks who are interested in further details, in the PR itself there's a bunch of links to presentations and the code walkthrough, so feel free to look at that.
A
Okay, I think with that all the QoS stuff is complete. Next, I just wanted to touch upon a few recovery improvements and changes that I believe we have in the pipeline, and it's worth considering them for the Reef release.
A
It is a new erasure-coded plugin, but it has been marked experimental since then. Last year there was some discussion on the users email list about removing the experimental flag, and in the process we found out that there have been users who have been using this successfully since we released it. But the only... I think the common concern was that it made some changes to the EC back-end logic, which made us think, you know, twice before removing the experimental flag.
A
So in general, I think the question I want to raise is: do we think we should just start testing it more and give it, like, a full cycle, not make it a default or anything (not even suggesting that), but remove the experimental flag, if we think that it's not breaking our tests, or the additional test coverage around EC that we could add in Reef.
A
Rather than the normal plugins, it optimizes for network bandwidth.
A
That is a very good question. So there are two PRs that went in when this got added; the follow-up PR is where there were some EC changes that were made by Sage. Those definitely need to be revisited. I don't remember off the top of my head. Josh, do you remember?
A
A project... it was like a project, an outside contribution, which we definitely welcomed, and it looked very promising. Yeah, you had something?
D
Yeah, so I think there was one thing we wanted to ask this set of users, and also to run some extensive tests, right, at scale: like how the EC pools will behave with the RGW workload especially, because the larger use case of EC pools is with the RGW workload, right? So...
A
Yep, yep, that's exactly what I'm saying when I'm saying investment: those are the kinds of things we'll probably want to do before. And yeah, I think, I'm again suggesting removing the experimental flag, so it's not breaking anything badly; I think I'm more concerned about the corruption aspect of things. So all right, I don't think there's much more discussion required; I think we all agree: revisiting the EC code...
A
...changes and further testing is what we need. The next one is another PR that we've discussed at a CDM, and there is a very detailed... is Mykola on the call, by any chance?
A
Yes. Mykola, this is your PR that we held off merging for Quincy, but I think we definitely want to go ahead with it for Reef. Do you want to briefly cover what the purpose of this PR was?
O
...persist this missing information on the secondary with this too. And here the patch, it kind of worked, but when I was testing it in teuthology I found some edge cases where it still didn't work, and yeah, right now I don't remember the details. I haven't had a chance to open it for a long time, but probably, yes, it's a good idea. So yeah, I still have plans to look at it, analyze the results of testing, and probably come up with a better solution.
A
The missing sets, because I remember that there were some corner cases that you ran into while fixing some other problem, where we were actually losing this information, and in Nautilus it was leading to a crash or something, if I recall correctly. But all the context is in this PR; there's a bunch of discussion we had around it. I think the important piece here that I want to focus on was the testing piece.
A
I remember that, I think, the additional piece that we discussed at that CDM was that we lack enough tests that are...
A
All right, if not, let's move on. There are a couple of items that I added here, again mostly for everybody's information, but these are PG-log-related improvements. There is a well-known issue that was reported around the accumulation of dups in the PG log, which led to out-of-memory conditions, and we now have a PR that fixes this from Nitzan.
A
We just need to get this done. And similarly, there's another PR which is essentially going to be logging the size of the PG log entries, because in the past we've run into issues where one PG log entry, for whatever reason, has been huge. It's not something, I would say, that is fixing a problem or anything; it's more like developer knowledge. It saves us the trouble of going back and dumping the PG log and all kinds of objectstore-tool activities. That is also another PR that Nitzan from the RADOS team is working on.
A
Yeah, it was that the primary would send a message to the peers, yes.
L
Not immediately. I would need to review what the problem was in the first place, but I'm just going to say that any change that involves changing the message protocol, or adding messages between the OSDs, is a vastly...
A
...yeah, go ahead.
O
From my point of view, adding a message would complicate things very much, because, from what I remember, what I saw, even after adding it I still had some edge cases where it didn't work as expected in the tests. Probably it was just a problem with my implementation, but still, yeah. So if we think about another solution that does not require sending those messages at all, and instead acts differently, then yes.
O
When you restart an OSD, this information is lost and it just enters the active+clean state; but if you run a deep scrub, it will find this problem again. So the problem with our customer was that they didn't even run scrub, let alone deep scrub (they had this disabled), and were not aware of this problem. But they were still very unhappy about it and wanted us to fix it.
A
Cool, all right. And just to make sure: except for the condition that you just described, Mykola, there weren't any other upsides to doing this, if I recall correctly, right? Like, is it going to be handling any other weird edge condition?
O
No. Basically, this is about fixing this problem, which for us, I mean for me, the developer, does not look very critical; it's rather a minor issue, that the information about... well, it was...
L
Yes, but I'm saying the general version is the one where it gets detected during scrub; that's a lot more common. We don't currently write that down; we don't propagate it back to the replica or write it into the primary's missing set. So when you reset the acting set, that information gets lost until you re-perform a scrub. There's a special-case version of this where, if backfill is operating and it expects to find an object that isn't there, because again the PG is corrupted in the first place.
A
No, I'm all for, like, you know, reducing complexity. I still think the error injection tests and all that can still be improved, given that there are certain edge cases and things that users sometimes report and we are still not catching them enough.
N
Yeah, I mean, I will maybe quickly go through my points. The first one is: I want to make a continuation of some allocator tests that were done very long ago; basically, I didn't properly finish them. It stems from the recent tests Mark was running, where we had performance problems, a difference between Quincy and Pacific that we couldn't explain. Basically, to cut it short, we found that the allocator had a deficiency in some situations.
N
The overall problem with BlueStore allocators' behavior is that we basically pick one issue at a time: when we find some problem with an allocator, we do a modification, either to reduce fragmentation or to improve its speed, and in some conditions that is actually where we were having problems.
N
...fragment the data, put objects, basically do some simulation of a real-life workload.
N
With some data points, like testing RBD with a highly variable, high-rate random write, with the OSD filled to like 80 or 90 percent, and on such created states then performing allocator tests. The goal is to actually make a comprehensive table of our allocators' performance in different, almost corner-case scenarios, so we could actually validate them. And the usage for this would be, like: recently Mark made a very good change for the AVL allocator to keep it always moving its last-allocation search position, even if a proper place was not found, just so it doesn't repeat the same search next time. But currently we don't know how this will behave if our OSD is severely fragmented and also very full, and whether it wouldn't be better to actually pay more time now to find some better placement, even though we are just breaking some requirements. And yeah, basically, that's it. It's something where I would just make infrastructure for testing our new allocators, and the existing ones also, so we could know.
P
Adam, I think one of the tricky things with this is that we're definitely finding that different hardware behaves differently, not even so much just, you know, hard drives versus flash, but within, you know, the range of different flash hardware we're seeing different behavior. So it's a tricky problem.
N
Yeah, okay, I might have omitted that such testing would maybe be an analog of RADOS bench, so it could actually be run on a deployed (maybe empty, but deployed) OSD on actual hardware, not a synthetic test run on some validation hardware.
N
But if we suspect that something is fishy in some actual hardware environment, then we could use that to verify.
P
I think one of the things that can be really difficult for us is figuring out, like, what is the cost/benefit of spending time now to reduce fragmentation that will hurt us later, right? Like, you know, do you really need this one IO to be guaranteed to finish fast, or is it okay for this IO to take a little longer if it reduces fragmentation on the drive? So it's a hard question to answer, and then to test for.
N
Yeah, Mark, I totally agree, and to actually find out those boundaries we should have tests that push us very far, like completely filling the drives, because we know that for high...
P
Yeah, I mean, the whole reason why Kefu's PR went in last summer for the AVL allocator was because, I guess, you know, we were spending excessive time in near-fit searches, to the point where it was having a huge impact on customer clusters.
Q
Just... almost a side note about fragmentation.
Q
So I just want to say that I'm not sure this...
Q
Perhaps, yes. So, well, my point is: we need some means to fight the fragmentation. Is it defragmentation, or any other means? Like, yeah, probably defragmentation; that's the only thing which comes to my mind. But yeah, so the idea is we need something at a higher level than the allocator to fight it.
P
I agree with you, Igor. I think it's probably good for us to try to optimize for fast turnaround on allocations, maybe not ridiculously fast, given everything else that we deal with, so some, you know, "don't fragment everything immediately", but yeah, for cleanup, especially of really full disks.
N
Okay, so my take from this would be that pushing down such an endeavor will be better: just not making excessive testing infrastructure for the allocator, and instead thinking more about reclaiming and merging free space in some way, some defragmentation.
N
In actually spending the budget there, maybe I should focus more on making the defragmentation first, and then trying to improve and find some corner cases for allocators; that might be more gain.
L
Maybe, but a background defragmenter is going to require a lot of empirical evaluation; there's a lot of freedom in how you implement something like that. So I don't know, testing the fragmentation behavior seems pretty important to me.
N
Okay, in that sense, having an infrastructure that can test behavior could be a valuable input to also evaluate the defragmenter. I agree, yeah, I get that.
N
Go ahead. So, next topic: I guess it should be mentioned after Igor talks about his work on a custom write-ahead log for RocksDB, so I will postpone that. And the third topic is: we seem to have, I mean it's basically confirmed, in some cases a situation with deferred writes done to objects, meaning, instead of actually writing to the drive, we only mark it in the write-ahead log, in RocksDB, in the deferred operations table.
N
This is something we basically introduced when, in Nautilus, we dropped the mechanism that moved data blocks between BlueFS and the main BlueStore device, and that change caused us to open a window for data corruption. So it's possible that after a restart we will corrupt some BlueFS data. There are two possible solutions for this issue: the bad and quick one is to not execute a deferred write when we detect that it would destroy active BlueFS data.
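A toy illustration of that "quick" mitigation: skipping replay of any deferred write whose target extent now belongs to BlueFS. The data structures here are invented; the real check would live in BlueStore's deferred-replay path:

    # Toy model: on replay, drop deferred writes that overlap space BlueFS now owns.
    def overlaps(a_off, a_len, b_off, b_len):
        return a_off < b_off + b_len and b_off < a_off + a_len

    def safe_deferred_replays(deferred_ops, bluefs_extents):
        """deferred_ops: [(offset, length, payload)]; bluefs_extents: [(offset, length)]."""
        for off, length, payload in deferred_ops:
            if any(overlaps(off, length, e_off, e_len)
                   for e_off, e_len in bluefs_extents):
                continue   # replaying this would clobber live BlueFS data: skip it
            yield off, length, payload

    ops = [(0x1000, 0x1000, b"a"), (0x8000, 0x2000, b"b")]
    bluefs = [(0x8000, 0x10000)]          # BlueFS now owns this region
    print([hex(op[0]) for op in safe_deferred_replays(ops, bluefs)])   # -> ['0x1000']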
Q
Because we keep the deferred writes in the database after, okay, the chunks they are attached to are actually released, and hence on a non-graceful shutdown we might want to replay these deferred writes again, and they get into...
Q
So on one hand, we don't want to overcomplicate the deferred write processing.
Q
We are not that limited in CPU cycles, for instance, to do these checks, so it doesn't matter much if we need a bit more time during deferred replay on startup, rather than performing such checking on each write during regular operation.
Q
And just a couple more comments: I managed to reproduce that using a vstart cluster, and I saw that in at least one cluster in the field, and also, from time to time, we are getting some RocksDB corruptions with unknown causes, which might be caused by this issue. So I can't say it's very seldom; most probably we just haven't identified it properly all the time.
N
Yeah, actually my thinking was to first make a quick mitigation, as much as we can without any change, and then, on a longer timescale...
Q
And just another comment on that: this actually happens when the DB uses the main device, so either it's collocated with the main volume, or it spills over; it doesn't happen if you have a standalone one.
L
It's true that it won't corrupt RocksDB, but it could still overwrite another object's data, right?
Q
Well, what I'm trying to say: I think it's not possible, it's not corrupting user data. I can't prove that, but my current understanding is that it causes corruption to... well, it might only cause corruption to BlueFS.
Q
So, which means that if you use a standalone BlueFS volume, you are safe.
Q
Well, I'm not saying it's perfect, but just to mention it.
Q
But I completely agree that we need to fix that issue. The question is how to do that in a simple and straightforward way which wouldn't bring new issues, and yeah, again, trying to avoid this during regular operation looks not that easy.
Q
I was quite surprised when I realized that I could see that when using SSD drives only. I need to double-check that, but I have a concern that it might affect all the configurations as well.
Q
Yeah, I can say that I was planning some extensive discussion on that, just to try to recap the features you might want in the upcoming, in the next major release. The first one is this new write-ahead log, which I mentioned and which I'm currently working on.
Q
The idea is to remove the write-ahead log, that is, to replace the RocksDB-embedded write-ahead log with a standalone one residing at the BlueStore level, which will allow us to parallelize access to it from multiple threads. That is not the case with the RocksDB write-ahead log currently, as we access RocksDB from a single kv-sync thread.
Q
For the PG log: instead of using RocksDB, we can have a more natural way to keep the PG log using this write-ahead log, since we get more control over it; when we have it within BlueStore, we get more chances to do that.
Q
I still need to think about how to ensure consistency between DB transactions and PG log updates, so I just need some more time to proceed and think about it. And also there are at least a couple of new issues or tasks which we might want to fix or implement before moving the write-ahead log to production.
Q
The first, well, actually the second one I mentioned here, is that we need to refactor our BlueStore statfs tracking mechanism, which currently updates statfs on each metadata update transaction. Similarly to the allocation-map work, which avoids allocation map updates on each transaction but provides means to recover that on startup from the database, we can do the same for BlueStore statfs in absolutely the same manner.
Q
What it would be doing is enumerating all the objects, retrieving the chunks which are allocated for them, and building the allocation map. This happens on non-graceful shutdown only, and during the same process we can learn how much space each is keeping. So again, I'm on that at the moment, hopefully to publish this piece soon.
Q
Another thing that might be in front of us, and this applies both to the new write-ahead log and, to some degree, to Gabi's allocation map recovery, is, let's say, circular dependencies between the database, allocation maps, and some additional entities, like the write-ahead log recovery procedure. And it looks like, well, for instance, for the recovery procedure...
Q
...but this happens before the allocation map is retrieved, and hence BlueFS is unable to allocate space from the main device. So the workaround for this issue is, again, to use a standalone one: it tracks allocation maps independently of the database, and hence we are able to allocate chunks from it at any time. But for a shared device...
N
...our metadata on allocations from, okay. Again, my thinking is: maybe if we had the write-ahead log, your new one, maybe we could revert what we did with allocations, meaning we would not apply modifications to RocksDB, because we would have an allocation file with the allocations, and any updates that we did not do would reside in the write-ahead log, just piggybacking.
Q
That is orthogonal to the allocation map, and actually introducing this log makes this circular-dependency issue more visible, so I definitely need to avoid these circular dependencies for the new write-ahead log, and the only way I can do that right now is to use a standalone BlueFS volume for the DB.
Q
All right, but for the existing write-ahead log, we...
Q
There are probably some other issues with the new write-ahead log which make this more important, but it's hard to explain during the call; probably better offline.
Q
The major reason behind that is to improve BlueFS robustness, and to do that we might want to implement some redundant superblocks, for instance, and start using 4K allocation units. And maybe, well, not maybe, it would definitely be required for 4K allocation units: we will need expandable superblocks, because a smaller allocation unit sometimes results in a lack of... so they might suffer from the lack of enough space.
Q
...hence gaining performance at the cost of data safety. So for some use cases users might want to get performance but not care about data safety. And I did some PoC on that, and indeed the performance gain is pretty large, and the implementation is actually completely independent of the other OSD stuff, so why not, as a simple method.
A
Yeah, I think we discussed it in the performance call and we agreed that it's a great idea. Maybe once your PoC is ready it'll be useful to send it out to the mailing list and get more feedback; I'm curious to know what those use cases look like.
A
I like it. All right, any questions, anything we want to discuss on that topic, or should we move ahead?
A
All right, not hearing anything, so I will gloss over the next couple of topics. These are features we wanted to get into Quincy but that have spilled over; there are outstanding PRs for both. The first one is about configuration profiles that can be used for several purposes, on a pool level or... and there is a PR linked in this; I'm not going to go into further detail.
A
The next one is about automatic key rotation. There is also an outstanding PR that I think Radek has taken over from Sage; we are pretty close to getting that done for Reef. With that, I am going to hand it to Laura to talk about the balancer improvements that are in the pipeline.
E
Yes, so I'll go through this quickly. This is targeted for Reef, but it will be able to be backported to Quincy. Josh Salomon and I have been collaborating on a workload, or primary, balancer. So, in Ceph's current balancer implementation, it's important to balance write and read requests across OSDs for optimal performance, and the current capacity balancer works well to handle write requests, but there's still a need to balance read requests based on pool workloads. So, targeted for Reef, Josh Salomon and I have been working on implementing a workload balancer that will balance read requests on a pool-by-pool basis, and there was some work done for Quincy that made it possible for us to backport this.
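A rough illustration of the imbalance this targets: reads are served by each PG's primary, so an uneven spread of primaries skews read load even when capacity is balanced. The sketch below just tallies primaries per OSD from pg dump; the JSON field names are assumptions about the pgs_brief output:

    #!/usr/bin/env python3
    # Count how many PGs each OSD is primary for; a skewed distribution here is
    # the read-load imbalance a primary/workload balancer would smooth out.
    import json
    import subprocess
    from collections import Counter

    out = subprocess.check_output(
        ["ceph", "pg", "dump", "pgs_brief", "--format", "json"], text=True)
    dump = json.loads(out)
    pgs = dump.get("pg_stats", dump) if isinstance(dump, dict) else dump

    primaries = Counter(pg["acting_primary"] for pg in pgs)
    for osd, count in sorted(primaries.items()):
        print(f"osd.{osd}: primary for {count} PGs")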
E
I've also linked an open PR that will be included once the implementation has been merged, but it explains what the existing capacity balancer does, why there's a need for the workload balancer, how they would work in conjunction, and how they sometimes contradict each other for some use cases. This work is also set to be talked about at Cephalocon, so if you're planning on attending Cephalocon you'll hear more about the progress there. And then another thing I want to mention is something we have been working on but will carry into Reef: we want to improve balancer testing, and to do that we need more accurate examples of what... So this was brought up originally during the user+dev monthly meeting, but we created a tracker issue (it's at the bottom, the third thing listed in the appendix there) and we opened this up to the Ceph community so that people could share their osdmaps with us. This is, of course, voluntary.
E
We would never ask for osdmaps, you know, against people's will; this is all volunteered by Ceph community members. And the goal here is to improve balancer testing moving forward and to account for more realistic scenarios, and we hope to continue collecting during the development of Reef. That's pretty much all I have for that, and of course a lot of it is about the workload balancer especially, but hopefully, with the links I've included, if you have more questions or want to know more details, you can check those out. That's all I had, but if there are any questions, feel free to ask or comment.
A
Cool, thanks Laura. And, as you rightly pointed out, if you want to hear more about it or stay tuned on the progress, this will be discussed in the Cephalocon talk. So I think, with that, I'm going to move to the next topic, which is about the autoscaler. This is a very open-ended topic, I would say; there's nothing in particular that we have planned or that we want to do, but we are... is Junior on the call?
A
Yeah, so I guess, Junior, maybe you briefly want to talk about, like, you know, the checks and balances that were added in Quincy, and some recent issues that were brought to our notice, because of which we probably want to revisit what else we want to do for the autoscaler.
J
Sure. So for Quincy, I think the big thing that we added is the --bulk flag. Essentially, any pool that is a data pool that we anticipate will need a lot of PGs, you would create the pool with the --bulk flag, and that will essentially start the pool with the maximum amount of PGs that it can be given, and any pool that doesn't have, like, the --bulk flag...
J
The other feature, I think, is the noautoscale global flag, where you can turn off the autoscaler globally with just one command, rather than going into each pool and manually, like, turning it off. And the issue we just faced with this: there's a set of users, and that user basically upgraded from, I think, 14.something to, like, 16.2.7, which is, I believe, Pacific.
J
And there was some rebalancing issue where it takes, like, many days for, I think, the PGs to decrease, and basically, I think, the motivation behind this topic is: should we make the autoscaler not, like, do things behind the user's back, like, if the scaling of the PGs will, you know, really impact the performance of the cluster, or it takes too long for it to change the number of PGs.
A
Yeah, I think the general questions that we have on our minds are around visibility into what the autoscaler is doing, and especially this one; it seems to be a weird case where the autoscaler tried to scale the pools down by, you know, four times or something. So why did that happen, and, you know, what extra logging or improvements can we add to the autoscaler to be able to easily catch these kinds of cases?
A
In general, we have a max-misplaced-objects metric that the autoscaler uses to not create too many misplaced objects at a time. But given this particular case, it seemed like this is one case where the autoscaler was on in a large cluster, so even the number of misplaced objects, based on the total number of objects, did make sense initially; but there are still some unknowns that we are trying to figure out in this case. And in line with this, we want to, you know, in general see what else the autoscaler needs in order to perform well at all kinds of scale.
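For reference, a small sketch of exercising the knobs mentioned above from a script. The command spellings (--bulk on pool create, the per-pool bulk property, the global noautoscale flag, and autoscale-status) are recalled from the Quincy-era docs, so double-check them against your release:

    # Sketch: drive the autoscaler knobs discussed above via the CLI.
    import subprocess

    def ceph(*args):
        print("+ ceph " + " ".join(args))
        subprocess.run(("ceph",) + args, check=True)

    # A data pool expected to need many PGs: start it with the bulk hint so the
    # autoscaler gives it the maximum PG count up front.
    ceph("osd", "pool", "create", "mybigpool", "--bulk")

    # The hint can be flipped later.
    ceph("osd", "pool", "set", "mybigpool", "bulk", "false")

    # Turn the autoscaler off globally with one command instead of per pool.
    ceph("osd", "pool", "set", "noautoscale")

    # Inspect what the autoscaler is doing / would do.
    ceph("osd", "pool", "autoscale-status")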
A
All right, if not: there are two more topics. There's one about osdmap trimming on the OSD that was brought to my attention by Prashant yesterday. I do not have too much context on it, but it seems like there's a case that has come up where there are millions of osdmaps on the OSD that are not getting trimmed, because a trigger of osdmap generation is required.
H
Yeah, I added that one. It's been on the backlog for a while, since we haven't really done... we can look at it. I think it might be becoming more relevant again, with more analytics workloads being run on top of RGW, and other sorts of processing engines that end up doing smaller reads than the whole object.
H
...and then I went into the cost, basically, for a k+m erasure code. So basically, the read case: the read case is relatively simple compared to the write case, and I think it seems like it would provide significant benefit in CPU load for these kinds of cases. Another application would potentially be a log-structured format for RBD, which could be stored in erasure-coded pools much more efficiently, but where reads, especially small reads, would be expensive with the current striping strategy.
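A back-of-the-envelope illustration of why small reads are costly under the current striping (all numbers invented, and the "whole stripe from k shards" cost model is a simplification of today's EC read path):

    # Toy arithmetic: shards touched by a small read, with and without
    # sub-stripe ("partial") reads, for a k+m erasure-coded pool.
    import math

    def shards_read(read_bytes, k, stripe_unit, partial_reads):
        if not partial_reads:
            # Simplification: today a read fetches the covering stripe from k shards.
            return k
        return min(k, max(1, math.ceil(read_bytes / stripe_unit)))

    k, m, stripe_unit = 4, 2, 4096
    for size in (4096, 16384, 65536):
        print(f"{size // 1024:>3} KiB read: "
              f"full-stripe -> {shards_read(size, k, stripe_unit, False)} shards, "
              f"partial -> {shards_read(size, k, stripe_unit, True)} shard(s)")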
P
A
A
A
Okay
with
that,
I
think
we've
got
a
couple
more
broader
things
to
discuss
in
terms
of
testing
improvements
and
board
cleanup
and
deprecation.
I
know
we
are
over
time,
but
maybe
next
five
minutes.
We
can
quickly
touch
upon
these.
A
So
I
guess
the
goal
here
is
to
talk
about
testing
coverage
improvement
for
things
like
stretch
mode
that
got
implemented
a
couple
of
years
ago,
but
we
are
starting
to
find
bugs
when
our
downstream
folks
are
testing
it.
So
it
seems
like
we
need
more
coverage
across
stretch
mode
testing
in
topology
other
than
we
have
some
facets,
which
change
the
election
mode
and
I
also
believe,
there's
a
net
split
subsuite.
F
Yeah,
I
don't
think
the
netsplit
is
actually
run
anywhere.
I
started
on
it
and
it
sort
of
was
like
okay,
here's
here's
the
tasks,
but
actually
integrating
with
anything
useful
is,
is
not
done,
and
I
know
it's
confused
like
I
implemented
stretch
mode
and
there
was
no
pretense
that
the
technology
testing
was
adequate.
The.
H
F
Goes
back
some
to
some
of
the
well,
maybe
it
doesn't.
There
was
a
discussion
earlier
that
was
making
me
think
of
this.
So
there's
the
election
mode
testing
is
pretty
good
because
we
can
just
switch
it
for
the
monitors
and
run
it
everywhere
and
because
I
you
know,
wrote
a
new
rope
wrote
a
bunch
of
new
monitor
election
logic.
That
was
easy
for
me
to
architect
so
that
I
could
write
unit
tests
for
it.
F
But
setting
up
netsplit
is
a
pain
and
at
the
time
that
I
was
working
on
it,
then
we
just
couldn't
really
count
on
and
then
I
know
this
is
fixed
now
we
just
couldn't
really
count
on
one
test
that
took
more
than
two
nodes
really
ever
running,
and
you
know
we
just
mostly
can't
unit
test
the
peering
state,
which
is
in
lsds,
which
is
a
bummer,
but
basically
someone
with
you
know
more
time
for
development
than
me
needs
to
sit
down
and
make
right
tests
for
this.
F
I'm
doing
it
or
a
definitely
contact.
That's
basically
like
here's.
What's
written,
here's
what's
not
written
and
by
the
way
please
write
tests
because
we
need
them.
A
So how about we... I mean, at least one of the problems that you mentioned, like, you know, a test with more than two nodes: we don't have any kind of constraints at this point with the new dispatcher stuff in teuthology. So I think the best starting point would be to look at what the netsplit suite is doing, and maybe you can talk about that in your talk, and start running it and add more coverage there.
F
And then, since it's now possible to run things with three or five nodes or whatever, we could use that to start doing exactly the netsplit testing. But all those things need to be written.
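As a rough illustration of what the missing pieces amount to (this is not the existing teuthology task; the helper, host handling and iptables rules below are assumptions), the core of a netsplit test is just dropping traffic between two groups of nodes and later healing it:

```python
# Rough illustration only -- the real teuthology netsplit task will differ.
import subprocess

def _iptables(host: str, args: list[str]) -> None:
    # Hypothetical helper: run an iptables command on a remote test node via ssh.
    subprocess.check_call(["ssh", host, "sudo", "iptables", *args])

def start_netsplit(group_a: list[str], group_b: list[str]) -> None:
    """Drop all traffic between every host in group_a and every host in group_b."""
    for a in group_a:
        for b in group_b:
            _iptables(a, ["-A", "INPUT", "-s", b, "-j", "DROP"])
            _iptables(b, ["-A", "INPUT", "-s", a, "-j", "DROP"])

def heal_netsplit(group_a: list[str], group_b: list[str]) -> None:
    """Remove the DROP rules added by start_netsplit."""
    for a in group_a:
        for b in group_b:
            _iptables(a, ["-D", "INPUT", "-s", b, "-j", "DROP"])
            _iptables(b, ["-D", "INPUT", "-s", a, "-j", "DROP"])
```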
A
All right, I think we at least have a path forward for this. The next one, and maybe the next few topics, are not just RADOS specific; I think the whole Ceph project can benefit from these. The first one here is about large logical scale testing.
A
This
is
not
not
much
discussion
to
be
done,
but
work
to
be
done
any
other
any
other
thoughts.
Anybody
else
has
on
this
and
why
I
said
this
is
a
crucial
not
just
for
raiders
I
mean
any
other
component
can
start
doing.
You
know
larger
scale
testing
you,
given
that
this
piece
has
already
merged,
so
we
are
not
literally
filling
up
the
the
devices
in
the
smithy
machines,
etc.
So
we
can't
afford
to
run
tests
with
larger
number
of
osds,
even
for
a
longer
amount
of
time.
If
you
want
to.
A
Yep, yep, that's a good point, and it kind of ties in with the config profile thing we discussed: we could have a config profile that would be, like, a low resource usage config profile that we could set, and the test would just, you know, tune the cluster based on that.
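A sketch of what such a profile could boil down to; the "low resource" profile itself is hypothetical, but the options it sets (osd_memory_target, osd_max_backfills) are real Ceph settings that it might tune down so many OSDs can share one test node:

```python
# Hypothetical "low resource" profile applied through real `ceph config set` calls.
import subprocess

LOW_RESOURCE_PROFILE = {
    ("osd", "osd_memory_target"): str(1024 * 1024 * 1024),  # 1 GiB instead of the 4 GiB default
    ("osd", "osd_max_backfills"): "1",
}

def apply_profile(profile: dict) -> None:
    """Push every (section, option) -> value pair into the cluster configuration."""
    for (section, option), value in profile.items():
        subprocess.check_call(["ceph", "config", "set", section, option, value])

if __name__ == "__main__":
    apply_profile(LOW_RESOURCE_PROFILE)
```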
A
All right, with that, the next topic is everybody's favorite topic and one of our pain points: upgrades. I think it's high time we improve our upgrade test coverage, and we had some discussion around how do we ensure that all the PRs that are getting backported, and even, like, new PRs that are coming in, are going through adequate upgrade tests or upgrade test runs. I feel this is going to be an ongoing discussion, but some of the basic ideas that came in were like:
A
If
there
are,
let
us
say,
encoding
changes,
or
there
are
pieces
of
code
that
are
prone
to
backward
compatibility
breakages,
we
could
have
things
like
github
hooks,
which
would
raise
some
sort
of
indication
in
the
pr
to
a
increase.
The
awareness
of
the
reviewer
be
kind
of
make
the
upgrade
test
run
a
requirement,
and
not
just
you
know,
leave
it
to
to
the
p
person
who's
testing
it
to
identify
whether
an
upgrade
risk
is
needed
or
not.
So
I
guess
there
are
a
few
ideas.
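A hedged sketch of what such a hook could check; the diff source and patterns below are assumptions for illustration, not an existing Ceph CI job, though ENCODE_START/DECODE_START are the real encoding macros:

```python
# Sketch: scan a PR's diff for touched encode/decode paths and warn that the
# change should go through an upgrade-suite run before merging.
import re
import subprocess

ENCODING_PATTERN = re.compile(r"ENCODE_START|DECODE_START|::encode\(|::decode\(")

def pr_touches_encoding(base: str = "origin/main", head: str = "HEAD") -> bool:
    diff = subprocess.check_output(
        ["git", "diff", f"{base}...{head}", "--", "src/"], text=True
    )
    # Only look at added/removed lines, not unchanged context lines.
    changed = (line for line in diff.splitlines() if line.startswith(("+", "-")))
    return any(ENCODING_PATTERN.search(line) for line in changed)

if __name__ == "__main__":
    if pr_touches_encoding():
        print("WARNING: this PR touches encode/decode paths -- "
              "schedule an upgrade suite run before merging.")
```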
A
We
need
to
start
implementing
some
of
this
github
hooks
thing
or
even
like
bring
it
to
our
day-to-day
when
we
are
reviewing
pr
things
that
need
upgrade
testing,
how
do
we
ensure
that
they
don't
get
merged
without
that,
and
the
other
thing
is
that
we
are
also
doing
baseline
baseline
tests
for
for,
like
almost
all
suites,
for
the
master
branch
every
week.
So
upgrades
is
going
to
be
something
that
we
are
going
to
do.
You
know
test
with
that.
A
If
not,
I
am
just
going
to
leave
the
last.
I
mean
this
is
just
a
general
topic
that
we
we
have
like
things
like
trello
tracker
tracker
in
which
we
track
features,
but
we
don't
have
anything
where
wherein
we
can
actually
track
test
improvements,
and
one
of
the
use
cases
was
that
we
had
a
contributor
who
who
had
a
valid
patch,
but
we
wanted
them
to
write
a
test
and
they
because
of
their
lack
of
familiarity
or
whatever
it
is
weren't
able
to
contribute
that
particular
test.
A
So it's a general question about what mechanism we want to use to track test improvements. We could do it in the form of, like, you know, a separate category in our Redmine component: like we already have bugs and features, we could have something like this that we're looking at at a regular cadence, or we could do Trello. But I believe this is also another topic where we want to get more developer feedback. Any other thoughts or strong opinions on this?
E
I just wrote a comment: I like the idea of using Redmine for that, because that's what a lot of users are familiar with, and it's a good idea overall, I think.
A
Yeah, I think Radek and I were talking about this, and we were like, okay, the obvious place would be Redmine. Adding a new category there shouldn't be a big task, but if you think that's a good idea, we can just go ahead and get that done.
A
And
yeah,
I
think
the
next
one
is
it's
again,
something
that
the
raiders
team
has
seen
so
currently,
the
entire
time
for
to
complete
a
raiders
sweet
run
is
like
five
hours.
I
would
say
not
less
than
that.
Of
course.
What
can
we
do
to
reduce
that?
So
I
think
this
is
again
a
common
topic
we
will
be
discussing
in
different
forums,
but
just
wanted
to
raise
awareness
on
it.
C
Sure. Well, actually, one of the ways to reduce the run time of our tests is to deprecate components. For instance, I think we're still testing FileStore pretty extensively. If we deprecate it, if we finally remove it, we could actually get back some of our resources, and we could either spend them on more, let's say, BlueStore testing, or just squeeze the time and the resources spent on a particular single run. So it would also be beneficial. However, I haven't... FileStore anyway, I'm...
C
I
really
I
would
really
vote
for
it.
A
Yeah,
so
I
think
we
can
that
prick.
We
have
already
sent
out
a
deprecation
warning
in
quincy
for
file
store,
so
I
would
say
most
of
our
test
cases
can
now
afford
to
not
run
with
file
store.
Maybe
there
are
things
like
upgrade
tests
and
stuff
which
still
should
use
file
store,
but
I
think
in
general
we
can
get
rid
of
it.
C
Easy... it's not a no-brainer. Another point is about MemDB: it's an outstanding PR.
C
And
the
question
is
whether
do
we
really
have
an
active
user
of
it,
because
if
not,
we
could
not
only
kill
one
of
the
kv
backends,
but
we
also
could
eradicate
some
complexity
from
buffer
list.
Actually,
there
is
batteries.
Buffer
raw
has
the
interface
for
cloning,
which
means
a
bunch
of
code
and
even
putting
some
extra
data
for
into
each
of
our
buffers
just
for
the
sake
of
cloning
for
the
cloning
responsibility
which
is
being
used
solely
inside
mdb.
So
I
believe
there
are
a
few
rabbits
to
shoot
the
same
ballot
here.
C
Being
unsure,
I
think
we
should
I
I
think
we
should
at
least
make
an
announcement
to
to
the
mailing
list.
Ask
ask
all
developers
I
mean
outside
whether
it's
still
useful
for
them
or
not.
I
would
expect
a
silence
to
be
honest,
but.
C
No,
I
think
it's
for
k
for
testing
the
the
kv
interface,
so
it's
not
used
by
memster.
F
...that down, because there's no way something like that is exposed to any external users; it's just about what we're...
A
Okay,
I
think
it
doesn't
hurt
to
ask
in
the
user
thread
if
there
are
existing
users,
they'll,
probably
chime,
and
then.
C
Yep. Well, another question it spawns is whether we need to go through the full-blown deprecation and removal process, or whether maybe we could just remove it immediately in Reef.
H
Like we said, I think it's a developer feature, not a user feature, so I think it's really a removal question, not a deprecation question.
C
I think we have this point addressed, so moving to another one. The next one is the legacy watch op, a very old one. However, there is one catch here: krbd, and actually kernels, kernel versions around 4.6.
C
There
is
a
last
comment
from
from
ilya
droimov
who,
who
made
us
aware
that
this
this
old,
this
legacy
app
is
still
in
still
is
being
used
by
by
on
very
old
cardinals,
okay,
everything
below
4.7
from
july
2016.,
six
years
coming
from
now,
six
years
after
after
the
release,
I
think
it
would
be
rather
safe
to
finally
kill
it.
C
Still,
I'm
not
entirely
sure-
and
I
I
will
need
to
I
really.
I
would
really
appreciate
feedback
on
that
battle.
The
benefit
the
potential
benefit
from
eradication
is
that
actually
the
watch
notify
machinery
it
got
into
now
it
got
in
the
classical
osd.
It
got
two
variants,
two
tastes,
two
flavors
one
is
assuming
that
we
can
reliably.
C
Discover
a
connection
reset,
which
is
the
legacy
one
well
internet
wrong.
The
assumption
is
wrong
and
later
implemented,
a
full
blown,
pink
punk
machinery
on
our
own
just
to
detect
just
detect
connection
issues
and
this
one
in
the
new
flavor.
Only
the
new
flavor
got
implemented
in
crimson.
C
...at this stage of Crimson, because it requires basically some interaction with ms_handle_reset of a messenger. In other words, it makes it a necessity for our sessions to be aware of the watch connection state.
C
If we can remove that, it will for sure decrease the complexity of the watch-notify stuff pretty significantly.
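As a toy illustration of the distinction being drawn (this is not Ceph code): the new flavor decides watcher liveness from explicit, periodic pings rather than trusting the transport to report a broken connection:

```python
# Toy sketch: a watch is considered alive only while pings keep arriving
# within a timeout window, independently of connection-reset events.
import time

class WatchState:
    def __init__(self, timeout: float = 30.0):
        self.timeout = timeout
        self.last_ping = time.monotonic()

    def on_ping(self) -> None:
        """Called whenever the watcher's ping reaches the OSD."""
        self.last_ping = time.monotonic()

    def is_alive(self) -> bool:
        """Liveness is decided by ping recency, not by transport state."""
        return (time.monotonic() - self.last_ping) < self.timeout
```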
F
Those are two different questions, I think. I can't imagine there's any reasonable argument that we need to implement the old version in Crimson. For removing it, though, what we need to do is look at what distros are shipping which kernels, and what the LTS releases are, probably, and then make a call, because if, you know, there's a still-active Ubuntu LTS that doesn't support the new watch-notify, we probably don't want to kill it until that's done.
C
Agreed, agreed. But actually, yes, there are two cases. The first one is Crimson implementing it: we don't need to, of course we don't need to do that, unless somebody would mandate the support of very old kernels, right, or not. The second one is whether it's removable, so yeah, I think you're also right: we need to start by taking a look at the current kernel versions.
C
Once we have the answer, okay; no need to spend more time on that. The last thing is inside the MGR module... I don't know whether that...
A
Alright, I agree, and that is pretty much everything that's on the agenda. We are 27 minutes over time, but I guess, is there anything else which is not on the agenda that you want to talk about?
A
If not, thank you for joining the RADOS session for Reef, and thank you for staying over time as well. Have a good rest of your day, see you later, bye. Thank you, everyone. Bye, thanks.