From YouTube: Ceph Developer Monthly 2022-12-07
Description
Join us on the first Wednesday of every month for the Ceph Developer Monthly meeting
https://tracker.ceph.com/projects/cep...
B
Okay, so I am a graduate student from Northeastern University, studying information systems. Today I'll be talking about Dependabot. So what is Dependabot? I'll just briefly explain: it is basically a tool which helps our project by watching for updates, to help our project have secure dependencies.
B
So why do we need it? What is the current situation in our Ceph project? For example, say a developer is working through an error and spends the entire day on it, and at the end of the day he understands that the error was basically due to a version not being updated: the project was working on an older version of a dependency, which was eventually giving that error.
B
So the developer has spent the entire day, and it was not needed. What if this process of keeping updated dependencies in our project was automated? How do we go about this? We'll basically need a list of the current dependencies and their versions. How can we get that? It can be through a spec file.
B
For Python, we can get it from requirements.txt, for the package manager pip. In the same manner, for npm we can get it using package.json, for JavaScript, TypeScript or Angular. Also, for Java-related dependencies we have pom.xml, or go modules for the Go language, and the listing also includes the dependencies' origins.
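For example, a pinned requirements.txt simply lists each dependency with its version; the packages and versions here are just illustrative:

    # requirements.txt -- illustrative packages and versions
    requests==2.28.1
    flask==2.2.2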
So why would we need to list the dependencies?
B
These listings help us understand the current version that we are using, and we can also check whether there is a newer version of a particular dependency. This helps the developer save a lot of time. Also, if there is any current dependency which is not secure to use, we can check that too. So here comes Dependabot.
B
So what does Dependabot do? It basically keeps your dependencies updated, to keep our software secure.
B
How does it do that? It provides two features: version updates and security alerts. It checks packages for these features in Ruby, JavaScript, Python, PHP, Elixir, Go, Rust, Java and .NET.
So what are the features of Dependabot? What does it provide? It basically provides version updates for our dependencies.
B
It also provides security alerts for any vulnerable dependencies that are in our project. So let's go with the first feature, and that is version updates. What does Dependabot do in its version update feature?
B
Under normal circumstances, if a developer comes across a dependency which needs to be updated to a newer version, the developer will manually go and correct the dependency, create a PR, and go through the whole process. So Dependabot comes into the picture: it will check the list of dependencies that are in a project and, if there is a new version for a particular dependency, it will update it and also create a PR for that project.
B
So all that effort is saved for the developer. Here's a visual representation of the PR that Dependabot creates for a version update. We basically have a bump for this particular dependency, node, from 12.12.21 to 18.8, and it also has reviewers for the particular PR.
B
So this is the actual PR. We have the commit message, and the commit message can be configured in a Dependabot configuration file: the dependency node being bumped from such-and-such version, and also where it is located. If any developer needs to go and verify all these changes, he can check it through the location. Also, through the description we can understand
B
what additional features are in the new version and whether they are needed for our project. There are also reviewers, so we can add the reviewers for the particular ecosystems, package managers or language-related projects ourselves.
B
So moving forward, we have the next feature, and that is security alerts.
B
So if there is any dependency in our project which is vulnerable and not secure, our project might be prone to a malicious attack because of that dependency. What does Dependabot do here? It scans all the dependencies in our project, checks if there are any dependencies which are vulnerable or not secure, and sends an alert in GitHub.
B
It also suggests a fix for that: it will create a PR for the closest non-vulnerable version of that particular dependency, so that we can move our dependency to a safer version.
B
So here's the visual representation of the alerts that Dependabot creates. We can basically sort these by severity: critical, high, moderate. So when a developer goes through these lists, he can sort them and give preference by severity.
B
So moving forward, we look at the implementation. The implementation is simple and easy. We need to enable Dependabot for GitHub, which is already done for the npm package ecosystem and for GitHub Actions. Then we configure Dependabot for all the dependency ecosystems that are required for version updates. An ecosystem is basically a package manager: for example, your package ecosystem is npm, or the package ecosystem is GitHub Actions.
B
So in Ceph, currently we have already implemented the ecosystems for npm and GitHub Actions. Also, for getting the security alerts, we enable it in our GitHub organization. Moving forward, we configure Dependabot using dependabot.yml in the .github directory.
B
So the configuration consists of the package ecosystem, that is pip, and the directory where we want these checks to be done. We can also schedule this, for example on a daily basis, or give a specific time; the intervals can be daily, weekly or monthly. As said previously, for the version update PR we can even configure the commit message and add a prefix, like the mgr/dashboard prefix we had for the dashboard-related project. Then we can add reviewers as well.
B
We can also limit the number of pull requests for that particular ecosystem. So we can have 20 open pull requests for pip for this particular directory location.
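Putting those pieces together, a dependabot.yml along these lines might look like the following sketch; the directory path and reviewer team are placeholders, not the actual Ceph entries:

    version: 2
    updates:
      - package-ecosystem: "pip"
        directory: "/src/pybind/mgr/dashboard"   # placeholder path
        schedule:
          interval: "daily"
        commit-message:
          prefix: "mgr/dashboard"
        reviewers:
          - "example-org/dashboard-team"         # placeholder team
        open-pull-requests-limit: 20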
B
So
with
all
these
fascinating
features,
what
what
can
be
the
challenge
in
implementing
this
in
our
project?
So,
as
you
see
like,
there
are
56
open,
PR's
and
153
closed
PRS,
so
any
developer
work
working
on
depend
about
would
get
like
on
daily
basis.
B
B
So the solution is that we can basically add reviewers. Currently in our project we have code owners, so if we are making any changes in the dashboard directory, then through code owners only the dashboard-related team gets the alerts about these and can review the PRs and approve them. Also, if needed, we can add reviewers explicitly: we can add a team, we can add a username, or we can add the organization.
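As a sketch, a CODEOWNERS entry of that shape might look like this; the path and team name are illustrative, not Ceph's actual file:

    # .github/CODEOWNERS -- illustrative entry
    /src/pybind/mgr/dashboard/  @example-org/dashboard-team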
B
Thank you, and if there are any questions, I'm open to answering them. Also, the mentors in my project who helped me through this are Chris, my manager, Christina, Justin and Gabriella.
C
I have a question for you. When you were working on your internship, what was the hardest part about it, and what was the part that you liked the most?
B
At the beginning, I would say, working on Dependabot was one of the interesting things, as I didn't know much: I just had the problem statement, that I needed to figure out how I can update the versions, and I went at it from scratch, like, how can I go about it?
B
How do I get the dependencies, know the version numbers, and how can I compare them? In understanding this, I understood how I can go about it, and I looked around on GitHub at other contributors and what they are doing about it, to see if I could understand what they are doing and apply it in my project. So that was the most interesting part.
C
That's great to hear, and I think in the project we've got, we had Dependabot in some areas and not others.
B
Yeah, we had Dependabot for the dashboard-related project, that is, basically for npm, but we can add it for others too, like for Python. There was specifically one PR that I saw where there was an update for Fedora, and if we could get that kind of update, like any OS-related major updates, it could be really helpful for our developers.
D
Regarding the scrub scheduling: the idea is to dedicate some 10-15 minutes to describing how scrub scheduling is handled in the existing code, and why the existing code is built the way it is, and then what the changes are that I'm trying to push for approval and what the benefits of those changes will be. Most of what I'll be describing is now in PR 49237.
D
So let's start with a few minutes of perspective: this is the way scrub scheduling is implemented today. Everything is actually driven by the OSD tick. Every tick, the OSD queries the scrub queue and looks for the first PG that is scheduled to be scrubbed. The PG holds, as part of the PG class, the scrub-related data that controls the scheduling and controls the behavior of the scrub; most important is a set of flags.
D
There
was
probably
you
know.
The
names
there
were
there
was
a
set
of
a
large
set
of
flags
like
wreck
scrub.
Mask
scrub,
must
deep
scrub
Etc,
which
controlled
when
when
the
scrub
will
be
performed
to
some
extent
and
what
type
of
scrub,
what
level
whether
it
will
be
a
shallow
or
a
deep
scrub
and
controlled
some
other
parameters.
We
talked
about
it
later
in
more
details,
Okay,
so
I
started
working
on
scrub.
D
This was one of the heavy issues that I encountered. The first step I did was to separate those flags into two groups. One group was the set of flags that controlled the next scrub, the planned scrub, and the others were made part of the PgScrubber and were affecting the currently running scrub, the current scrub session. And there is a point in time, when the scrub starts, when this set of flags is frozen.
D
Now
this
this
is
the
one
side
of
the
change
regarding
the
scrap
scheduling
a
in
the
first
phase,
nothing
was
changed
and
then
I
took
up.
I
took
out
the
scrap
queue
from
within
the
OSD
made
it
separate
entity,
or
is
it
each
of
his
dinner
owns
a
scrap
queue
but-
and
the
idea
is
mostly
the
same-
there
is
a
queue
of
on
pgs,
but
there
was
an
added
a
structure
that
was
added,
which
is
the
scrub
job
in
the
middle.
D
The main points of the existing implementation, two points to note. Take a look at what happens when trying to schedule a scrub. Like we said, the OSD, once every second, or almost every second (there is some randomness), asks the scrub queue for the topmost PG to scrub, and then makes a just-in-time decision: what level of scrub is required?
D
The level determination is based on the specific flags in the PG, that planned-scrub set we discussed, and some environment variables; for example, are we allowed to perform a deep scrub, are we allowed to perform a shallow scrub, what time it is, and so on. All these decisions are made when we initiate the scrubbing of a specific PG. Now, this by itself is a problem, because it means that we can never, or at least not always, tell the user what will happen next.
D
So, every tick, or every few ticks, we are selecting the first PG in the queue as a candidate. We then perform some validation checks: is the PG still there, is the PG still active, do we have the environment and the configuration, etc., to start scrubbing this PG? If all is okay, the PG then tries to initiate the scrub by first reserving the replicas, and this is an important step: the primary OSD requests all the replicas to assign resources, and a resource here is simply a counter within the active OSD.
D
That's the basic idea, and everything is fine unless there is an error. Now, there is a problem here: suppose a PG is the topmost in the queue and we try to scrub it, and then it fails to reserve the replicas; we'll discuss some reasons in a minute. But what will happen?
D
We will not be trying to initiate a scrub on any more PGs in that specific time frame, and because trying to reserve the replicas and being rejected might take time, might take seconds, we are "wasting", in quotation marks, more than one tick before we understand that this PG has failed in securing resources. Okay, suppose that happened. What happens the next tick? If that PG is still the topmost, we will again try to initiate the scrub on that specific PG.
D
Okay, so this is the state as it was, and mostly this is how things are today, and it's pretty good; it works pretty well, apart from some issues. It isn't perfect.
D
An OSD can have one of its resources locked for a long period of time, and during that time that OSD has reduced capacity to answer other requests for scrubbing; for example, for those PGs for which it serves as a replica, not as a primary.
D
Just to mention some of the problems: for example, if we have a group of PGs that cannot be scrubbed but are high in the queue, we are wasting effort in trying to initiate the scrub on them. Now, it's not that we are wasting a lot of CPU, but mostly, I think, it's a lot of logs that we are creating, especially in debug modes, and this is wasteful and disturbing.
D
Each OSD makes its own decisions regarding those PGs for which it is the primary, and there is no central authority which decides which PG should be scrubbed at any specific point in time. Now, we wish to keep it this way, to keep the scheduling distributed.
D
I know we have customers, Ceph users, that build their own central scrub management on top of what we are doing here, and one of my goals is to create a system which will make this central system unnecessary. I'm trying to cover most, 90 or more percent, of what might be achieved with central scheduling, while still maintaining the distributed scheduling.
D
Another is the issue of cluster-wide effects of problems. Like I said, I gave an example of what happens when one scrub on one PG, possibly because of one object, has a problem. It is an issue, and there is an issue that is not fully investigated but happens: we have clients, Ceph users, that see a long tail of PGs that are never scrubbed.
D
I have some theories about that, and even some reasons that are already known, but you can see what might happen here: suppose you have a group of PGs that, for some reason, constantly fail. If this group is large enough, it might mean that some PGs are starved and never get scrubbed. And the last two issues: observability.
D
If you see my pointer: a scrub target has an urgency, which we'll talk about in a second; a target time and a deadline, which are the same as what we have now; and a not-before time, which we'll describe in a minute. Each scrub job, which means each PG, through its PgScrubber, holds two scheduling targets: a shallow target and a deep target. And the main change is that the OSD scrub queue is now composed of those targets.
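As a rough sketch of the shape being described; the type and field names here are illustrative stand-ins, not the actual code in the PR:

    // Illustrative sketch only; names do not match the real Ceph code.
    #include <cstdint>
    #include <string>

    using utime = std::int64_t;    // stand-in for a timestamp type

    enum class scrub_level_t { shallow, deep };

    struct SchedTarget {
      scrub_level_t level;
      int urgency;        // see the urgency hierarchy discussed below
      utime not_before;   // earliest time we may try this target
      utime target_time;  // when we would like the scrub to happen
      utime deadline;     // past this, the target counts as overdue
    };

    struct ScrubJob {
      std::string pgid;     // e.g. "1.7"
      SchedTarget shallow;  // one target per scrub level
      SchedTarget deep;
    };
    // The OSD scrub queue now holds these targets (two per PG), not PGs.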
D
Okay, take a look at the left of this slide: this is how a scrub queue might look. You have entries which specify the PG and the level. So in this specific example, PG 1.7 appears twice: the shallow scrub is first, because of whatever parameters and timestamps are relevant for that PG, and the deep scrub for that specific PG will be later in the queue.
D
This change by itself helps observability, because now, when you look at the scrub queue, and we will see how it looks in the logs and in the dump commands, you can clearly see what types of scrubs are planned. And we'll see in a minute what other types of information will be available to make the scrub queue easier to parse for the user.
D
We talked about the flags. To remind you: there is a set of flags, now in two groups. One is the requested or planned next scrub, which defines how we want the next scrub on a specific PG to behave; for example, should it be a deep scrub, must it be a deep scrub, should it be a high-priority scrub, etc.
D
Among other things, it's a way to specify how fast or how urgently we want this specific scrub. And because we have two targets, we have separate urgencies for the shallow and the deep targets of each scrub job; so we can specify how urgent the shallow or the deep target is for each scrub job.
D
Okay, but we also have some other urgencies. For example, "penalized" is the replacement for the implementation of the penalized queue which we saw earlier; now we don't need it, it's just one more urgency, with some specific logic around it. In the same hierarchy of urgencies we have "overdue", which is another behavior, another piece of functionality, that is currently handled with flags. Overdue means that we are beyond the deadline for a specific target and, again, because we have two different targets, the shallow and the deep, we can track that per target.
D
We
can
do
that
and
then
we
have.
The
higher
priority
agencies
operator
requested
must,
which
is
specific
to
something
that
we
have
now,
which
mean
which
is
a
after
a
request
from
the
operator
with
referral.
Mostly.
This
is
mostly
why
we
have
mass
and
we
have
the
after
repair,
which
is
an
immediate
type
of
scrub,
of
deep
scrub.
That
is
both
of
which
is
performed
after
a
repair
in
some
instances
and
should
be,
and
should
be
scheduled
immediately
after
the
repair,
and
that
is
why
it's
given
the
highest
priority.
D
Now we don't need those flags, because the functionality is encompassed in the "must" and "after-repair" urgencies in the urgency enum that I showed you, and the same holds for the other flags that you see: none of them is needed anymore, which means that a lot of code and a lot of areas of ambiguity were removed and improved.
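A minimal sketch of such an urgency hierarchy, with illustrative names and ordering; the real enum in the PR may differ:

    // Illustrative only; ordered from lowest to highest priority.
    enum class urgency_t {
      penalized,           // replaces the old penalized queue
      periodic_regular,    // a normal periodic scrub
      overdue,             // past the target's deadline
      operator_requested,  // an operator-initiated scrub
      must,                // the old must_scrub / must_deep_scrub flags
      after_repair,        // deep scrub right after a repair; highest
    };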
D
This is what we said earlier: suppose a PG has failed, failed to start scrubbing or failed during the scrub, for whatever reason. This enables us to set a new not-before some time, say 6 or 10 seconds, in the future, with the delay depending on various factors; for example, what was the reason for the failure?
D
How many other PGs are waiting, etc. But it at least enables us to set a time and make sure that we will not retry this scheduling target before it; and, at the same time, it enables us to keep the urgency, the target time, the deadline, all those parameters that say how badly we need this scrub, while the not-before says when we will be allowed to retry it.
D
Now, a few points that I want to make if anyone will be looking at the code and trying to understand it. Let's take a look at the scrub queue. The scrub queue is ordered first by those targets that are ripe, which means their not-before has arrived; the ripe ones are sorted by urgency, then deadline, then target time, etc., and all the rest of the targets, those that are not ripe, are sorted first by their not-before.
D
A point just to mention here: the next-shallow and the next-deep. There are many instances where a change is made to a target; for example, when a scrub terminates or ends, if a configuration changes, or if the operator issues a command. In all those cases a target is modified; but which specific target is modified depends on whether we are currently scrubbing, currently using that specific target for scrubbing. To allow this, we have some kind of double buffering, with next-shallow and next-deep entries managed by the scrub job. Most of the code is not aware of this: most of the code just asks "give me the shallow target that is modifiable" and will usually get back the main shallow target; but in one case, the only case, which is if we are currently performing a shallow scrub of that specific PG, the modifiable target will be the next-shallow.
D
Okay, I don't see an option to show it, but anyway: if you look at, for example, the PG, you can see the shallow target parameters in the left columns and the deep target in the right columns, and you see a reference to the nearest of the two.
D
This is one example of a listing, and here is an example of how things might look in the logs. Look at the first line: this is how a scrub queue might appear in the log.
D
It's not ripe; the closest target is on the 7th of December, etc. And we have the closest target here; it is a shallow one. The not-before is whatever it will be; it is a periodic, regular urgency; and here are the rest of the parameters. And there is a "last issue" field, which might say why we failed the last time, if we failed scheduling this specific target.
D
Below that, it used to be the old flags, and once those flags were segregated into two sets, like I said earlier, it wasn't very easy to understand what's going on with them. Now I added a clear depiction of what the current scrub or the next scrub will be. For example, take a look at the first and second lines: PG 2.5 is currently active and clean.
D
Now, this is in the final stages of testing; I hope to merge it in the next couple of days, so this specific issue is halfway to being handled. And we had the issue of scrub resources locked by jobs that are blocked, which I mentioned earlier. What we already have, and it is a quite recent change, is a warning in the cluster log; beyond that there isn't a lot.
D
There
is
a
fpr
in
the
works
which
I
I'm
not
sure
what
will
be
I
will
be
able
to
make
get
ready
in
in
time
for
with
which
will
react
to
those
to
such
a
locked
or
would
have
blocked
for
whatever
reason,
scrubs.
A
I think the question I'm trying to get to is: you earlier mentioned that you're wasting a tick cycle, or more than one cycle, if something gets preempted, right? How is the urgency class, or whatever the enum is, solving that problem?
D
Okay, the urgency by itself does not; I mentioned that just as one of the problems, not the only problem we have. So, to give the longer version of what I'm solving here: by using the not-before, I'm solving some of the problem of the long tail that is caused by such failures. This does have an effect on what the chances of starvation of PGs will be.
D
I am hoping to use the same mechanism, the same enum, to handle other types of runtime or scrub-time failures, but yeah, the main change in solving this problem is the not-before.
D
I'm not adding data to this, okay, in one second. And before that, there is one thing, you know, which I didn't include in this presentation. Greg, regarding what you asked earlier: one of the next PRs that I'm working on was specifically regarding the issue of replica reservation.
D
I am exposing the data, because it seems that the clients want to know why things are scheduled or not; there are a few knobs that are added.
D
It is exposed in the query: wherever I can send JSON, I can add data, if you consider the query a user or operator option, okay.
D
I can't give you a full answer. The idea was just to add it, but I wasn't able to fully keep what you had in a query request.
H
Well, okay, so let me... I guess what I'm getting at is that it seems like you've added several new data structures, but the important parts, as I understand it, are that it's got the not-before field, which is used in ordering the PGs that get chosen to scrub, and that all the flags, which were mostly exclusive of each other, although I'm not sure if they all were, got collapsed into a single enum. And that's, you know, that's fine, if that makes it easier to reason about and improves the scheduling.
H
But our more advanced users have a lot of knowledge that they've, you know, developed about how scrubbing works and, as you referenced, how to control it; people have built their own scrub scheduling engines that run on outside stuff. And so if this is, at the moment, just these little things, then that's one thing; but if we're expecting to massively change the way scrubbing works, then all their learned knowledge and encoded knowledge, and their algorithmic engines, break.
D
Priority in reserving: that one is in this round, but it would be hard to get it in.
C
I jotted down some notes in the Etherpad, but if there's anything else we want to remember, here's the Etherpad to add to. I wrote a note that we should add a release note about anything that has changed meaning.
C
Thanks, Ronen. The next topic on the agenda is PID assumptions. There have been some ongoing conversations in the RADOS team about what we should assume in terms of PIDs. I don't know if there's any one person who wants to kick off the conversation, but I'll leave it up to whoever wants to start.
A
So yeah, this is just a general topic for discussion, based on something we saw in a recent case that came in. Probably Sridhar, you have the background on what was going on; maybe you can set the context and then we can open it up for discussion.
L
Yeah, sure. Can you guys see me? Yes? Yeah. So, like I mentioned, this specific issue surfaced when one of our customers was trying to run a scenario involving restarts of all the Ceph daemons. The scenario basically involved graceful and ungraceful restarts of all the Ceph daemons, and the expectation was that, after the restarts, everything comes up fine and then the cluster connectivity and all the rest happens properly.
L
But during these tests, it was noticed that a bunch of OSDs were not coming up fine: they transitioned into the booting state. As you know, the OSD essentially transitions from init to booting and finally to the active state after a restart. A bunch of OSDs were noticed to be stuck in the booting state forever, leading to the mons eventually marking all those specific OSDs down after a time period of around 900 seconds.
L
So this is the observation, and this issue was consistently reproducible in the customer's environment, so we requested QE to reproduce it locally and they were able to do that as well. Essentially, what was happening was: this is a containerized environment, and the customer was running ODF 4.1, if I'm not mistaken. From some analysis of the logs it was pretty clear what was wrong with the OSDs that were not coming up.
L
We essentially expect that the OSDs would have a PID value of one; that essentially tells Ceph that the OSD is running in a containerized environment, and then it goes ahead and generates a random 64-bit nonce value. That nonce actually helps Ceph to figure out the incarnation of an OSD that goes down and comes up. So essentially, in this case, what was happening was that for a bunch of OSDs the nonce value didn't change across reboots.
L
As a result, the OSDs that came up were marked as dup boots and essentially never really transitioned into the active state. It was through Greg's input that we finally figured out the issue with the PID values, and we essentially got back to the customer to recommend that, I think there is an environment variable called CEPH_USE_RANDOM_NONCE, either that should be enabled, or the PID of the Ceph daemon should be set to 1.
L
So
essentially
that
was
the
issue
and
right
now
we
don't
understand
how
this
PID
value
is
getting
changed
to
the
PID
of
the
OSD.
That
is
not
expected,
but
at
least
we
have
a
workaround
where
we
can
set
specific
environment.
Variable
called
refuse
random
nodes
that
should
help
the
osts
to
come
up
with
a
new
new
nonce
and
and
stuff
identity,
ideally
identify
that
the
this
is
a
new
incarnation
of
an
exist
University
and
that
should
eventually
help
the
USD
go
into
the
active
state.
A
Actually, I pasted a couple of things in the chat to provide context in terms of code, around what assumptions are present and what assumptions were broken. It seems like the assumption in Rook is that the PID is always going to be one, but that clearly was broken in this case, because of which we encountered other issues. But I think, Radek, you also had an experience of something like this in Crimson; maybe you can elaborate on that and then we can open it up.
J
Yes, I encountered this issue when porting Crimson to Rook. The main symptom was problems with a lot of misdirected, mishandled ping messages.
J
This was because of the nonce selection. Basically, one of the responsibilities of a messenger is to translate each identity into a couple of network parameters, like IP address and port, and also the nonce; we use nonces to distinguish between different instances of the same daemon, basically to detect restarts. And in Crimson the nonce selection logic was depending on the PID.
J
If I recall correctly, the assumption for containerized environments was that we randomly pick a huge integer if our PID is equal to one, in other words, if we are init; but in the Rook environment that assumption failed.
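A sketch of the selection logic being described; this is just the gist, not the exact Crimson or Ceph code:

    // Illustrative sketch of the nonce-selection assumption discussed here.
    #include <cstdint>
    #include <cstdlib>
    #include <random>
    #include <unistd.h>

    uint64_t pick_nonce()
    {
      const bool containerized = (getpid() == 1);  // the broken assumption
      const bool forced = std::getenv("CEPH_USE_RANDOM_NONCE") != nullptr;
      if (containerized || forced) {
        // Random 64-bit nonce: every restart gets a fresh identity.
        std::random_device rd;
        return (uint64_t{rd()} << 32) | rd();
      }
      // Otherwise reuse the PID; this fails when the PID repeats
      // across restarts, as happened in the Rook environment.
      return uint64_t(getpid());
    }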
A
So I guess the general idea is, with this new discovery: earlier, cephadm, in the PR that I just pasted, was always setting this environment variable, and now Rook is also doing the same. Now the question is: is there any other improvement that we can do in this area? And Greg, you had some thoughts around this, right?
H
You know, every invocation of a daemon, on, you know, respawn or whatever, gets a nonce. Once upon a time we used the PID, because, you know, in 2007 PIDs were mostly random; and then most of the daemons, except the OSDs, moved to just using a random nonce, because we started seeing cases where PIDs weren't sufficiently random and so we were getting these conflicts, which obviously come up all the time in containers now.
H
But
it
also
historically
has
happened
a
little
bit
in
just
like
you
know,
systems
where
you're
starting
a
lot
of
processes
and
the
and
the
number
of
IDs
are
constrained
or
something
and
I
I.
Don't
remember
why
we
tried
to
just
use
the
kept
trying
to
just
use
the
PID
for
the
OSD,
I,
really
I
think
it
was
just
so
that
it
was
easier
for
developers
to
map
the
like
messages.
They
were
seeing
on.
H
Other
servers
to
which
OSD
demon
they
needed
to
go,
run
GDB
on
or
something
elsewhere,
in
which
case
I
think
we
should
just
get
rid
of
the
PID
mapping
into
always
randomize
every
time
so
that
we
never
ever
see
this
again,
but
maybe
there's
some
other
constraint
I've
forgotten
about
that.
We
need
to
account
for
right.
H
I think one of these PRs that Sage did is actually where it moved into the messenger layer from the invoker, and I'm pretty sure the OSD is the only one that tries to use the PID version; the MDS for sure, and I think the monitor, are just random, and they've always been random for many, many years.
I
PIDs have pretty much one property that makes them attractive for this, and that's that the operating system goes to quite a bit of effort to make sure that it doesn't reuse PIDs for a while. That's the reason why we use PIDs. So I'm with Greg: I don't think there's any reason we actually have to do this. If we randomize in a sufficiently large space, we'll get a stronger guarantee, there won't be any possibility of failing to set the environment variable, and I've actually never used the nonce to map to a PID.
J
Hi. I found the very old gist about the case in Crimson, and one of the comments is actually about the nonce selection logic; let me pinpoint it directly. Here is the link to the comment with the nonce-selection snippets.
J
Here somebody could point out: okay, they are random, which means that there is a very, very unlikely, but still existent, case of running into nonce clashes.
I
That's actually handled when the daemon starts up: it checks the OSDMap to see if it actually grabbed the same nonce.
A
Cool. That's pretty much it; anything else on this, or do we want the next topic?
G
Hi guys, I would like to very shortly talk about an issue that we got. Basically, we encountered customers who, due to some sequence of events, ended up running multiple OSDs against the same device, and that of course led to all sorts of strange and initially difficult-to-understand corruptions.
G
It
mostly
happens
when
we
use
containerized
environment
and
the
reason
the
reason
why
it
didn't
in
general,
we
do
have
in
Blue
Store
mechanism
to
protect
against
running
objects
for
multiple
times
we
just
take
a
logs
for
each
block,
each
device
that
will
be
used
for
data
and,
in
addition,
an
extra
file
called
fsid.
G
The
problem
that
we
had
was
that
two
different
runs
of
the
container
recreated
all
that
access
files,
meaning
they
used
Seth,
Object
Store
tool
or
equivalent
to
recreate
Blue
Store
data
path,
deal
and
also
create
all
the
links
to
devices
and
the
Locking
we
were
taking
was
a
inode
based,
meaning
when
files
were
recreated,
we
no
longer
actually
were
locking
against
the
other
other
osds.
That
could
still
be
running.
G
Make
it
work
that
the
links
in
our
Blue
Store
path
were
to
block
devices
and
our
Deluxe
were
executing
were
against
a
I
notes
that
were
block
devices.
But
if
you,
if,
if
there
would
be
some
mechanism
and
I
assume
there
might
must
have
been,
and
that
will
also
recreate
a
block
device,
I
node,
then
we
got
completely
different
set
of
inodes
and
two
osds
can
run.
G
Ing
on
the
other
OSD
was
to
open
a
block
device
in
an
exclusive
mode.
There
is
a
it
seems,
a
bit
extra
implementation
in
Linux
kernel
to
handle
blog
devices
differently,
and
if
you
open
a
block
device
inode,
then
you
really
get
a
locking
for
a
device
itself,
not
only
the
inode.
Hence
the
VRI
I
created.
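A minimal sketch of that kind of exclusive open on Linux (illustrative, with error handling trimmed). On Linux, O_EXCL without O_CREAT is meaningful for block devices: the open fails with EBUSY if the device is already in use, for example mounted or held by another exclusive opener, regardless of which inode is used to reach it:

    #include <fcntl.h>
    #include <cerrno>
    #include <cstdio>

    int open_bdev_exclusive(const char* path)
    {
      // O_EXCL on a block device claims the device itself, not the inode.
      int fd = ::open(path, O_RDWR | O_EXCL);
      if (fd < 0 && errno == EBUSY) {
        std::fprintf(stderr, "%s is already in use\n", path);
      }
      return fd;  // the claim lasts until the descriptor is closed
    }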
F
Thanks, Adam. I can relate this one to a tracker as well, where we are seeing a similar issue in the ODF environment: BlueFS got corrupted because multiple Ceph pods were running against the same device.
F
They are handling it a bit differently in Rook, just to prevent multiple OSDs, sorry, multiple daemons, from using the same device.
G
Can you shortly summarize what technique is used in that PR? Because I don't have time to reread it.
F
So what they're doing, at least when I went through it yesterday: it's basically that the OSD pod uses the hostPath; hostPath, I think, means it's actually using the... what do you call it...
E
I think the fact that it is a Rook PR makes it more or less irrelevant here, because the point of this exercise is really to protect from something like this happening at the lowest layer possible, yeah, by opening the block device with O_EXCL, which is a Linux-only thing. But then we don't really care about anything else here; at least for OSDs that should be fine, so whatever Rook is doing...
E
First of all, that clearly didn't work in this case; but also, you know, it's a thing that orchestration tools come and go, things change, and those changes aren't always acked by, you know, Ceph maintainers. Sometimes they're, you know, considered to be trivial enough and no one asks; sometimes it's just that, you know, people don't understand how dire the consequences can be of, for example, disabling PID files.
E
As was the case here. So all of this really points at the need, you know, to have this implemented at the lowest layer. So whatever Rook is doing probably shouldn't be a factor here, simply because we need to make this work independent of Rook or any other orchestration tool. Yeah.
I
So have we tested this? If Linux always respects the exclusive flag, even in a container, that would be great.
I
The other thing is: when we create these containers and run them, we could be mounting the var directory we use in a mode that actually shares it between containers. That would help. I actually think we should do Ilya's suggestion anyway; I think the more ways to prevent multiple accesses to a block device the better. But we could also be doing that; it would mean PID files and all that other stuff would still work, which would be nice.
E
Yeah, well, that's actually, I think, what I commented on the PR; that was one of the questions: if the O_EXCL thing works, and it really seems to be the thing that actually works and can't be, you know, fooled around with, then do we really need the existing open file description locks, the flock thing?
E
Is it still needed? Because, you know, it's just additional code that tries the OFD lock and then, if that doesn't work, falls back to flock. And, you know, if none of that is going to actually bring any value with the O_EXCL change in, then I think we should consider getting rid of it, instead of, like you said, you know, trying all possible locks and, you know, locking everything that we can get our hands on.
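For reference, the OFD-lock-then-flock pattern being weighed looks roughly like this; a sketch of the generic mechanism, not the actual Ceph code:

    #include <fcntl.h>
    #include <sys/file.h>
    #include <cerrno>

    // Try an open-file-description (OFD) lock first; fall back to
    // flock() where F_OFD_SETLK is not supported.
    bool lock_file_exclusive(int fd)
    {
      struct flock fl{};
      fl.l_type = F_WRLCK;     // exclusive write lock
      fl.l_whence = SEEK_SET;  // whole file (l_start = l_len = 0)
      if (::fcntl(fd, F_OFD_SETLK, &fl) == 0) {
        return true;
      }
      if (errno == EINVAL) {   // no OFD lock support: fall back
        return ::flock(fd, LOCK_EX | LOCK_NB) == 0;
      }
      return false;
    }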
E
Yeah, so, I mean, we're certainly not going to... I don't think anyone is planning to run Ceph clusters on Windows, and we could research the BSD question; I think it would take just a basic grep of the FreeBSD kernel code to determine that, if it's not documented in their man pages, that is.
E
So that's a question I'm not sure about. If we say we care, then yes; but then again, the flock mechanism can be bypassed. It takes some effort, but the container tooling makes it very easy. And again, I'm not sure what the state of that is, because the container tools that we use on Linux that make this very easy are probably not supported on the BSDs these days, and that would mean that flock-based protection, you know, probably carries more weight there, because it's going to be harder to bypass.
E
Then
obviously
we
should
just
we
should
just
you
know
claim
that
it's
supported
and,
in
my
opinion,
just
dropped.
The
Vlog
stuff.
I
We don't have any unit tests that apply uniformly to all ObjectStore implementations, come to think of it; we have two ObjectStore implementations right now, BlueStore and SeaStore, plus FileStore. Right, like, these things don't come up that often; we don't need to be that prophylactic about it.
E
Well, I think we agreed that if we care about potentially supporting FreeBSD, then we need to check whether O_EXCL for block devices is a thing there. Because, you mentioned... like, I'm not sure, did you check the man pages? For some reason I always thought this was a Linux-only addition, because...
A
Yeah, but people are working around the issues they're having with BlueStore, so with FileStore deprecation on the horizon they might want to use other stores, so we should cater for that case as well, right?
E
Yeah. And just to clarify: I was specifically commenting on the KernelDevice implementation, with this O_EXCL PR doing two types of locking, and that seems to be not desirable to me. So leaving just O_EXCL there is probably the way to go; but that doesn't say that we can't continue flocking one of the metadata files, like, for example, whatever it is, the fsid or something else in, you know, the OSD directory, and, you know, have that be common for any I/O stack with a user-space block device, essentially. So my comment only extended to the implementation of the block device interface, not to BlueStore, not to the ObjectStore implementation. The ObjectStore implementation is free to use whatever locking mechanism, to lock whatever metadata file it sees fit, in my opinion.
G
Okay, you mean you would like to have locking be a separate thing from the block device implementation?
E
You know, two locks on the same block device, right, on the same thing; so that seemed excessive to me. But an ObjectStore implementation could still lock, for example, an fsid file or a superblock file, which is not a block device, with conventional locking mechanisms, which are portable and supported, you know, just generally on POSIX.
E
The
one
I'm
aware
of
is
the
persistent
client-side
cache
in
in
RBD,
so
it
has
two
two
backends
it
can.
It
can
use
pmem
device
like
a
like
a
pmem
card
or
just
a
standard
log
device.
So
with
the
intentive
you
know
in
being
an
SSD.
So
that's
that's
a
second
user.
There
might
be
a
third
one,
I'm,
not
sure.
E
I think it actually uses locking already, right? Because the lock_exclusive flag, or, it's not actually a flag, it's a field in the BlockDevice class, or maybe in the KernelDevice implementation, I'm not sure; but it's a member field and it defaults to true, and any open method, like when you call BlockDevice create and then block device open...
E
If
you,
if
you
don't
do
anything
in
between
then
open,
would
do
an
exclusive
open
by
default.
So
this
means
that
this
second
user,
this
SSD
cache
already
uses
exclus.
You
know
exclusive
mode.
It
just
doesn't
rely
on
it
because
there
is
higher
level
locking
within
lib
IBD
to
the
best
of
my
knowledge.
The
only
reason
to
request
a
non-exclusive
open
is
to
actually
call
a
method
in
between
create
and
open.
There
is
something
named
said:
no
exclusive
lock.
E
So
that's
an
actual
method
that
that
you
call
and
that
method
onsets
the
the
member
field.
So
it's
that's.
You
know
it
changes
it
from
True
to
false.
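So the sequence under discussion looks roughly like this; a self-contained sketch assembled from the description above, where "BlockDevice" is a stand-in and the exact names should be treated as approximate:

    #include <fcntl.h>
    #include <string>

    // Stand-in sketch, not the real Ceph BlockDevice class.
    struct BlockDevice {
      bool lock_exclusive = true;              // member defaults to true
      void set_no_exclusive_lock() { lock_exclusive = false; }
      int open(const std::string& path) {
        int flags = O_RDWR | (lock_exclusive ? O_EXCL : 0);
        return ::open(path.c_str(), flags);    // exclusive unless opted out
      }
    };

    // BlueFS-style path: two handles share one device, so one handle
    // opts out of exclusivity before opening:
    //   BlockDevice bdev;
    //   bdev.set_no_exclusive_lock();
    //   bdev.open("/dev/sdX");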
G
The reason why we have set_no_exclusive_lock is that we have two components in BlueStore: one is the BlueStore core that serves object data, and the companion BlueFS that keeps the metadata, and they partially use the same device, in cooperation, but through different block device handles, for architectural simplicity reasons. That's why we have that weird open in non-exclusive mode.
E
But that's exactly what I'm saying: we don't need that. We're probably, like, 99 percent sure that we don't need the existing locking, and we can just switch to O_EXCL locking and not expose this implementation detail at all; the BlueFS code would still invoke this set_no_exclusive_lock method when it needs it. But that was actually another concern that I brought up in your PR: are we sure? Because, like, I didn't follow the entire call chain, but lock_exclusive is unset, so set_no_exclusive_lock is called, in _minimal_open_bluefs.
E
About
that
yeah,
so
that
method
is
called
through
from
from
you
know,
from
a
bunch
of
places
and
I
just
wanted
to
you
know
you
or
someone
else
to
verify
that,
like
all
those
like
that
there
isn't
a
case
there,
where
you
know
that
that
is,
you
know
where
things
could
go
wrong
and
we
could
still
end
up
with
the
same
old
device
opened
open
twice
in
in
you
know,.
E
So
in
in
the
end
of
the
day,
one
of
these
opens
needs
to
be
exclusive
and
the
other
then
can
be
non-exclusive.
E
That
would
work
for
the
case
of
you
know.
Opening
the
blog
device
in
you
know
two
different
ways.
As
long
as
one
of
them
is
is
exclusive,
then
we
are
protected
and
and
I
just
wanted
us
to
verify
that
that
is
the
case
in
in
all
cases,
and
there
isn't
the
corner
case
where
we
we
are.
You
know,
where
said,
no
exclusive
luck
is
called
on.
All
you
know
on
all
the
ways
that
the
blog
device
has
opened
for
that
particular
OSD.
E
Okay, so it sounds like the O_EXCL change kind of stands on its own, but there would be an additional PR needed to fix this _minimal_open_bluefs stuff, right?
G
Yes, possibly in two PRs, because we should also maybe fix the previous versions; I don't know if we want to backport the same change or just fix the tooling. I would be mostly for replacing the existing flocks with O_EXCL; I don't really want to track another customer's problems related to suddenly running multiple OSDs on the same data set.
E
Right
but
once
again,
I
mean
going
back
to
my
point,
that
and
and
to
what
you
seem
to
be
confirmed,
that
there
are
cases
where
the
existing
the
existing
support
for
exclusivity,
whether
it
works
in
all
cases
or
not
like
whether
the
container
environment
can
fool
it
or
not.
E
So
the
change
to
or
exclusive
is
not
going
to
take
care
of
that
because
you
would
still
not
be
using
all
exclusive
in
those
places
yeah.
So
that's
what
I'm
saying
it's
either
two
different
pris
or
at
least
two
different
commits,
because
these
are
clearly
different.
Different
changes
right,
one
is
a
change
of
the
underlying
login
mechanism
and
the
other
is
using
the
locking
mechanism
in
more
places.
A
All right, we're already out of time. Any last thoughts on this topic?