Kubernetes SIG Cluster Lifecycle, 30 Apr 2020

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: 2020-04-30 - Cluster API remediation meeting

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

Okay, today is April 30th 2020, and we are going to be talking about a few proposals around remediation for cluster API. Take it away.

B

Okay, so so there's basically two two proposals that were submitted: the reboot remediation one is the more more recent one from was spinned of fire on through from Red, Hat and I. Guess, I, don't know any or if you were, how involved you were with that one, and then there is the the second one, that is the external remediation controller, that the idea is basically to move move the part of the remediation part. That is at the moment, just deleting machines from the machine health check in copy into an external provider remediation controller.

B

So we would like to well discuss both if we have time but probably focus first on the on the first one, because it's much more straightforward and like less things to change with regard to the with a current the current states in in in cluster API um I, don't know if you want to take over now file for this for this proposal.

A

Would work for y'all I think it would be helpful for me to hear your use cases and some of the problems you're trying to solve and try and tackle it that way first, instead of or before we get to anything about implementation details. If that makes sense.

C

Sure so, basically we are coming from bare metal world where we have bare metal servers which are which reprovision them I deleted. The machine means that we need to reprovision it and the usually takes a lot of time could be hours. We need to download, maybe some images etc so felt like a cloud provider where you delete the machine and we provision and you want in just a few seconds or minutes, so we would like to avoid machine deletion, it's possible.

C

So at first we would like to try maybe to reboot the O's. Maybe a reboot will be quicker and maybe it will resolve the transient issue that led to the fail to the health check fail and if it doesn't, maybe then we might want to reprovision it again. But the main motivation for us is to have to start with a reboot in order to save valuable time in transient errors where we want to remediate the machine. Does that make sense? It.

A

D

B

Were so maybe like considering a bit um some kind of escalation path, you know that you would try like steps. That would be like stronger remediation every time like so, for example, you would try a couple of times to reboot, maybe like once or twice, and then, if it's not successful, then you would take a stronger action that would be provisioning and if it still doesn't work, then maybe a machine deletion deletion. That means that you probably end up on another host and it could like fix the problem that you're hitting.

B

If it's a hardware issue.

A

Ok, that makes sense too.

B

So basically, like wait, wait, wait. The first goal that we're trying to achieve is is just just introduce something that is less radical than machine deletion, like that's the. What is behind this, this first proposal about the this reboot remediation, maybe not completely like taking out the taking the remediation part out in its own provider, but at least allowing for another way to another way to try to to to resolve the problem that we're having with with the machine.

B

So the the first proposal would be like, like kind of a first step, hopefully into the direction of having something even more modular than, and then I think this reboot option as a remediation.

A

Okay, so do you want to give a high-level overview of how you might implement doing reboot.

C

Okay, so, first, first of all, we are using metal cube cluster, where we use ipmi protocol to talk with the hardware and power it off and power it on. We are connecting the using the management interface of the hardware and we would like to first the flow. It basically goes like this machine is being detected to be unhealthy. We want to power it off.

C

Then we delete the node from the cluster and then we power it back on and the reason for this flow is mainly. We would like to delete the node only after we power off. They know power of the host.

C

Since, if we wouldn't do this, we might end up with corruption where we delete the node from the cluster. But maybe this node is still running somewhere and mineral. The scheduler assigned this workload to something else, some other node and we might end up with corruption for stateful apps that have run only once one one instance, so I.

A

Just want to make sure I heard you right, you want to delete the node from kubernetes as part of this process. Yes, is that because you've detected or we've detected, it's unhealthy and you don't want anything scheduled to it.

C

Yes, we first saw that the machine has two controller detected that it's unhealthy. Then we power off the host and that this is our way to make sure that the node is really down. It's not just hon LC and maybe still writing data into some storage or something else. So only after we power Yost off, and we know that it's off, we can safely assume that we can delete the node from the cluster.

A

Certain sorry I'm just a little confused when you talk about power off I thought we you were going for reboot. Are you like what happens if you do a power off and then you delete the node? What happens next.

C

After we power off and delete the node, we power it back on and then the node registers itself into the cluster.

C

Okay, so it's not a real reboot I mean it's more like a power cycle and that's exactly the comment I gave there to Andrew, which, if it seemed the right side of the screen, it might be confusing to call this a reboot. It's actually a power cycle.

B

There was a comment also from Michael.

A

Okay, so if we were to do this remediation strategy on the MHC, then we'd have to add some way to trigger a reboot through the machines machine, SPAC right right.

B

So the idea would be that we could not put aside of like the five putting annotation as the node is unhealthy, like putting a while something that identifies that this nodes needs to go through through the reboot, and that would be picked up by the by the infrastructure provider.

E

Would this be provider-specific so that if you were also having AI as clusters, I'm HC wouldn't treat a virtual machine the same way.

B

um No because the the the actual power cycling of the of the of the infrastructure machine is depending on the infrastructure provider so like. If we are talking about a metal cube node for example, then we will use IPMI to shut it off, but it will probably be a different procedure so to do that on any other infrastructure provider. Yeah.

E

I think this is one I thought there is that I, as providers will have a VM they'll, have their own fault, tolerant capabilities and services running, and it this might cause conflict if we're treating bare metal and IBM the same way, hey I, guess I, don't know how its implemented. It's just curious to me if this could be done where it's provider-specific. So if it's bare metal, you would maybe do this. If it's a BM, you would not do this. You would rely on the services of biiss hypervisor.

B

Exactly that's why there would be this new field saying what remediation strategy you would want to use okay and there's a comment or question from Michael. Yes,.

F

It was close because it's an issue basically along this line. Some things support reboot some things or power. Cycling, maybe not vice versa, and the thing I came up with was the infrastructure provider provides a list of capabilities, those get synced to the Machine and then well-behaved clients can see. Ok. What capabilities is this infrastructure provider support and then I'm going to make some change the machine and that will get carried out down the line so I can link today.

B

Sorry, when you're talking about the client, what plan do you mean then any.

F

Client, it could be the machine help checker. It could be something one idea: I had was like a snapshot controller right. So if you want to take a snapshot of it, since you need to know, if you can power down on a WS right most standard instances, that's that's pretty easy. They can be stopped started, however, on spies. This is they don't support stopping, and so that would not be a valid behavior, and so that would be up to the infrastructure provider to determine based on this individual machine.

F

What am I actually allowed to do as far as like power cycling goes.

B

Okay, so you would want to have a way for a provider to explicitly like advertise what kind of capabilities it has.

F

Yes, and in that way, your remediation control, like a snapshot thing, can look at those fields and say: am I allowed to do this action that I want to do or in the case of the Machine health check controller, it might have different logic for some platforms, support native reboot and some platforms only support power on power off. It can ingest that and then you know, make a intelligent decision.

B

Okay, so this would mean that we would be basically removing this new field like this remediation strategy, things so that it's actually an automatic decision made by the machine has checked, based on the capabilities of the.

C

F

F

Maybe some instances are like integrated with some kind of storage platform and I I want to give those a good, old-fashioned, reboot first, so, for whatever reason yeah this, this would still be there. It would just be informing the client like if the Machine, how checker says I want to this needs to be rebooted, but it doesn't support reboot well, I'm, not gonna, try to go through the steps and reboot this thing: cuz, it's not going to work I'm going to maybe throw an error of some kind or whatever.

B

Okay, that was you raised the hand.

G

Yeah but my point is no longer relevant because Michaels already said it.

G

A

Do think that I know that, like without discussing or debating the merits of adding power states to machines, I think you probably could get the same type of behavior that you're interested in. If we did external remediation because then you could have separate code that is managing the power states of your infrastructure machines right.

B

Yes, exactly the the, but the point of this PR was that it's kind of an intermediary step to in that direction that, because it's, it would be like more probably more consensual, to agree on something like just adding a reboot, then completely taking out the whole remediation Italy into its own, like provider.

B

G

So there's a couple of things that just came to mind, one of which is that I believe at the top of this proposal. It actually says that the power states proposal is a prerequisite for this.

G

You mean yes, the first non-goal there.

G

So the conversation there about whether we actually need to do that to implement this, we would Eaton.

G

The second point that I was thinking about is whether rebooting and the logic of rebooting should actually be part of the Machine health checker itself, as sort of Andy was saying like this, doesn't necessarily need to be in the machine health checker and could be.

G

External I was wondering whether a reboot could be something that a user may just want to do on a machine or some other controller other than the machine health checker might want to do to a machine for some reason so like actually having some way to request a reboot for a machine that is external from hc1 means that the machine health checker doesn't need to know anything about the machine, but two means it could be reused elsewhere by other people as well. So that might be an alternative that could be added to this and investigated.

B

F

One of the issue, proposals that are transformed into a G doc kind of covers exactly what your de'cine Joel, where the actual functioning of how the reboot works and how it handled that is kind of like contained in the machine layer rather than putting the logic on the machine held sugar.

F

So you basically need to come up with some kind of API that says: I want to reboot this thing and have I already Requested that so I'm not just constantly hammering it, but there's kind of a disconnect between requesting and reboot getting a reboot, especially there's multiple actors in place. And, yes, we also want to add the ability for eight end user to potentially reboot a machine or some other.

F

A

What about the fact that not all infrastructure providers support the same set of power states I mean you mentioned spot instances, don't necessarily, and there may be other providers that have mismatches. Are we potentially opening up the door for this to be a bit messy just because people could get confused with what's available and what's not as it pertains to remediation and power states? Well,.

F

That's the infrastructure capabilities things. I was talking about earlier, where the infrastructure provider determines what can I actually do with this machine and I'm going to copy those capabilities to the machine object, or rather the machine object copies? Then you know status field synching like we're doing with some of the other things and I have a PR out for this. Actually in so this way, well-behaved clients and administrators can refer to that status. Feel of a machine like what can I actually do it. This does it support reboot that.

D

Like what requires cluster API would still need to have like the definitive list of all the possible remediations. Yes,.

F

Exactly so it it's only gonna support capabilities that we know about and can advertise I mean we could I mean we could open it up broadly and just sync capabilities from the infrastructure and not care what those capabilities are called and then third-party actors, if they happen to know what one of those capabilities is, they can assume it. The PR that I have out right now is using constants, so the power off power on kind of stuff. So it is a finite set, but from a technical point of view, there's nothing really stopping infinite set.

F

It just might start to kind of get in the weeds about trying to control the behavior of the machine controller itself, whereas it's gonna have to know about some of those capabilities, particularly around like power and stuff, like that, depending on how we're actually going about setting the power state. If that becomes part of the spec, then the Machine controller needs to know something about a power off/on capability, so it can try to help validate or actually enforce that state.

H

So how do we deal with potential conflicts between different components and their expectations around these behaviors, for example, you know looking at a cloud provider for AWS, we can easily implement. You know additional power state it's there, but we would also be conflicting with the cloud controller manager and behaviors that it implements. So how would we ensure that, as we're adding these features that we don't end up with? You know conflicting services trying to take actions against the cluster.

F

Well, I'll respond to that. This is actually something I brought up yesterday at the meeting, I've got an upstream cap called node maintenance lease. So when you have multiple actors that are not necessarily related to cluster API that want to do disruptive things to a node that we're putting some information, that's tied to the node, rather the Machine saying, hey I'm got control this node right now, I'm going to disrupt it in some way or I'm, going to attempt it from being disrupted and we're all coordinating on that.

F

So if two components want to do some kind of power state management, the first thing they should do is check for that maintenance lease to make sure that somebody else isn't actively working on something in our particular use case. If we're talking about say the Machine health checks with the administrator wants to take the Machine down for maintenance, then we need to have some way to coordinate those things. We don't want to take the Machine down for maintenance by an admin saying hey power off, while at the same time the Machine.

F

How chugger says hey what happened? This node I'm gonna delete it right. We don't want those two things to happen, so we need some way to coordinate that action already, but what better place to do it, then, at the node that way, other things unrelated to cluster API can be part of the same extraction. I.

H

Don't disagree there. The only complication is, is that that doesn't exist upstream yet, and our support kind of contract, around management clusters and workload clusters are such that that wouldn't necessarily be in place. Yet so do we make that a requirement and a new minimum version requirement on either the management cluster of the workload cluster is needed to support that? Well,.

F

There's a couple different ways that we can go about doing this, so the input that I received from Clayton was I originally just want to make this inner patient. That's always he came up with using these lease objects. That's already in primitive in kubernetes for a while. So it's like a core core type, so we wouldn't necessarily have to wait for a release of kubernetes. We could do one of those things of exists or create if we want it to roll out the feature ahead of time.

F

I mean a lot of that kind of depends on what her out upstream wants to take this it's been kind of stagnant, so I'm totally up to this is or this is the direction I think overall, it needs to go I, don't really care what implementation.

A

I'm sorry I did want to go back to I. Believe you, you all said that one remediation strategy might be a bit evolving, so it's try to reboot a couple times and then, if that doesn't seem to work, then reprovision the escalation path that you mentioned.

A

I feel like that's something that, if you're interested in doing that, that we could come up with a POC for doing external remediation, like figure out what the API the minimal set of API changes, we need so the other proposal that you've got in another tab and start with that and see if we can get it working and then we can continue to discuss.

A

Well, that's a lot of straight text. Can we can continue to discuss machine states over time for power? Would that be palatable to y'all.

B

Yeah, so so you would basically rather like advocate for like going with this fully decoupled approach of having the external limitation as a proof of concept, and then we would take the discussion from there further. Yes,.

A

I think that would.

A

Hopefully be a little bit easier on the MHC API changes side and then y'all would have full flexibility to go implement remediation in whatever way you want to experiment with.

B

Sure and then we could take care of the integration with the maintenance these once it's ready on them on the community side,.

A

Yeah I think this is a hopefully a faster path to allowing you to get this remediation in place for metal 3, because I think it'll potentially take longer to try and get everybody to agree on machine capabilities and power states. Let's just my my gut feeling at least.

B

Knew you had raised your hand.

C

Yeah so I just wanted to say that even if we are going with this new CRV with this external remediation, but it's still not.

C

Not very complete because we still need to assume several stuff on machine health check. For example, after we power off power on the host we need to, or maybe we need to first agree when we need to remove the MHC annotation.

C

If we, if we, if we remove it too early machine health check, might we annotate it again and we might end up with some remediation loop, yeah.

A

I haven't had a chance to add this comment yet, but I'll say it here: I do think that we could, with external remediation, when the external actor has done its work, it could go back to the MHC and annotate it to say. I finished my work, please reevaluate and then the MHC II could go. Make that determination again to reprocess it rather than putting the burden on the external or mediator to validate that it's healthy and then remove the annotation and hope that MHC doesn't read it it's bad, because the timing is slightly off.

B

That could even be independent of the remediation, like that the Machine health check would just put the annotation and leave it there until the node is healthy again, then it would be on the remediation controller side to like half the proper timeouts in place. Saying like I, expect that, like I, don't know like five minutes after rebooting, the Machine, the annotation or whatever should be gone right. If it's still there, it means that might step failed and I should then and like take a further. So the step, probably.

A

Yeah I think like I need some more time to go through and think through the use cases or the you know, possible race conditions or order of operations issues, but but that's at least what I've been thinking about. I do see Michael until have their hands up as well. Yeah.

F

I worked through some of this kind of stuff for what I was saying about the reboot proposal and I think some some of the logic could be applied on external geishas CRD about trying to keep the MHC from fighting itself and over mediating stuff, because that revolves around doing stuff on the machine object and the Machine home controller tracking that, but this, if we're talking about external remediation, this is mostly pretty much the same concept with some other object.

F

Machine help checkers looking at and somebody else is updating it with information need to know about so I think I kind of maybe solve that a little bit. If you take sorry.

A

Which part specifically.

F

The part where how do we coordinate when the machine health check is done and making sure that we're not really mediating something? That's still in the process of coming up like that race area, there.

G

If I'm a chip in there, I was thinking about this problem earlier, cuz I've had a few conversations this week with various different people about this. The conclusion that I came to is perhaps the external remediated reboot II thing II should leave the annotation on the machine that says this needs a reboot until either some timeout happens say it thinks it turned the machine back on 15 minutes ago.

G

It then says: okay, this probably isn't great, or it detects that the load has come back, because once the node has come back and re-register the machine health check should be. It was redo. It's like proper health checking and you would assume it's either gonna be healthy or not. The conditions on it will be relevant again, because the time stamps would have just been updated because it's a new node so like as Nia, was saying a few minutes ago like if we can keep the annotation on there until, like the the node comes back.

G

Maybe that's one way of doing that.

B

Yeah related to this, we could also maybe I just leave it up to the to the health check to actually remove this annotation, because the health check is the one that actually at the end knows if that node is healthy or not.

B

So like whether the the machine remediation takes steps to remediate or not as long as the machine has check, finds the node unhealthy, it would leave the annotation there and then, if the machine has health check figures out that the node is now healthy again, it would then remove this annotation. Hence the the Machine remediation controller would know that the remediation actually succeeded.

B

A

That's in line with what I was suggesting. The only slight amendment was that we would have a different annotation that you would add, or that the the remediate or would add to tell MHC go check this node right now, because I think I'm done.

A

So it's all it's the same flow just trying to speed it up, because.

B

They're, basically having a to waste end of communication between the MHC and them MRC.

A

B

Really just it's.

A

An informative annotation for the remediated to say, okay, health check, go check. This note again.

B

Yes- and that was part of actually have the problem- well, I'm. Sorry I have no idea. Why, like all this text was like strikethrough, but initially the idea was like to to try to put on to have a status like put to Don one way or another so that to be able to give the the status back. But there's two people who raised the hand Michael first.

F

G

So I was gonna. Add the I'm not sure a a two-way communication would actually be necessary because the Machine Hale checker watches for events on nodes and machines and you'd expect that when it's becoming healthy again, one of those would trigger a reconciliation for the Machine of check. Anyway. Good point.

C

Yeah, but that's downstream, the machine health check doesn't remove the annual sanitation, even if you know comes back online.

C

A

A

Yeah we'd have the best the logic to have it. Remove the annotation.

G

Joy and I've added covenant sign.

A

So I know like if we proceed with this external remediation idea. Y'all had proposed a separate CRD machine, remediation requests or just machine or remediation.

A

Do you all feel strongly about having a CR D versus just using annotations on the MHC to do all that.

B

Well, the the reason we we proposed, the CID at first is well from our perspective. It kind of looks a bit cleaner with an an actual API that we could validate and and make sure that everything flows. What we are expecting and also like it would. It would allow us to have the same principles as there is now for the infrastructure provider, for example, that we would have the.

A

Jaw lose audio.

A

A

Yeah I don't know that I have a super strong opinion on annotations versus another CRD.

G

Tool if I think the the argument for me would be. If this is a temporary thing, then annotations is probably easier because it's easy to duplicate, whereas if this is going to be a permanent thing, then sure CIT seems like a sensible approach.

A

Yeah I think like annotations, if they influence behavior, which pretty much most of all all of them do, they tend to be considered as api's, just like full CR, DS or spec fields, so yeah. If we want to do another CR D. For this thing, we can.

B

I'm, sorry, would it be, then, like does it matter whether like, if we have multiple external remediation provider under the hood like they, so they would all have to be watching the the same C or D. If we have CR D, so it seems that may be clearer contract than having them watch. Another object just to check the annotations.

B

Maybe that's not a valid point. Obviously I.

A

Don't know I am curious and does this this is document talked about if there are multiple remediation providers, how the system knows which one to run I.

C

Think there were some comments on this may be raised by Joe I. Don't remember where we say that we might have their own air reference or something like this, but the document itself doesn't say anything about multiple, any remediation controller. As far as I can tell.

B

D

B

Would be like do we want to take a model like, for example, for bootstrap to provide away? We have like we expect only one in the cluster right. The.

A

Bootstrap provider is in practice. You'll probably only have one that you're using per cluster, but because the bootstrap provider or configuration details is a reference to another resource in the namespace. You theoretically could use different routes, bootstrap providers for a single cluster, so.

B

We could have a kind of a similar approach if we put in the in the mission has check a CR that is created for the machine has check. We can link to, for example, a template for a provider specific CR D, so that then the remediation CRD that would be created would be directly the the external remediation provider, C Adi.

A

Yeah we could go that route if there's a need to have provider specific information in a remediation request. Oh sorry, I was gonna say. Alternatively, if we don't need something, that's provider specific, we could have a generic machine, remediation, CR D, and if the assumption is that infrastructure providers are implementing remediation, then they could all watch all the Machine remediation custom resources and only act on the ones that are for machines that they are are owning and provisioning.

B

Internally, in the like, what we are planning for metal cube like we are anyways like planning to, even if we use annotation from the mission, has checked to transfer the information towards the external remediation. We are anyway spending in like to have CRT. That would be specific for our external remediation controller. So if we can make this like more general in the way that it would be like kind of a template, let's say that we would have, as we have now a cube.

B

A DM config template right that then the Machine ethic health check could directly generate the the machine remediation CR, the CR. Sorry, then we would be able to skip the skip the the step of the annotation and store all the states that we need in internally in this reconciled reconciliation provider in that CR D. And if we have a clear contract, then the machine has check would be able to fetch from the status whatever field is needed. The same way that the machine controller do with the with the infrastructure provider, machine I.

A

Like that idea, so you'd have in the MHC spec you could have an optional reference to a remediation template exactly.

B

A

That would be a completely separate resource that a user could create if they're doing external radiation, it would be infrastructure provider specific and if you don't fill in a reference to that in your MHC, then it'll just use the deviation flow. Yeah I like that.

A

Ben or Jason or Vince you'll have any and dear it said. You'll have any thoughts on our end.

I

I mean plus one to get templating in place. This is what kind of what we discussed like last week as well. I had like an infrastructure provider, specific C or the the like then could be created on demand. This would like effect. We give it a way to for external remediation, but users can attach any object to to a machine health check resource and then the whole lifecycle to go from unhealthy to either back healthy or deletion would be in the hands of the external controller.

A

And then for for things like the Machine set controller and the cube idiom control plane controller if they come across an unhealthy machine and there's external remediation I, guess they from a code standpoint, what are we putting in there to make sure that there's some sort of short-circuits so that it keeps on waiting until the remediation happens?

A

Is that something you all discussed ants.

I

We have, we haven't, discussed sorts short-circuit, but I mean if we I guess like. If we create like a new remediation request, we can potentially like the final contract that, like I, don't know, maybe says it's a like a ready state like something well I already have today, so that, like we can understand if, like the new object or we create a sobriety or not, and if it's ready, we can reevaluate yeah we'll need to also add waters and things like that. That's up right, yeah,.

A

I think it's maybe fairly simple just to say if you are a controller, that's managing multiple replicas, which applies to kcp and machine set and you're evaluating your machines. You look to see if one of the machines is annotated as unhealthy and if you encounter that you just well I guess if it's doing the delete remediation, the controller would do that itself. Right, yes, like.

D

Why not model deletion is just like the default external remediation. Well, I I have.

I

I think like we should offer still the deletion as the default case like the regardless, like we can, just like, add a new annotation that says like the strategies, external, don't do anything and then measure the machine set or kcp or whoever else like looks at piece as we're saying, and then that just looks to see if, like the strategy, is either empty or it's not there, and if it's anything else like that, you won't do anything. It would just expect something else to take care of it.

I

Does that make sense like or I mean I can rephrase I think.

A

So, and to answer Ben's question when I was thinking about this, the other day I had I think I ran into a problem where, if we have some other controller, that's handling deletion, remediation, I think for machine said it's probably fine, but for kcp there I remember was with the EDD management, I think yeah I think it was.

A

Basically we needed some pre delete, behavior, which ties into what Michael is asking for it, and one of the other proposals where we needed to be able to manipulate @cd before we delete and if the only the only actors involve our MHC marking it as unhealthy and delete controller deleting it as the remediation, then there's no opportunity for kcp to easily remove it from from NCD, or maybe there is that I didn't come up with it. Well.

D

I think what we're already seeing we've been trying to like work through the Casey peace story and like that, it's I.

I

D

Think it's ever gonna be as simple as like just delete well, even with just like just react to anything with simple scale down and removed from sed logic, and so maybe like what it means. It's like deletion strategy only makes sense on machine sets and, like you know, Casey Key has like its own remediation strategy.

A

Which you can also achieve, if you just say, if you're an owning controller- and there is no external remediation, then you go do what you need to do. I see a couple hands up. I also want to be aware of the time we've got about. Seven minutes left so Michael and then Fabrizio. Yes,.

F

Dearest friends, this is yet another proposal that I talked about yesterday. Maschine lifecycle hooks, so something deletes the whatever is in charge. He places that hook on top of delete, so they can do it's fun entity stuff. It's done, oops that hook delete continues, is normal.

A

Yeah I need some more time any more time of the day, but I want to get back to that proposal as well, and look at some of the comments that come in, but plus one on the general idea. Definitely the.

J

Ratio thing that, in my opinion, the deletion should be or is in charge of the only controller, so it should not be considered as a strategy as a remediation strategy. So basically the default is there is no remediation and we kick into the dish. Why reboot or whatever our elimination strategies probably is distinction.

I

Deletion is a strategy, though it's the default one about a business strategy. Yeah.

A

It's just implemented by the controller in yes, so.

I

I had one other comment like on pkg P side of things like maybe a mistis like, but like you mentioned, like that like case, if he might not have the ability to go and do these stuff from my TV right. But that's what was one one thing that you were worried about ending well.

A

That that was in the situation where you had a separate controller that was watching for unhealthy machines and just blindly issuing deletes on them.

I

Oh I see the way I think we should probably keep the same remediation strategy. It's like the owning in the case of like external remediation. If the machine is live externally, the okay CP is able to react to that and delete that machine from Exedy today, but you want to scale down, but it will be able to reconcile the HDD members.

A

Right so I see your comment about. If hooks existed, then we wouldn't need any external or mediation controller. The I think the one issue there is that you like we had. We hooks, are like pre drain and pre delete which both in the current flow require that you hit delete or that you delete a machine and that it's it's you know pending deletion, so I'm, not sure that hooks would strictly solve the problem.

A

G

That was I, so hate finish before I meant to that. No, it was more that the external remediation controller and the MHC wouldn't need to know what's running on the machines. They wouldn't care, if it's like a case EP or if it's got some storage or something like our oh yeah, like that's quite a nice benefit of doing the hook thing is that we we just need to come up with some mechanism, but allows these hooks to run before the remediation controller does its remediation like for deletion. That's easy!

G

Whatever the signal is that says, these hooks are still pending just blocks when the finalizer is like spinning for the reboot I'm, not sure how that would look. Yeah.

A

It's worth thinking about, Fabrizio is your hand still up from before, or do you have another comment, no from before sorry, okay near and then Michael, and then, let's see if we can wrap up with some action items, yeah.

C

Thanks so I just want to make sure I understand, so can anyone clarifies wha? Why are we focus focus on the CR D suggestion and not the first suggestion that is using the hooks and without any CRD? Can you repeat the reasons for that.

A

So I buy the first one I think you said hooks, you mean the power states, yes, yeah.

C

A

Think that I think the external remediation is going to end up being more powerful and give you the escalation path. That I know it's listed as a non gold, but I think if we go and just try and do the power based recovery that we need to do the power primitives, we need to add them to the Machine API, which I think is going to be a harder sell than and getting this work done with an external remediate.

C

A

F

I would just say my impression if we're rebooting a host as a remediation strategy is that that should happen quickly and we shouldn't really need to do a whole bunch of hand-holding. Whatever it's running on the host, because, like you know what, if it's already.

I

Locked on what, if it's.

F

Arcing down right, we think we're aiming to bring it back in the same exact configuration that it's already in so I, don't really see a need other than draining user workloads. You know it doesn't reboot it in so I, don't see where all the hooks are necessary for that. We could definitely come up with something like that, but that's my impression.

B

If I may to react to this, the idea of like doing it with an external, an external controller, would be that we are not only able to just say, like yeah reboot, but then also see that if the reboot failed, then we would go to a like. We actually would have a path defined by the controller on the actions that we would yeah. We would take yes.

A

All right so we're at time. Are you all? Okay with the idea that we proceed with the external remediation option for now and we'll flesh out, you know, continue to review and and get the proposal in a state where we hopefully can get consensus, and then folks can begin implementing.

B

Do you want to a POC on this first.

A

If you'll want to do a POC I think that's could be helpful. As I said earlier. I I personally need some time to go back through the external remediation proposal from top to bottom and review all the comments and see if there's anything that looks like a gap but yeah I mean Peter sees fine.

B

Okay, then, we'll probably like discuss this a bit internally, like with metal, cube forks that are involved with this this proposal and update the proposal itself and then see about the POC. If we go with that first or if we like focus on the proposal. First.

A

Okay, yeah and will will definitely spend some time looking at the proposal again and providing more feedback.

I

Just just to clarify Michaels like comment, though, like the lead doesn't, is not part of image, see the least will be part of the controller owner of an object of a machine if that make sense, mhm see only on the dates going forward.

B

Sorry now I'm the one unclear with what you mean.

I

So Michael said like as long as cord the lead stays inside image, C and externally. Soft team I'm, okay with it so MHC, will annotate as the proposal that merged and the deletion will be done from the owner of that machine external will be still be opted in, though,.

F

Absolute certainty, but if a machine is deleted, the machine controller did not do it and I've run into this with support like oh you're, a controller delete this machine without somebody telling it to and having this thing delete itself now is putting us in a scenario where we no longer have any bookkeeping.

F

We did what so.

I

It's not the machine controller that deletes itself, but it's the owner so like, for example, machines had in case you'd, be that deletes the machine. So this is to allow, for example, image image. She, because he'd been to remediate that machine properly by doing scale up before scale down or we'll be moving from a TV, and things like that.

A

I mean it's similar to with the machine set controller. For example, if you're doing a scale down the Machine set controller is gonna, go delete, a machine I realize it's, it's a healthy machine, most likely, but it's similar in nature.

G

This was added as like a temporary measure to allow us to do kcp remediation in the like short term, like hope for the long term, we'll get this hooks thing in and then we can probably move it back inside but yeah. This is meant to be like a temporary internal change, but it should be safe, yeah.

A

I'm fine like if we can put together a flow in the future where we have a pre drain, pre delete, hook: option where the MHC issues the delete and case EP does a hook where it says: go: remove this from Ed CD. If we can make that work, sure but I think we're in a transitional state as you mentioned Joel. So this will allow us to move forward and experiment. A little bit see how things go and we can certainly revisit I mean so, as I said, Michael I'm fully on board with the hooks.

A

We just need to get the proposal fleshed out some more and get it implemented and we'll see how it goes.

A

All right any last-minute things, or should we wrap up I.

B

Just wanted to say thank you to all of you for coming here and helping with this proposal.

A

Yeah, thank you as well. I think this was a really good discussion. I'm looking forward to getting this thing moving forward.

A

All right, I'm gonna, stop the recording and.