Description
Kubernetes Storage Special-Interest-Group (SIG) Volume Health Discussion - 23 April 2021
Meeting Notes/Agenda: -
Find out more about the Storage SIG here: https://github.com/kubernetes/community/tree/master/sig-storage
A
Hello everyone, thank you for joining today's meeting on volume health. Today we are going to talk about what the use cases are for this feature and what we should do as the next step.
A
So let me share this. I talked to a few people and got some feedback from them, so this is what I have so far. Right now we know that for volume health we are just collecting it from the storage system and then reporting events on PVCs or pods.
A
So right now, on the controller side we have an external health monitor controller, which is a sidecar, and on the node side we have that implemented in kubelet. I think it is still useful to have those events. So as the next step, I think we can discuss whether we want to bring this to beta right now. The only concern I have is that right now there is no CSI driver implementation other than our sample implementation in the hostpath driver.
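(For reference: roughly what the controller-side reporting described here amounts to. The external health monitor sidecar asks the CSI driver for the volume's condition and, when it is abnormal, records a Warning event on the PVC. This is an illustrative sketch only, assuming the CSI v1 Go bindings and client-go; the function name and event reason string are placeholders, not the actual sidecar code.)

    package main

    import (
        "context"

        "github.com/container-storage-interface/spec/lib/go/csi"
        v1 "k8s.io/api/core/v1"
        "k8s.io/client-go/tools/record"
    )

    // reportPVCHealth asks the driver's controller service for the volume's
    // condition and records a Warning event on the PVC when it is abnormal.
    func reportPVCHealth(ctx context.Context, cc csi.ControllerClient,
        recorder record.EventRecorder, pvc *v1.PersistentVolumeClaim, volumeID string) error {

        resp, err := cc.ControllerGetVolume(ctx, &csi.ControllerGetVolumeRequest{VolumeId: volumeID})
        if err != nil {
            return err
        }
        cond := resp.GetStatus().GetVolumeCondition()
        if cond != nil && cond.GetAbnormal() {
            // Events are the only surface today; they can expire, which is the
            // limitation being discussed in this meeting.
            recorder.Event(pvc, v1.EventTypeWarning, "VolumeConditionAbnormal", cond.GetMessage())
        }
        return nil
    }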
A
So that's the current situation for this feature. I also heard several people talking about using this feature for local PVs.
A
What happens is that right now, because we only have this as events, we can't really have a controller make any reactions based on it, because events can disappear. So what I found out is that a few vendors have actually just started to build their own.
A
They have their own implementation: although this one is useful, since they want to be able to react to it, they develop their own CRDs or add it as an annotation so that a controller can do something with it. Which is, I think, unfortunate, because we have this feature and we would like vendors to use it.
A
So that's what we want to discuss. Okay, I think there are a couple more things. Maybe we should go one by one, or maybe let me just go over all of them and then we can go through each one, just to see which use case is relevant for you, the people who are in this meeting.
A
This is just what I have heard so far, so I want to collect more from this meeting. Okay, let me put this link in there, if you want to add yours.
A
Okay, so let me actually share this one so that everyone can see.
A
Okay, so as you are writing down yours, I will just continue with the other use cases that I heard about. I think there are some people asking on the mailing list and also on Slack.
A
Basically that means cases where we have to forcefully delete the pod, either from kubelet or from another controller, to force a remount. Then there was someone also asking about rescheduling to another node, but I think that was from an email thread on the mailing list started a long time ago, and I think we would build that on top of the CSI capacity tracking feature that Patrick was working on; I think that's one level up. So there are quite a few things. And then also someone was asking if we can pass in additional information to NodeGetVolumeStats.
A
I'm not sure, maybe not. He was saying that right now we pass in the staging path, and then we have this volume path, which is either the staged or the published path. So okay, we have this.
A
I
so
I'm
not
sure
that
we
actually
have
that
information
right
now,
because
the
the
volume
get
note
get
volume
sets
right
now
we
call
this
periodically
incubate,
but
in
that
place
we
may
not
have
that
information
so
yeah,
that's
that
those
are
what
I
collected.
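(For reference: the node-side hook being discussed. kubelet calls NodeGetVolumeStats periodically; the request carries the volume ID, the volume path (staged or published) and optionally the staging target path, and in recent CSI spec versions the driver may return a VolumeCondition alongside the usage numbers. A minimal sketch under those assumptions; the path check is a made-up placeholder for a real driver-specific health check.)

    package main

    import (
        "context"
        "os"

        "github.com/container-storage-interface/spec/lib/go/csi"
    )

    // nodeServer stands in for a CSI driver's node service; a real driver
    // implements the rest of csi.NodeServer as well.
    type nodeServer struct{}

    func (ns *nodeServer) NodeGetVolumeStats(ctx context.Context,
        req *csi.NodeGetVolumeStatsRequest) (*csi.NodeGetVolumeStatsResponse, error) {

        // Placeholder check: is the mounted path still reachable? A real driver
        // would ask its storage backend about this specific volume.
        cond := &csi.VolumeCondition{Message: "volume is healthy"}
        if _, err := os.Stat(req.GetVolumePath()); err != nil {
            cond.Abnormal = true
            cond.Message = "volume path not accessible: " + err.Error()
        }

        return &csi.NodeGetVolumeStatsResponse{
            // Usage (capacity/inode) entries omitted to keep the sketch short.
            VolumeCondition: cond,
        }, nil
    }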
A
Okay, so while you are writing this down, I'd like to look at the local PV case. I think the person who raised it is here.
D
So it's basically a local persistent volume that is attached to a host, and we're using the health monitoring components to detect the failure cases for the volumes. What our customers request is this: basically, our customers deploy their application in a StatefulSet, and when something happens to the local persistent volume, for example...
D
If
you
know
these
down
or
something
happens
to
the
to
the
ssd,
the
pod
and
the
powder
will
fail,
but
we
won't
be
able
to
get
rescheduled
because
because
you
were
always
stopping
that
in
that
crash
loop
due
to
the
volume
failures.
So
what
they
want
is
to
have
something
that
automatically
remove
the
present
volume
claim
so
that
it
can
recreate
a
volume
and
unblock
the
recreation
of
the
pod.
D
The application itself has some mechanism to handle disk failures, so it can recover from an empty persistent volume.
D
Yeah, so basically what would happen is that after deleting the PVC, the provisioner would just delete the PV, and then the StatefulSet controller will recreate the PVC, which gets a new PV provisioned and bound to it. That makes sense because you can't change the PV, right? Yeah, you can't change the PV. Basically, the application owners just want to unblock the remediation for their pod.
D
No. Basically right now we don't have automation around this. We just manually delete the PVC, including removing the finalizer for PVC protection.
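(For reference: a rough client-go sketch of what automating the manual procedure just described could look like. This is not an existing controller, only an illustration of the flow: delete the PVC, clear the pvc-protection finalizer so the delete can complete, and delete the crash-looping pod so the StatefulSet controller recreates both and a fresh local PV gets provisioned and bound. A real controller would have to verify the volume is truly unrecoverable before doing anything this destructive.)

    package main

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
    )

    func remediateLocalPV(ctx context.Context, cs kubernetes.Interface, ns, pvcName, podName string) error {
        // Request PVC deletion; it stays Terminating while the finalizer and the
        // consuming pod are still present.
        if err := cs.CoreV1().PersistentVolumeClaims(ns).Delete(ctx, pvcName, metav1.DeleteOptions{}); err != nil {
            return err
        }

        // Clear the kubernetes.io/pvc-protection finalizer so the deletion can finish.
        patch := []byte(`{"metadata":{"finalizers":null}}`)
        if _, err := cs.CoreV1().PersistentVolumeClaims(ns).Patch(
            ctx, pvcName, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
            return err
        }

        // Delete the stuck pod; the StatefulSet controller recreates it together
        // with a new PVC of the same name, which gets a new PV bound to it.
        return cs.CoreV1().Pods(ns).Delete(ctx, podName, metav1.DeleteOptions{})
    }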
A
But I thought you were just deleting the PVC, right? So you are also deleting the pod?
D
Yes, that's right, because the pod is stuck in a crash loop and we have to delete it. I forgot to mention that.
D
Yes. The controller is basically to automate this process, and we are working on that.
H
Sorry, I'm still confused. Why does that have to be a StatefulSet, if you don't care about the state on that disk?
H
So the other controller, if you add the persistent volume claim there, it would create it, right? Wouldn't it?
A
And also, right now on the controller side we have events on PVCs, and on the node side we have them on pods. So for your use case, let's say we add this volume condition information to the PVC status from the controller side. Will that be enough, or do you also need it from the node side?
D
Yes, so right now we only rely on the two types of events: the ones emitted from the controller side on the PVC, and the ones triggered by the node notifier.
D
Yeah, I think that's one of the pain points our customer mentioned. We manage a fairly large cluster with some bare metal hosts today, and disk failures are a very common case for them. They're migrating to Kubernetes, and they want to fix that problem in Kubernetes, so they request that we handle this kind of failure automatically.
C
Thank you. Just another data point: at least some of our customers were looking to get the health monitoring status, and when this health monitoring daemon detects that the disk is failing, it should blink the light and they want to replace the disk. But that could potentially change the device ID on the node, and we cannot update the PV, so it has to be recreated, right?
A
In place? Or you're saying, okay, but they have to have a different PV, right, because they have changed the disk? So it has to be a different one.
C
Yeah, yeah, almost, sorry.
A
Okay, so it's almost like you have to rebind, keeping the PVC somehow, just like an in-place restore.
C
Yeah, so OpenEBS actually has a component which exposes the SMART statistics of block devices to Prometheus. Right now I think the detection doesn't support SSDs, but it works with HDDs, and it can be extended to support SSDs as well. But the problem is that the only thing it does is detect the issue and expose metrics, so it can send alerts, but the problem with in-place replacement would still exist.
A
It would be helpful if we had that volume health information in the PVC status.
C
I'm just thinking out loud actually: where does it belong?
A
Yeah, so at least right now we're talking about the controller side. That's why I was asking: is this from the controller side or the node side? It looks like for the node side we're not sure yet, so right now we're talking about the controller side. So if you can detect this from the controller side and we have this information in the PVC status, would that be useful?
D
I forgot to mention that we actually implemented some of the I/O failure checks on the controller side, for two reasons. One is that our Kubernetes version doesn't have the kubelet patch included. The other reason is that in our implementation our controller actually talks to each of the nodes to do the provisioning, so we basically piggyback on that route to get the volume health information.
A
I'm not sure if you noticed, but there's this new feature, distributed provisioning. You could do that as well.
A
Yeah, in the external provisioner there are two modes, but that's still on the controller side, I think.
G
Yeah, I just want to ask: the controller that was mentioned is doing some provisioning for local volumes, or...?
A
Yeah, for local volumes, if you look at the external provisioner right now, it actually supports two modes, and there are local PV CSI drivers that are using those two modes. What are the two modes? Distributed and central. Central is the same as what we had before; distributed means you can actually run your provisioning on the node.
A
I'm just thinking about this particular use case right now. Because it's only events, we can't do anything. So for this feature to be usable, since right now we can't actually react to it, does it make sense to add this to the PVC status so that it can be acted upon?
A
This
is
something
that
yeah.
This
is
like
one
one
question
that
I
I
have
so
this
is
we're
talking
about
the
controls
and,
of
course,
we
also
need
to
talk
about
this.
Other
this
note
side
as
well.
Does
it
make
sense.
A
It
is
that
is
what
we're
doing
right
now.
Actually
we
actually
only
doing
that,
because
this
is
only
for
this
is
only
for
if
you
have
already
provisioned
a
pd
pvc,
you
don't
you
start,
everything
is
all
set
and
you're
not
you
know,
kubernetes
ever
is
all
set,
so
you
don't
do
anything
anymore.
Then
this
is
the
then
we
are
actually
checking.
We
actually
don't
check
it.
If
it's
like
not
bound,
if
you
think
we
should
check
that
and
that's
something
else
we
need
to
consider.
But
right
now,
it's
really.
We
assume
okay.
E
So in the case of local PVs, if we do not think about the distributed provisioner, right now it's mainly late binding, right? Local storage classes typically always use delayed binding.
A
Okay, so I'm not saying we're only adding this for local volumes. I'm just asking: does it make sense to add this? Of course we can't know, we wouldn't know, whether it's local or not local; we don't really have a special type for a CSI driver. This really depends on the driver implementation.
A
So
this
is
like
a
jet.
This
will
be
like
a
general
field
for
any
any
pvc
status.
Does
it
make
sense
to
have
this
field?
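(For reference: to make the question concrete, one possible shape, purely hypothetical, would be to reuse the existing pvc.status.conditions list, which today carries resize conditions, with a new condition type. Nothing here is an agreed API; the "VolumeHealthy" type is made up for this discussion.)

    package main

    import (
        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // setVolumeHealthCondition records a hypothetical "VolumeHealthy" condition
    // on the PVC status, replacing any previous condition of the same type.
    func setVolumeHealthCondition(pvc *v1.PersistentVolumeClaim, abnormal bool, message string) {
        status := v1.ConditionTrue
        if abnormal {
            status = v1.ConditionFalse
        }
        cond := v1.PersistentVolumeClaimCondition{
            Type:               v1.PersistentVolumeClaimConditionType("VolumeHealthy"), // hypothetical
            Status:             status,
            Message:            message,
            LastTransitionTime: metav1.Now(),
        }
        for i, c := range pvc.Status.Conditions {
            if c.Type == cond.Type {
                pvc.Status.Conditions[i] = cond
                return
            }
        }
        pvc.Status.Conditions = append(pvc.Status.Conditions, cond)
    }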
A
I guess we don't have to make a decision right now, but I'm asking because of what I heard. I already know a few vendors doing this; even Nick, right, he started this, and when I talked to him the other day he said, oh, we are using our own CRDs because we can't really use events. He actually started this whole volume health feature, but he can't even use it. And from our side, we actually use an annotation.
A
Okay, yeah, so right now it's just events. Even with that problem it is of course helpful, but events can disappear, so it's not that reliable, though definitely it can still be very helpful. We started with some error codes, and we couldn't agree on what the error codes should be, so now we just have this boolean, whether it's abnormal or not, plus a message. Is that enough to solve this? I guess that's the question, because otherwise we would have to decide what error codes we want, I think.
I
A kind of next step might be to decide between these. In our first iteration we said: let's remove programmatic use from the picture to simplify the overall design, and let's just focus on what signals we want to surface to the end user. We've accomplished that through events, and it seems to be well received.
I
The second step, I think, is that now we want programmatic use of these signals, to have systems respond automatically to volume health. The question is which systems we want to target first: do we want to target third-party integrations, or do we want to target kubelet and the Kubernetes scheduler responding, like a first party?
I
And I think it's good to figure out what our final consumer is going to be for this iteration, so that we can ground it in concrete reality and say: okay, this is the type of behavior we're trying to support, and then, based on that, we can figure out whether the API makes sense or not. I think, and this is just instinct and it might be incorrect, that it might be better to start with a first-party use case.
I
Because in general, with first-party Kubernetes use cases, we try to focus on making things widely reusable for lots of different consumers, versus if we focus on a single third-party use case it might work for that one person or one use case but may not be broadly applicable. But that's just a gut instinct; it might be wrong.
A
Yeah, okay, I think we should talk about all of those anyway. So let's see what it means for kubelet; we do have a few people asking about this. Let's say on the node side we detect that abnormal condition.
A
So in this case, right now those are events on pods. Does that mean we should have that information added to the pod status? I think we probably need to ask the node team, because I'm not sure. And then there's also this: if the PVC is used by multiple pods, right now we have events on all the pods, right? So how does that get reacted to, for each pod where it is detected?
A
This
is
about
how
they
delete
itself
or
you
know.
So.
That's
are
questions.
I
think
we
need
to
ask
okay.
B
The data for that NFS share moves somewhere else, and now the pod is accessing its data through a sub-optimal path. What you would like to be able to do is just go to the node and update the mount with a different server IP, but that's not how NFS works on Linux. Unfortunately, you have to fully unmount the NFS share and then remount it with the new IP address, which would involve bouncing a pod in order to get optimal I/O performance again. So the interesting trade-off here is that you don't have to do anything; it's just that until you bounce the pod you're wasting some I/O performance. So it's really a question of preference: do you want to keep your pods alive until they're done, or do you want to optimize for I/O performance? What we would like to do is flag the volumes that are in this situation, because we can detect it.
B
We can say this volume is being accessed through a sub-optimal mount, and we can flag that through the volume health feature. We don't do this yet, but we'd like to be able to, and we'd like to have something on the other end watching for that and basically enforcing some user-decided policy. The user might decide: I never want to bounce my pods, because bouncing a pod costs me more than just dealing with the worse I/O performance, whereas other users might make the opposite trade-off.
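(For reference: the policy knob could be as simple as an opt-in the user sets on the PVC. The annotation key and the controller below are hypothetical, just to show the shape of "watch the health signal, then apply the user's preference"; nothing like this exists today.)

    package main

    import (
        "context"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // Hypothetical opt-in: "delete-pod" means bounce pods when the volume is
    // reported unhealthy; anything else means leave them running.
    const onUnhealthyAnnotation = "example.com/on-unhealthy-volume"

    func enforceVolumeHealthPolicy(ctx context.Context, cs kubernetes.Interface,
        pvc *v1.PersistentVolumeClaim, unhealthy bool) error {

        if !unhealthy || pvc.Annotations[onUnhealthyAnnotation] != "delete-pod" {
            return nil // user prefers pod lifetime continuity over I/O performance
        }
        pods, err := cs.CoreV1().Pods(pvc.Namespace).List(ctx, metav1.ListOptions{})
        if err != nil {
            return err
        }
        for _, pod := range pods.Items {
            for _, vol := range pod.Spec.Volumes {
                if vol.PersistentVolumeClaim != nil && vol.PersistentVolumeClaim.ClaimName == pvc.Name {
                    // Bounce the pod so its controller recreates it and the volume
                    // is freshly mounted (e.g. picking up a new NFS server IP).
                    if err := cs.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
                        return err
                    }
                }
            }
        }
        return nil
    }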
B
We can detect, for a given volume, whether any client anywhere is accessing it through a sub-optimal path, and we could say that means it's unhealthy and return that information back to the sidecar through the gRPC socket. Then what happens on the Kubernetes side is up to what we implement as a community, and what I would like to see is some policy enforcement engine that says: if it's unhealthy, bounce the pod, or if it's unhealthy, don't bounce the pod.
A
But right now, I think the reason we chose to add those to pods is that at the time we had this from both the controller side and the node side, and we thought we didn't want those two to be conflicting with each other. So that's why we said, okay, on the node side let's just add them to pods. I think that was the decision we made at the time, so that we differentiate.
A
But on your controller side... we're talking about the mount, right? Your controller doesn't know about the mount, because supposedly it is the node side looking at the mount problem; the controller side does not look at that. Are you saying your controller will be looking at the mount, at this mounting problem?
G
So you are doing kind of a similar thing, using this to trigger an unmount and remount, if you are not...?
B
Well, I'm proposing this. We aren't actively doing it, but I would like to be able to, and I'd like to put the policy decision in the end user's hands, so that they can decide which behavior they prefer.
G
So there are two options. One is to delete the pod and recreate the pod. The other option is to just trigger an unmount and a remount; both can achieve it. How do you trigger it?
B
But with an option to decide whether that's what you want or not, because it's perfectly valid to not delete the pod and continue running with a less-than-optimal I/O path. So it's up to the user: what do they care more about, pod lifetime continuity or I/O performance?
G
Right, the reason I asked about remount is that you put in your use case something like fixing the mount with a remount for certain abnormal volumes, right? We kind of have a remount behavior, like periodically triggering a remount, because for a secret, for example, if the secret is updated, we want to do that.
A
Yeah, but it's like volume snapshot, which is only for CSI, right? I think we're not adding that.
C
But this mechanism that we are talking about, where you put the health check result in the PVC status and an external controller deletes the pod so that the ReplicaSet or StatefulSet recreates the pod, possibly on another node or something: this mechanism is basically generic. Nothing is CSI-specific.
A
Okay, so I think we talked about this. Going back to here now, I think some other people talked about this: if kubelet detected this, then terminate the pod.
A
So this is from the node side.
A
So do we need to do this in both places? Okay, so we have what we're talking about here. Basically, let's say you're implementing this volume health in your CSI driver; then you should try to avoid sending the same information from both the controller and node side, because, assuming we have both, you don't want it to be reacted to from both sides, right? So let's say on the controller side...
A
There's some controller trying to do some reaction, and then from kubelet we're also talking about deleting the pod or triggering a remount. We don't want this to be happening from both places, right?
A
Yeah, so maybe the driver can choose. Maybe if they just want to rely on the controller side, they don't need to implement the node side, or they implement the node side only for the events but not for the reaction. I'm just talking assuming that we have a reaction; let's say we have reactions in both places.
A
Maybe
we
don't
want
to
yeah.
We
don't
want
to
turn
on
the
reaction
in
both
places
because
they
could
be
conflicting
with
each
other
overstepping.
A
So,
okay,
okay,
so
going
back
to
okay
going
back
to
this
one,
so
I
think
we're
now.
In
this
case
I
guess
I'm
not
sure.
Where
should
we
have
that?
Have
those
information?
Let's
say
if
we
want
public
to
do
something
for
the
for
the
amount
right
so
well
clearly
to
let's
determinate
the
pod
trigger
a
remark,
then,
should
that
information
be
in
the
pod
itself
or
is
that
I
still
have
this
question?
Is
it
because
we
are
going
to
have
that
in
multiple
parts?
E
Like multiple pods accessing the volume on the same node, yeah.
A
Yeah, so that's the ReadWriteMany case. Does it make sense to have that information in the pod itself? I'm not sure; this seems like a question for the node team.
C
I had a question: what happens if we find a PV or a volume that is not healthy and we have a pod running on it, we kill the pod, but we want to block any pod or workload from using that PV/PVC combination? Do we have a use case like that? Or do we expect that after, for example, I'm thinking of local volumes, maybe CSI...
A
Yeah, I get it, because this is a trigger, right? If we only delete the pod, the volume is still there. So actually, if you're looking at this case, we're talking about deleting both the PVC and the pod so that they can be recreated, rebound and remounted. If we only delete the pod, that seems not right, because the volume is still not healthy; it's not the pod that's the problem, it's the volume.
G
Yeah, I kind of feel like the focus here is volume health monitoring, and in terms of how to react to a volume's abnormal behavior, it's kind of up to the application or controller.
C
I agree, but we can ignore the part about how to take action; it still determines where to put the information. For example, if you put the health status in the pod and the pod is deleted, then that information is lost, whereas the PV/PVC is still there and can be used by any pod.
G
Yeah, we can focus on the problem of where to put the information. I don't think the pod is a good place, but PV or PVC, we can have a more focused discussion there. I didn't closely follow the original design, but maybe we can also iterate. Right now the design only checks the PVCs that are available and monitors the health of whatever PV the PVC points to.
A
Yeah, right now we're only checking the ones that are already created and bound, because we know there are already controllers handling the case where they are not bound; that's something different. So we are looking at the health status after it is already created and bound.
B
There are certain types of health problems that are fundamentally associated with the volume, and those probably should be associated with the volume at the API level. But in my particular use case the problem isn't with the volume; the problem is with the pod-volume pair, where a particular pod is having a particular problem with the volume, but that doesn't necessarily implicate any other pods having problems with the same volume. And unfortunately, the way the CSI RPCs are written, there's no information about which pod is being asked about.
B
It's just saying: this is the volume, how is it doing? But if there were a way to understand the health relationship of a pod-volume pair and then put that information on the pod, that would make sense.
A
So that is the node side. For the node side we have that information: we actually have what volume and what pod, you get all of that information in the message we have. Right now, since we only have a message and whether it's abnormal or not, you will get the volume information and the pod in it.
A
This is from... we have the sidecar, right, so after we collect that information... Well, actually no, it's not the sidecar right now; this is in kubelet, sorry, yeah.
A
Yeah, that is actually possible, I think, but then we are getting this information from the storage side, right? So your node plugin knows. So if you are...
A
Right, so basically it will just tell you if there's some problem from its point of view, and then we are just going to bubble that up, and it's going to be the same message for all the pods that are using that PVC. That is true. I don't know the case where one pod is okay and the other is not; I'm not sure what case that is.
B
Some users of the volume are healthy and some users of the volume are unhealthy, and it would be very hard to translate that into which pods are the unhealthy ones, but in principle it could be done. So I don't know; saying all the information has to be associated with the volume doesn't feel 100% right to me, although it might be the expedient choice.
A
I don't know what to check on the pod side, to be honest with you, what we would check. But it does have that information; it does have the pod information, it knows which pod it is. I actually don't know what, or are you saying, for example, to check the particular path?
B
The external health monitor sidecar doesn't know anything about pods. It just knows about volumes, right?
A
Okay, time check: it looks like we are at the top of the hour. It looks like we didn't finish; there's one more thing we didn't get to, but we talked about these two cases at least. So what are the next steps?
A
So then that volume also needs to be deleted, right? If we are only deleting the pod, that doesn't mean the volume will be deleted, and kubelet doesn't know it is supposed to also delete the volume that is attached. That seems a little strange, because if this is part of a StatefulSet, kubelet just seems to be the wrong place to be deleting this. I'm not sure.
G
The point is, we might not be able to guess what a controller wants to do here. So maybe the focus from the volume health side is to discuss how to update the status, like what API to add it to, either PV or PVC, and not talk too much about the controller side and what a controller would need to do, whether to delete the PV, the PVC or the pod.
G
Maybe in the next step we can start talking about that, but right now I feel like there's no standard way of doing this; it's up to the custom controller and also really to how the pod uses this volume, whether the StatefulSet controller recreates the pod, and so on.
G
So having Kubernetes do something, having kubelet do something, will definitely be a long discussion; a lot of decisions need to be made. Yeah, it will be long.
A
Oh, you're saying, okay, okay, don't disable the feature for now for the CSI audience.
A
So
so
I
think
we
right
now.
I
think
we
are
talking
about
like
adding
maybe
adding
a
field
in
either
pb
or
pvc,
so
that
is
a
first
class.
We
can
see
how
that
can
be
used,
but
I
think
it's
it's
more
like
we
are
talking
about
this
api,
but
then
we
are
not
really
going
to
add
the
reaction
directly
in
the
in
the
kubernetes
controllers
right
now
at
in
this
step,
I
think
that's
that's
what
I
heard
so
far
right.
I
Yeah, I think the suggestion is to maybe think about the Kubernetes reactions first, because the Kubernetes reactions might be more generic and reusable than a third-party reaction.
I
I think the thing is, it will be a forcing function. I think the natural reaction is to go with the path of least resistance, and in this case that would be something like: there's use case X from consumer foo, and if you focus on getting that working, it will be easier to get it working end to end, because it's a very specific use case for that consumer.
G
Okay, so the reaction has two areas. One is recovery; that's where I'm saying it's hard to have a general way that fits every case, but we can discuss that. The other is the scheduler side, which can help schedule pods better: if you know some PV or volume has issues, you can avoid scheduling onto it. That one, I feel, might be easier to think through.
A
That would kind of combine it with the CSI capacity tracking. I think that has to go together with capacity tracking.
A
Oh, okay, if that's the case, I need to take a look at this one and see where we should add this, because we actually talked some time ago about adding that in the... but it seems really weird to add it in the CSI capacity. So okay, let's talk about this one later. So for better scheduling...
A
Then
this
we
have
to
sync
this
together
with
this
at
the
second
city,
scheduler
csi
capacity
tracking
feature
and
see
you
know
where
this
field
should
be
added
so
yeah,
okay,
so
maybe
that
yeah,
I
think
that
makes
sense
in
think
about
how
to
do
a
better
scheduling
rather
than
saying
yeah
from
public
side.
This
is
a
little
messy
if
you're
like
okay,
so
I
think
we
are
out
of
time
yeah.
A
So I will schedule another meeting in the future to talk more about what this would look like in the API and how kubelet can potentially use this to recover.