From YouTube: Kubernetes SIG Node 20210928
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
B: No, no, that's fine! I was just — I guess it's the September 28th SIG meeting. We have a few items on the agenda. Some I'm not sure we'll get to full resolution on today, but at least we can get awareness. I guess, Sergey, would you be kind enough to run through today's agenda?
C: Yeah, absolutely. Hello, everybody. I think we can start as usual with a formal introduction: if you missed the last two weeks and don't know what was happening in terms of PRs and the work going on, you can click all these links for the created PRs and the closed and merged PRs.

C: I went through that, and there is nothing stuck along the way that needs to be picked up, and we're doing a very good job updating PRs — replying to PRs; 146 is an unusually high number, so we're doing great. I also wanted to remind everybody that the soft code freeze we announced before is in two weeks, and one of those weeks is KubeCon.
D: For the soft code freeze, I think I sent an email about this, so you should have it in your inbox on the SIG Node mailing list. I just wanted to add: this is the first time we've done this. We were hoping to get some features merged earlier. I know that everybody is very busy, but one thing we really want to avoid is everybody not having their PRs ready until the last week of code freeze, given how long the development cycle is.
D: So even if we don't get everything that we want merged by then — we've got a bunch of beta things it would be cool to graduate by that point — please have a PR ready for review by that point, so we don't spend all of the last week of code freeze just doing that. Also, I suspect that the "148 updated" count might have been somewhat my fault, since I got through a very long GitHub backlog this week.
C: Yeah, okay, let's go into the agenda. I think the first item is Skylar's, on enabling this for static pods.

E: Basically, she's implemented the pinned-images PR where, in CRI, you can specify an image as pinned and the kubelet does not try to garbage-collect it. So she wanted to quickly check whether we can get it in, or whether we need to open an enhancement for that.
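For readers following along, here is a minimal sketch of the garbage-collection behavior E describes. The types are simplified stand-ins for the CRI Image message (which carries a pinned flag), not the kubelet's actual implementation:

```go
// Sketch: during image GC, images the CRI runtime reports as pinned
// are never candidates for removal. Simplified, illustrative types.
package main

import "fmt"

type image struct {
	ID     string
	Pinned bool // reported by the container runtime over CRI
}

// imagesToGC returns the images garbage collection may reclaim:
// anything pinned or still in use is skipped.
func imagesToGC(all []image, inUse map[string]bool) []image {
	var candidates []image
	for _, img := range all {
		if img.Pinned || inUse[img.ID] {
			continue
		}
		candidates = append(candidates, img)
	}
	return candidates
}

func main() {
	imgs := []image{{ID: "pause", Pinned: true}, {ID: "old-app"}}
	fmt.Println(imagesToGC(imgs, map[string]bool{})) // [{old-app false}]
}
```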
A: They want to enable this — expand the scope to static pods. He or she — I don't know, they didn't make that clear — but from the agenda it looks like they want to expand it. I think we do have concerns about the things we talked about last time, when we discussed the pinned image; once we expand this to static pods, then we need a way to handle it.

A: So if it's a static pod, are we going to think about the current state? What about a static pod with a different version — even the same pod but a different image, a different image version? Which one should be evicted? All those kinds of complexities need to be considered here.
B: So, let's get feedback on — I guess what I was trying to figure out was: the KEP linked off the PR, which was 2694, basically says that the runtime decides which images are pinned, and I thought the PR that was presented had just implemented that. Is there a place where this was tied back to static pods?
A: The agenda item's author actually asked: can we handle static pods in this one, or do we need to open an enhancement for it? I do think we should open an enhancement — that's more complicated than the current, really small pinned-image change we discussed. At that time, actually, we didn't even mention static pods.
A: I also went through that PR. On its own, I think it should be mostly okay, but I do have a concern about people maybe abusing it; then we end up binding the node and we don't know how to recover, right? Today we just force-delete once we have run out of disk. So what's the rescue if we have those kinds of problems? This is why we need to discuss it anyway.
A: I will comment on this request, since she cannot attend, and ask for the enhancement request.

C: So we have action items on that. And I'm a little bit surprised that it was added into an existing KEP. On the CRI side, I don't know if it's only in v1 or whether it's backported to the current version of CRI.
H: So basically, what was going on is that I had made a pull request that enabled pods to be pinned in order for them not to be removed at garbage collection, and there wasn't —

H: So we needed it, probably, to be put through the process of adding a feature, because pinning the pods is a feature, and that's the plan.
H: Maybe I wrote it totally wrong, because I'm going from advice that Bernal gave me about bringing this up at this meeting, so I probably totally messed up how to write it.
A: I want to get back to this one. As Derek already mentioned, we already agreed when we discussed the pinned image in the past — so why are static pods being brought up at this time? Let's just forget about static pods and keep this to the narrow scope we agreed on; if we go to static pods, we need a more complicated policy. So let's just make sure the PR stays narrowly scoped. I did look at the PR, and I also didn't see that expansion in it.
A: So let's just make it clear: if we want static-pod support, come up with a new policy and enhancement — it cannot piggyback on the previous one. But the PR as it stands is actually aligned with the KEP, I think, if —
C: Okay, then, let's go on. There was a standing item about dynamic config. I replied in a comment, or mentioned — I think Parker cannot be here; that's why it's carried over. So, a reminder: if you are affected by dynamic kubelet config and you have a strong case to not delete it in 1.24, please step forward.
I: Yes, I just wanted to mention that it's ready for further review and approval. I think we — Ronald and I — are at the point where we think it's ready. We trimmed it down to just the core use case and tried to remove everything which is not directly related to the KEP.
A: Yes, I saw that — I also sent you a message — and we are going to review it; we are going to be the reviewer and also the approver on this one.
J: So I sent — I started a thread on the mailing list this morning. I also included sig-apps and put this topic on the agenda for the next sig-apps meeting, because it sort of touches apps and touches node.

J: Eviction currently sometimes prevents eviction of pods that are not ready and sometimes does not, and sometimes preventing a not-ready pod from being evicted will make it impossible to drain a node — there's an open bug for that. So at least some people are negatively impacted by that and expect eviction not to have opinions about pods that aren't ready, but other people are relying on eviction blocking the deletion of not-ready pods for other reasons.
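For background on the mechanism under discussion: a voluntary eviction is a request against the pod's eviction subresource, and a PodDisruptionBudget can cause the API server to refuse it. A minimal client-go sketch, with client construction omitted and placeholder names:

```go
// Sketch: requesting a voluntary eviction via the policy/v1 eviction
// subresource. If granting it would violate a PodDisruptionBudget,
// the API server refuses (HTTP 429) — the "blocking" behavior above.
package sketch

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func tryEvict(ctx context.Context, cs kubernetes.Interface, ns, pod string) error {
	err := cs.PolicyV1().Evictions(ns).Evict(ctx, &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: pod, Namespace: ns},
	})
	if apierrors.IsTooManyRequests(err) {
		// Refused by a PDB: too few healthy pods would remain.
		fmt.Println("eviction blocked, retry later")
	}
	return err
}
```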
J: I think Michael mentioned that in the pull request and then in the thread. So this has come up before, and it's kind of been unclear what the —

J: — regardless of what the PDB says. And recently — "recently" as in about a year ago — we started allowing deletion of a not-ready pod if the PDB status says there are enough healthy pods. So there's a mismatch between what the API server allows and what the disruption controller considers to be a healthy pod, and I think what we have today doesn't really make sense, and I'd like to try to figure out how to resolve this. So, the questions that I asked in the sig-node thread — there were three questions.
J: The second question was: does it make sense for eviction to block deletion of pods that are already disrupted? And the third question was trying to get feedback from people who are relying on the current behavior, and on how they handle some of the races and the lack of guarantees around the current behavior.
B: So, Jordan, one question I had: in some of the dialogue on the issue, there was special behavior documented around how you can prevent any voluntary disruption by setting, I guess, the PDB to zero. Is there any express desire to change that behavior, or is that independent of the PR that was linked?

J: I don't think there's a desire to change that behavior. I think the question is what counts as disruption, and so that was my first question: if a pod is not ready, is it already disrupted?
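The "PDB set to zero" pattern B refers to can be expressed as below — a minimal sketch; the namespace and labels are illustrative placeholders:

```go
// Sketch: with maxUnavailable set to 0, the eviction API will refuse
// every voluntary eviction of the selected pods.
package sketch

import (
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func blockAllVoluntaryDisruption() *policyv1.PodDisruptionBudget {
	zero := intstr.FromInt(0)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "block-evictions", Namespace: "demo"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &zero, // no voluntary disruptions allowed
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "etcd-canary"},
			},
		},
	}
}
```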
B: Yeah, so I will admit that some of my confusion this morning, when trying to speak to both David and Michael on the topic, was with the terminology we sometimes use: "ready" versus "available", with the PDB referring to "available", where "available" only had meaning on ReplicaSets but not on pods themselves.
B: What I was trying to figure out was whether there is an unspoken use case desired for "don't disrupt the scheduled, non-terminal pod" — that's what I feel is coming up in the existing dialogue on the issue. And so, to me, I wondered whether the API is as documented or the API is as implemented, and whether we want to only tighten API behavior when the as-implemented behavior can still be restored, maybe with an undefined knob.
A: I partially get what you're saying, but I want to first get back to Jordan. Yes, yeah, it is true; we have discussed this many times. I haven't looked at that issue and the PR in detail, but it's not a surprise to me — we have had this kind of problem with the PDB in the past. For me, that's the cluster-level will for that workload or service, right? This is why I think a lot of terminology is being misused — for example, "eviction". When we first designed eviction —
A: — actually, it's a local decision, an optimization; preemption is actually the cluster-level counterpart of things like eviction. When I decide to evict, it's: I'm going to relieve the node's resource pressure. Or, when I want to drain the node, I want to evacuate all those things and then claim that the node is ready to be removed, properly removed. But now it's kind of mixed all over the place. And on that particular PDB issue — we had this kind of problem with ReplicaSets initially.
A: I also had a long discussion — not even within sig-apps — I had a long discussion with the API machinery team.
A: So, on this kind of thing, I really think the not-ready pod is disrupted, but that's not our decision. The decision belongs to the PDB, as the controller — it's their decision, right? It's not our decision. On the node side, we already mark the pod as running, ready, or not ready — and you have the ready-plus states in some cases — so basically the PDB acts based on that.
A: They should define the policy for the service's availability and decide how they are going to delete this pod, right? In the end, it is just: what is the policy they want? And Kubernetes cannot make that decision, because we could see that this pod, at this given time, due to all those problems, due to all those available-resource issues, is in crash loop — so we mark it as not ready, but it is admitted into Kubernetes.
A: We only run it: once it succeeds, it has run once, and its running state, based on the restart policy, determines whether we keep running it or terminate it — all those things. It is at the cluster level that the controller decides the service-level availability, or whatever availability they decide, and what the next step is. So that's the kind of thing we agreed upon at the early stage of Kubernetes.
D: So, jumping in — not on a terminology thing, but on a concern I would have, Jordan, if we change the eviction logic to be able to evict not-ready pods. One of the things that I've seen — there have been a number of different PRs.
D: I think there are a lot of cases where "not ready" is not necessarily a good enough signal, I think, to evict a pod on its own. I would say that this has to be an app-level consideration, because there's not enough context available in the kubelet to know.
F: I completely agree with that, and I would like to say that "pod disruption budget" was the wrong name for this; it should be called "application disruption budget". If you're thinking about disrupting a pod in the context of an application, that's different from thinking about disrupting an individual pod, and my premise is: the eviction API should take no action unless it can know for certain that the action it takes will not result in an unhealthy application.
F: If the application is already unhealthy — because all the pods are in crash loop, or what have you — it should take no action, kind of like a circuit breaker failing open: you requested this, I'm in an unknown state, I can't proceed. That will cover these cases where readiness is flapping because the kubelet restarted, or the kubelet might be going unreachable temporarily, or the pod was restarted.
F: For whatever reason, readiness can be quite transient, and I think this is primarily targeted at automated behavior — trying to drain a node and then remove it from the cluster in an automated way. Obviously, if you're sitting at the keyboard and drain fails, it's not a big deal; you can work through that quite quickly. But if you want a system that is constantly maintaining itself, you need this logic to account for lots of different scenarios.
F: — an unknown state, because we don't really have an application-level signal. Not all applications are behind Services, so just because the pod is labeled as not ready doesn't mean it's not doing useful work — especially if the node is unreachable — and it also doesn't mean that there won't be data loss if we were to remove that pod. That's the point I made on the list regarding our use of etcd in OpenShift: we use these canary pods because we run etcd as a static pod.
J: There are a few things that I'm trying to reconcile. The first is that we already allow deletion of not-ready pods in some cases, so the idea that eviction can guarantee —

F: But they do absolutely prevent voluntary disruptions. So there's always going to be a race between involuntary and voluntary, right? If I want to evict a pod and then all of —
F: — a different pod is impacted, there's nothing we can do to prevent that, because that crash could happen right before or right after; that's just what's going to happen. But if we have a reasonable level of certainty that the data we do have is accurate, we should be able to make good decisions with that data.
J: Yeah, I think the exceptional cases would be: a node crashes, and then another node crashes, and then a controller goes bananas and is deleting everything. Those exceptional cases are exactly the types of cases where we don't actually have great confidence that the data in the PDB is accurate — like, if the disruption controller isn't keeping up with updating the status, and we're like —
J: — "I had three healthy things and I only wanted two, and all of these are not ready, so cool: delete, delete, delete." That happens today. So I'm trying to reconcile this use of PDBs with the current behavior we have, I think.
J: The other thing is that for the controller to use readiness as a signal for the status count of healthy pods, but for the API server not to use it, seems confusing to me.
A: Jordan, I totally agree with you about the expected events. The problem is — and I also agree with Michael here — the unexpected events. At the node level, Kubernetes cannot predict them, right? Kubernetes doesn't even know whether it will itself crash together with the node. So all it can do is generate the data: when the node is available, generate availability data, and then there is a way at the cluster level to detect the network partition, to detect the crash loop.
A: So those kinds of information should be collected — which is kind of what we have, like the node timestamp, right? Sending the status timestamp.
A: This is what we are trying to do. Then the PDB or another controller makes a decision based on that; they cannot base it just on the pod readiness or a single status. They need to take other status into account in the controller. You can see this kind of debate in the Job controller and also in ReplicaSet — it has been debated many, many times. Whereas at the node level, what I can do is just manage the node-level devices, resources, and everything.

A: So we make our own decisions on those, but we give the data back — the feedback loop is pretty important here.
J: Like, we tell people to use drain as a tool for managing nodes, but it's pretty trivial to deadlock it in a way where you look at the node — no, not the node — you look at the pods, and they're not healthy and they're not available, and until they're deleted the replacements can't be spawned.
F: We have resolved the deadlock: if you are only attempting to evict an unready pod and you have enough ready pods covered by that PDB — eviction checks that today. So if you lose one node and there's only one replica, right, and we have properly tuned, highly available workloads, that node can be successfully drained today, no problem. And I think the concern is, okay —
F
We
should
definitely
close
that
gap.
That
seems
like
we
could
put
just
like
something
in
a
status
field
like
like
just
something
that
needs
to
be
popped,
so
the
disruption
controller
does
the
popping
and
the
eviction
does
the
pushing.
So
the
eviction
won't
be
able
to
complete
that
request
unless
that
field
is
empty
and
that
field
is
only
emptied
by
the
destruction,
controller,
etc.
That's
one
idea,
but
definitely
there's
no
deadlock
today.
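A minimal sketch of the push/pop handshake F is proposing — the field and function names here are hypothetical, not an existing Kubernetes API:

```go
// Hypothetical sketch of the proposal above: eviction "pushes" a
// pending disruption into a status field, and only the disruption
// controller may "pop" (clear) it after re-validating health.
package sketch

type hypotheticalPDBStatus struct {
	// PendingEviction is non-empty while an eviction is in flight;
	// further evictions are rejected until the disruption controller
	// clears it.
	PendingEviction string
}

// admitEviction is what the API server would run on an eviction
// request: reject if an earlier eviction is still unacknowledged.
func admitEviction(status *hypotheticalPDBStatus, podName string) bool {
	if status.PendingEviction != "" {
		return false // previous eviction not yet acknowledged
	}
	status.PendingEviction = podName // the "push"
	return true
}

// acknowledgeDisruption is the controller's "pop", done once it has
// observed the disruption and recomputed how many pods are healthy.
func acknowledgeDisruption(status *hypotheticalPDBStatus) {
	status.PendingEviction = ""
}
```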
F: Then there's a very large time period where pods can be running and the API status can be incorrect; and because it's unready in the API, we're saying we're going to delete this and check nothing else. I think that's the wrong decision.
A: I think, most of the time when it is not ready — I realize our intent here: it is "not ready, but not gone", right? It could mean the kubelet restarts it immediately and later it's back to ready, right? Or it could be a crash loop with backoff, where Kubernetes will not restart it for a long time — and that state is totally disrupted, to me, if we are talking about application-level availability.
A: So I think that's the missing signal for the PDB: they only look at readiness alone, without more detail — like how long this has been crashing since the initial crash, how many restarts, how many retries, the restart counter.
D: Yeah, there's some chat happening in the text chat, and I just wanted to address one of the things written there, because I mentioned we have this bug. There is another bug that was filed as a bug — I don't actually think it's a bug, but there was a bug filed against us talking about —
D: Well, you know, if you restart a kubelet, then pods show as not ready. And that's the sort of thing we need to consider here: "not ready" doesn't necessarily mean there's anything wrong with the pod. I don't think we can always assume that; we need more context in order to be able to evaluate whether something is actually disrupted or not. Because it would be a very big change in expectations for users to change the default to "oh yeah —"
D
We
just
assume
the
pods
are
ready,
unless
proven
otherwise
to
me
that
doesn't
seem
like
a
safe
default.
We
should
assume
that
the
pods
are
not
ready
until
proven.
Otherwise
I
think
which
is
the
current
behavior,
so
that
was
filed
as
a
bug
against
us.
So
david
had
a
comment
like
that
seems
like
that
bug
that
we
should
fix.
I
don't
think
we
should
fix
it,
because
I
don't
actually
think
it's
a
bug.
I
think
it's
just
a
mismatch
in
expectations.
D: Similarly, with this bug, a lot of the folks on the issue are saying things like, "well, but a pod that's crash-looping is already dead — you know, I consider it terminated." Well, the kubelet doesn't consider it terminated, and there could be any number of reasons. If it's crash-looping, sure, that's probably disrupted; but if we're just looking at "not ready", there could be any number of reasons why it's not ready.
D: It could just be that there was some issue with — you know, the readiness probe flaked or something like that. So I don't want people to conclude, "okay, it's not ready, definitely disrupted." I don't think that's true, and I don't think that's a bug.
A: They basically have one problem in particular: okay, the pod crashes, and then we keep it not ready, and the restart count keeps ascending; but the ReplicaSet looks only at "not ready", and then it keeps creating new pods, and keeps creating pods, but it doesn't remove the previous ones.
A: So in the end, with replicas equal to three, they end up with quite a few — like, more than ten — and tons of them are left behind in the not-ready state, taking up the node's resources. So with that logic, yeah, we need to tighten — I just read that off the direct chat — we need to tighten up the behavior here and make it more clear. So yeah, I agree.
B: What I was trying to figure out was: do we actually feel safe making this change in the community without allowing the ability to fall back to the prior behavior? This seems like a tightening without an ability to loosen, which feels risky for a GA API. That's why I was trying to find a way to say: I can have scheduled-but-not-yet-terminal pods be treated differently than scheduled-but-not-yet-ready pods, if that makes sense.
B: I mean, the tension here is: do we want to treat not-ready pods as disrupted or not? So, basically, having an option that allows you to say "I don't care about the readiness state; I only care whether the pod had ever started, or was in the process of starting" — that seems like a use case that might have been missed in the PDB discussion.
B: On some of the other things that you raised, Jordan — I think I was the one who said PDBs should ignore terminal pods; at least we all agreed that one was safe. But for this one, I think there are reasonable ways to —
B: — right now, I don't think the OpenShift use case that Michael is communicating is a disastrous product posture in any way, and I don't actually feel like the present behavior, or even the updated behavior, would have a material impact. Because the issue, as we've talked through with Michael, is what happens if you have lost two nodes and have already had a quorum failure — and there are some issues with that independent of this capability.
B: To me, it's more like: can reasonable people be depending on this behavior, and is it right to tighten that behavior unexpectedly on them, versus finding an alternate path that both supports tightening for correctness and captures the gray area in between right now?
J: I would actually consider the current state to be sort of the worst of both worlds, because drain is vulnerable to deadlocks, and depending on it to handle not-ready pods in a way that doesn't deadlock is vulnerable to, like, the controller not running during the drain.
B: The StatefulSet pod won't be deleted until the node has said "I have deleted it" — and that is the pod you're most likely to put a PDB in front of anyway.
J: Yeah — so, like, I'm not particularly attached to a particular outcome. I just want: if someone is expecting this to be safe for them to use with drain, I want that expectation to be met; and if someone is expecting PDBs to keep disastrous things from happening, I want that expectation to be met. And I don't think either of those is completely true right now, and I —
F: I don't see why this needs to be part of eviction. If your organization — you know, whoever; not you specifically, but any organization — decides "I don't care about unready pods; delete them", I mean, that is a trivial line to write, right? Step one: delete anything that's not ready. Step two: drain. Or reverse them, if you feel so inclined. I don't see why we would need to put this in eviction specifically.
J
I
think
the
mismatch
between
the
controller
and
the
server
is
at
best
confusing
and
at
worst
opens
the
door
to
like
mismatches
and
guaranteed
behavior.
J: I just want to see it be coherent. If that means another option on the PDB to say what you do with these things, then the controller can honor it and the server can honor it. But the mismatch of the two is pretty confusing; it makes the admission code very difficult to reason about.
B: Yeah, just out of curiosity — I'm trying to think of other edge cases that might come up around PDBs with this — do you feel, Jordan, that there's an expectation mismatch between how a ReplicaSet defines "ready" versus how it defines "available"? And do we want the PDB controller to maybe align how it views "available" with the way the ReplicaSet views "available" — which is "ready for some minimum period of seconds" — so that at least you avoid flapping?
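For context on the ready-versus-available distinction B raises, a minimal sketch with illustrative values: a ReplicaSet or Deployment counts a pod as available only after it has been Ready for minReadySeconds, which damps readiness flapping.

```go
// Sketch: a pod counts toward status.availableReplicas only after it
// has stayed Ready for MinReadySeconds (values are illustrative).
package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func exampleDeploymentSpec() appsv1.DeploymentSpec {
	replicas := int32(3)
	return appsv1.DeploymentSpec{
		Replicas: &replicas,
		// Ready for 30 consecutive seconds before counting as available.
		MinReadySeconds: 30,
		Selector: &metav1.LabelSelector{
			MatchLabels: map[string]string{"app": "demo"},
		},
	}
}
```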
J: We're thinking of these pods as being in only two categories — disrupted or not, safe to evict or not — and it might be that there are three categories. There are the ones that the controller considers countable as currently healthy — and maybe that's actually what we have today, implicitly, because the controller considers readiness and the API server doesn't. So maybe we do have three categories of pods; it's just never been formalized. "Safe to evict unconditionally", or "safe to delete unconditionally" —
J: — is one category. A perfectly ready, contributing-member-of-society pod is another category. And then there's a middle area where it's like: we don't want to just delete these things — they might be doing work — but they're certainly not healthy; they're not ready, they're not being routed to by Services, we don't know what their status is. That just hasn't been formalized, and so a lot of the code is basically these binary, boolean, good-or-bad —
F: — functions. There's also a fourth category: canary pods. Specifically, we point out in the documentation: schedule a pod, put a PDB with allowed disruptions zero on it, and that is basically a proxy for "you cannot drain this node successfully". That could also represent a static pod, or just be part of an administrator procedure saying nobody is allowed to get rid of this one particular — this class of nodes — until they contact me.
J: I think the current state is problematic and confusing, and people are relying on PDBs for things that PDBs can't actually guarantee. So if a change is needed in PDBs, or in the API server, or in the controller, that's fine. I don't think our current state is "this is pretty much fine". And if someone feels like pushing a big change here that would make it better — there are bugs around deadlocks, and there are latent bugs in the API server implementation that make what some people are apparently depending on unsafe. So I —
D: We've been discussing this for a while, and we have a couple more items on the agenda. In order to make sure that they have enough time, we can maybe table this discussion to next week and move on to those.
C: I wanted to just say one thing about it — I can present next time — but what I wanted to say is: I think it may be a good idea to send a questionnaire to users to understand what is preventing them from migrating off dockershim. I put a questionnaire at the end of the document, so if you have an opinion on what questions to ask, please comment there, and next week I can present more.
K: Okay, so, a short introduction, because I'm new to this group: I've been doing most of my work in SIG Storage. Recently I started looking at the structured logging effort, and I'm helping out there. I'm also maintaining some of the upstream projects together with Tim Hockin, and I noticed while doing that that a few things have been pending in the SIG Node area. It starts with the kubelet flags that refer to config files, or that have a corresponding entry in the kubelet config file.
K: I think the deprecation remark is now several years old, and I was just wondering whether there is still a plan in place to actually remove the parameters — because eventually I might add one more to the list that will basically be deprecated from the beginning — and I was wondering what the status is here. Does anyone know?
C: So, we're clearly working on some flags migration; there are a couple of PRs currently in flight. I don't know what the status of that process is. I think at some point there was a single issue tracking all the deprecations — they were separate issues, and we created a single issue tracking the whole migration.
D: We're discovering, I think, sort of piecemeal at this point, that there are some things that have flags lacking a corresponding kubelet configuration value. So, as we've seen this come up, we've been filing issues to add them to the kubelet configuration, because those probably all need to be done individually as separate changes.
D: If they're already in the kubelet configuration, there's no API change and review required; but if it isn't there already, then we do need to do that. Hence the sort of split: we have a mega-issue for the deprecations, versus "oh, there are these things we need to add", where each has to go through review separately. So I know that that is ongoing, because I keep triaging new issues like "oh, this thing is missing; it's only available as a flag". So, yeah.
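To illustrate the flag-versus-config duplication being discussed, a sketch of the same setting in both forms; the eviction threshold is just an illustrative choice:

```go
// Sketch: one setting expressed as a (deprecated) kubelet command-line
// flag and as a field of the typed KubeletConfiguration object.
package sketch

import (
	kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
)

// Flag form (deprecated in favor of the config file):
//
//	kubelet --eviction-hard=memory.available<100Mi
//
// Config-file form, as the typed object the kubelet deserializes:
func evictionConfig() kubeletconfigv1beta1.KubeletConfiguration {
	return kubeletconfigv1beta1.KubeletConfiguration{
		EvictionHard: map[string]string{"memory.available": "100Mi"},
	}
}
```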
K: No, I was thinking of the things that already have a config entry and still have command-line flags — so both.
D: From an operations perspective, it's very disruptive to remove command-line flags without some sort of deprecation period, so we'd have to follow the standard deprecation cycle. I'm trying to find this in the minutes, because this was previously raised in SIG Node — I think a couple of months ago — discussing whether we should put resources into this deprecation and whatnot, and I think the ultimate conclusion involved the component-standard working group.
D: I mean, I think that's now dissolved, because there was no active leadership there — that working group had lost momentum, and other SIGs weren't actually doing migrations. So, given everything else that's in the air at SIG Node, we didn't want to prioritize doing a refactor that nobody else was doing.
K: Yeah, as far as I can tell, coming at this as an outsider, the flags were already marked as deprecated in the command line at least four years ago. I don't know whether that officially started the deprecation period — that was probably a different discussion — but anyway, it's not that important. My question was mostly around what I do about the new things, so let's continue with that part.
K: So, my take is that this logging part is still alpha, because that's what the type says, and it's just missing a comment — so users do not necessarily see that they are using something that is actually still alpha. I was wondering whether that is also the opinion of everyone else here in the group, because if it is, then I can create a PR that just adds a comment to the documentation saying that the logging field — the logging part of the configuration — is alpha.
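For context, a sketch of the logging section K refers to (assuming the types as they stood at the time): the kubelet's v1beta1 KubeletConfiguration embeds a LoggingConfiguration from component-base, whose own type was still alpha even though the surrounding config API is beta.

```go
// Sketch: selecting the JSON log format via the embedded
// component-base logging struct inside the beta kubelet config —
// the alpha-inside-beta mismatch being discussed.
package sketch

import (
	logconfigv1alpha1 "k8s.io/component-base/config/v1alpha1"
	kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
)

func jsonLogging() kubeletconfigv1beta1.KubeletConfiguration {
	return kubeletconfigv1beta1.KubeletConfiguration{
		Logging: logconfigv1alpha1.LoggingConfiguration{
			Format: "json", // default is "text"
		},
	}
}
```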
J: I think I actually made a comment about this when it went in — maybe at the time it was part of an alpha feature. From my perspective, if there's a struct that's used in a beta config API, it should be at beta-level stability. So I would encourage creating a beta package under component-base config, having the logging structs there, and having people reference that.
K: I agree that it's sub-optimal, but this is what we currently have, and I'm not sure — I need to talk with the folks in the structured logging working group about whether they are ready to commit to a beta version of the logging configuration. That is a bigger question, yeah.
D: Surely you've seen it — there's a KEP that was reviewed and approved this cycle for the deprecation of most — I —
D: Yeah, I reviewed that KEP. I mean, I know that there was talk of adding a new field and a new command-line flag; I don't know — it hasn't gone through API review yet, I don't think. It looked quite straightforward to me as a KEP reviewer, so I wasn't a stickler. Well —
D: Yeah, and I think if that doesn't get graduated by next release, it should be ripped out, because it's alpha and it's been sitting in alpha for like four releases.
K: Okay, so that puts it a bit into perspective. So let's talk about the things that I'm currently planning on deprecating. The klog flags are certainly one of them. I'm not even sure whether that affects the kubelet, or even needs approval from SIG Node, particularly because it's mostly just in component-base, and I think the kubelet will just inherit it without any changes — so that might be fine.
K: The more interesting one is around replacing support for different output streams; that is part of a KEP. We have that feature currently for plain text: you can configure klog to write to different files and then process those files with different priorities. One of the agreements, as part of adopting the KEP for deprecating the klog flags, was that a similar feature should be possible for JSON output — and long-term also, of course, for traditional plain text, perhaps.
K: So I guess that will need an API review. We can talk about it more when it's ready. I'm currently in the process, with Tim Hockin, of actually rewriting some of the command-line parsing, and once that PR is in, we can come back to that as well. That's all I wanted to mention — thanks.