From YouTube: Kubernetes SIG Apps 20220919
A
Good morning, good evening, good afternoon, depending on where you are. Today is September 19th, and this is another of our bi-weekly SIG Apps calls. My name is Maciej and I'll be your host. Before we jump into the 1.26 items that we're going to cover, a quick reminder: the enhancements freeze for 1.26 is roughly two weeks away. It's on October 6th. I didn't even properly calculate it, but I think that's about two, two and a half weeks; it's the Thursday two weeks from this Thursday.
That is when the freeze hits; I'll link the 1.26 schedule. Actually, try to get your reviews for enhancements in sooner than that, because the Production Readiness Review, which is a required step for all enhancements, has a soft freeze that is actually a week before the actual cut-off.
Because from what I remember, there were only two persons assigned as PRR reviewers for this cycle, so try to get through most of the reviews sooner rather than later. With that, I think we can jump over to the main topics, unless Ken wants to add something; not sure if I missed anything with regards to announcements. Ken, I think.
C
Okay, let me speak. I hope you can hear me all right, loud and clear.
So what we are working on for beta is to extend the feature to handle pod failures initiated by kubelet. So maybe a couple of words about this feature in general. In this feature we tackle the problem of handling pod failures for Jobs, with a little bit more flexibility than what the standard pod retry policy of backoffLimit gives.
However, what we discovered in the experiments is that the pod end state is not standardized: pods can fail in many different ways, and you can't just look at the reason or status to tell why the pod failed, or which Kubernetes component initiated the failure. So that is a big part, or at least a part, of the feature.
So in an ideal world, whenever we kill a pod, or a pod is terminated, we add some pod condition that is then easy to interpret in terms of: do I want to retry it or not? So for alpha, what we did: we introduced a new pod condition type called DisruptionTarget, and we add this DisruptionTarget condition whenever the pod fails due to a disruption. So, for example, we add it when we evict the pod due to a NoExecute taint, or due to preemption, or due to API-initiated eviction.
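As a sketch of what that looks like on a failed pod's status (the condition type is DisruptionTarget as described; the particular reason and message strings here are illustrative assumptions, since they depend on which component triggered the disruption):

```yaml
# Illustrative status of a pod evicted via the Eviction API.
status:
  phase: Failed
  conditions:
  - type: DisruptionTarget
    status: "True"
    reason: EvictionByEvictionAPI   # assumed example; a taint- or preemption-driven
                                    # disruption would carry a different reason
    message: "Eviction API: evicting"
```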
So in all these cases it's not the pod's fault. It's just a disruption, and that is the reason for the pod failure, so we don't really want to even count this failure against the backoffLimit. So this, yeah, this shows an example of a failure policy that uses DisruptionTarget, but also, what you can see it using here is something like resource limits exceeded. So this is already hinting at what we want to do in beta.
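A failure policy of the kind being shown might be sketched as the Job spec below (based on the alpha podFailurePolicy API behind the JobPodFailurePolicy feature gate; the Job name, container name, and exit codes are placeholders, not taken from the slide):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-with-pod-failure-policy   # placeholder name
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
    # Don't count disruption-caused failures against backoffLimit.
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
    # Fail the Job outright on a known non-retriable exit code.
    - action: FailJob
      onExitCodes:
        containerName: main           # placeholder
        operator: In
        values: [42]                  # placeholder
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox                # placeholder image
        command: ["sh", "-c", "exit 0"]
```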
So in alpha we added DisruptionTarget, but we only annotated the failures that were initiated by either the scheduler or the controller manager. With beta we want to cover kubelet, and kubelet is responsible for initiating eviction in a couple of scenarios. I think the most problematic one, or at least the one that took us the most discussion time, is when a pod is evicted due to exceeding the limits for a resource like memory or disk.
However, as a result of the discussion, we intend to change this name, because how do we tell if the limits are exceeded? Well, from our investigation: when kubelet initiates eviction due to a limit being exceeded, then it's easy, because it's kubelet that checks in its code whether the limit is exceeded, and it evicts the pod if it is, so we can just easily inject the adding of the new condition there.
But it's the out-of-memory killer that kills the container, and then the container runtime, containerd, sets the reason OOMKilled, and this is what we observe in kubelet. The issue is that in some situations the out-of-memory killer can actually kill a container even if it hasn't exceeded its limits but was close; like, the node itself was under pressure, and then the out-of-memory killer has some algorithm to determine which pod, or which container or process, to kill. But it does not necessarily mean that the limits were exceeded.
So that's why the name may not be accurate, and as a result of the discussion, I think the currently leading approach is to name it something like ResourceExhausted. But in general I would welcome some input on how, and whether, it's even possible to determine if the limits are exceeded, or whether what we intend to do is the best we can have.
There are also a couple of scenarios where kubelet initiates changes which we can interpret as just disruptions. So, for example, if there is node pressure but no indication that the limits were actually exceeded, then we just add the DisruptionTarget condition, as in the other cases. There are also admission errors, but I think those are easier in general. So I think that will be it for my brief introduction of the feature and what we want to do in beta.
If you have some questions, I think we can then expand more, but in general, you know, any input, either now or on the KEP, will be welcome. So that's me.
A
In that case, of course, everyone is more than welcome to have a look at the KEP, read carefully through it, and leave comments or suggestions on it and the PRs. Next on the list is Filip.
D
So I posted a KEP which introduces a new condition called Operational, which is still up for discussion; it's mainly to start a discussion on how this condition should work and how it should operate, so we could reuse it across all the workloads and have well-defined behavior for consumers to use. So please take a look at this KEP if you want, and yeah, thanks.
A
Okay, hearing none, let's jump over to the next on the list. Ravi, I see you have two of them, but I didn't see an issue linked to the first one. Do you have that link handy? I have the pod healthy policy one open, but I don't have the other one that you started.
E
The main thing that I wanted to get some feedback on is: is this a problem the rest of the folks are facing too, especially in a multi-cluster world where we have workloads that span across clusters? Since a PDB is actually specific to a cluster, it is causing a problem there. We wanted to count PDBs across clusters to ensure that a disruption is allowed or not, so we have a couple of ideas on how to implement it. Deep is also here from Apple, who works with me in those areas, but we are wondering if anyone else has similar issues and how they have dealt with them.
B
I mean, the short answer is no right now, right? PDBs are really, when we designed them, meant for controlling disruptions for things like automated node repairs. Right, like, you're upgrading all the nodes in your cluster, and you want to find a way to go through and knock them over one by one, maybe move faster than one node at a time, and you want to try to not be disruptive to workloads that can't tolerate the disruption from the underlying infrastructure.
Modifications, right. There's no notion of that being multi-cluster aware, and there's no link between the API machinery across multiple clusters, so there's no real way to communicate an ongoing infrastructure disruption that potentially spans multiple clusters within the same region, or even within the same zone, right. So typically what I've seen people do for this is: you use the PDB to ensure that you aren't disrupted within a particular cluster, and then, for whatever higher-level mechanism you're using to orchestrate infrastructure across multiple clusters...
...you have some other mechanism to manage disruption, like for rolling cluster upgrades or for tolerating regional outages, right. So PDB is not the mechanism that I've seen people use, and not the mechanism that I've used either.
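For context, a minimal PodDisruptionBudget, scoped as always to a single cluster, looks like this (the name and labels are placeholders):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb        # placeholder
spec:
  minAvailable: 2        # block voluntary evictions that would leave fewer than 2 ready pods
  selector:
    matchLabels:
      app: myapp         # placeholder label
```

The eviction API consults this budget only within the cluster it lives in, which is why it can't express the cross-cluster constraint being discussed.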
E
I see. But say you have a multi-cluster controller which is aware of the multiple clusters involved and can actually interact with PDBs. Can we actually say that, for this particular workload, I do not want the existing PDB controller, which is in my cluster, to take care of the disruptions, but I would like to delegate it to another, higher-level controller which is multi-cluster aware? Can we have some mechanism or mechanics in place which tells it that I do not want the PDB controller to manage the disruptions for this particular workload?
B
I don't think there's a mechanism by which the eviction controller is able to call out to a third-party mechanism in order to assess the decision of an eviction, right. Because the eviction controller is ultimately, like, if you're using eviction in conjunction with a PDB, that's the way you would do it. One thing that I've seen people be successful with is the creation and removal of pod disruption budgets across clusters by a higher-level mechanism during infrastructure orchestration.
You know, let's say you start conservatively: you don't want to lose anything. Pod disruption budgets aren't really mutable in a very good way, but what you can do is create a new disruption budget that targets the same pods, a more conservative disruption budget, and then just delete the older one, and that would raise it. And then, after the disruption has passed, assuming it's not extremely long-running, and you recover the capacity and your application is there again...
...you can go ahead and, you know, be less conservative with the disruption budget in the cluster. That's the primary mechanism that I've seen used. And granted, the orchestration to do that is complicated; I'm not saying it's trivial, but it's the way I've seen it done. I'd be super open to trying to do something easier. It's just, again...
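The create-then-delete trick just described might look like the sketch below (names are placeholders; the sequencing itself lives in whatever orchestrates your clusters):

```yaml
# Step 1: create a stricter budget selecting the same pods.
# maxUnavailable: 0 effectively freezes voluntary evictions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb-freeze   # placeholder
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: myapp           # same selector as the existing, looser PDB
# Step 2: delete the original, looser PDB.
# Step 3: once the disruption has passed and capacity is recovered,
#         recreate the original PDB and delete this one.
```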
...there's no notion of multi-cluster API machinery inside of core, right. I don't have a mechanism for the API machinery to hook up to the eviction controller, which I'm actually not sure SIG Apps owns. Like, if you wanted to try to do something where we modified the eviction controller to be able to call out to a third-party mechanism via some object...
...I wouldn't do it with PDB. I would have it, or maybe with PDB, but have some type of mechanism that triggers the eviction controller to call out to another piece of machinery that helps in the eviction decision. That might be something. But I'm not even sure about SIG Apps: do we actually own the eviction controller?
A
Don't think so; maybe we did touch on it. But given that this is touching the API surface, because that lives in the API server, even if we were involved I would probably still reach out to API Machinery and ask them for feedback, because this will affect the API throughput in the long run: an eviction will basically be extended with, I don't know, some kind of third-party call. And I think we're not doing this even though the scheduler has that capability, where it will reach out to external plugins for a decision. I'm not saying it can't be done; I'm just saying that we should sync with API Machinery about, yeah, performance.
A
Yeah, I remember when we were talking about the PDB healthy policy, they were the ones that were talking and having some ideas around how this could potentially work, because they were also reusing, or, like I said, overusing, PDBs for unmanaged pods, where they're actually managing or ensuring a particular number. But yeah, I would probably agree with what Ken was explaining with regards to having a third-party tool manage the PDB side, especially since, I assume, as you're talking about multi-cluster solutions...
...you probably already have some kind of controller written that is responsible for creating those workloads in all the clusters, according to whatever rules you have codified, I guess.
E
Yeah, to be clear, what we are thinking of doing is what Ken was suggesting: have a higher-level controller which is multi-cluster aware handle the PDBs for the individual clusters. I mean, not modifying the PDBs, but creating those PDBs that would allow the disruption, or not allow the disruption, to happen.
But we are wondering about the second possibility that Ken mentioned, which is to have the eviction logic be hooked up to another external third-party controller, or some other entity, which can actually make that decision for us.
B
I mean, it's possible. My advice, trying to put myself in your shoes: it depends on what your time horizon is, really. Right, like, if you want to modify the eviction controller and get it to a beta level where it's available and enabled on most cloud providers or most distributions, that's going to take a while, right. So if this is a problem that you have right now, that might be the longer way to go about it, but it is feasible.
B
The other thing I would say, and it's a question: are you dealing primarily, like, is this a problem you have with stateful workloads primarily, or are you doing this with stateless serving workloads?
And the other thing: are they heterogeneous, or is it one particular application that is kind of core to your organization that you're worried about?
So what I've done in the past there, and I've seen other people do successfully, is, for the stateful applications, you can write a custom orchestrator using a CRD that manages the cluster itself. So get outside of the StatefulSet world and orchestrate it directly. That way you're able to manage disruption in whatever way you want, and it's easier, a lower barrier to entry, to get a working POC out, right.
For stateless serving workloads, generally, managing capacity across multiple clusters is usually a function of managing traffic ingress. That would be the way I would handle it: just, you know, whatever you're using for load balancing across the regional clusters.
Well, there are other mechanisms Kubernetes has to get around it, like writing operators or custom controllers for the workload that are capable of handling the intricacies. Because with stateful applications, a lot of times the disruption budget is sufficient only for availability, but the intricacies of the storage topology aren't captured by the disruption budget anyway.
So with Cassandra, for instance, a lot of times you can say, don't take down more than X of these, but it doesn't necessarily provide availability for any particular partition of your key families, right. And same with Kafka: you can say, don't take down more than X of these, but if you're looking to have a stronger semantic around, like, you know, this is how my partitions are replicated across the topology and I need to make sure that these two don't go down, right, there are other kinds of gotchas there.
E
Got it, yeah. So I think we have some customizations in place; by that, what I mean is we have our own CRD and there is a controller. But at a high level we have been using PDBs as the disruption building blocks through which we can say, I would like to get this workload scaled down to this many pods, or scaled up to more than this number of pods. But at this point in time, what we are wondering is: can we make it community-compatible, where we can say that, hey...
...this is something that Kubernetes is providing out of the box, or there is an extension mechanism in Kubernetes that we can go ahead and use, instead of having something that we have to maintain on our own.
B
Yeah, no one's saying you can't walk down that path, or that it would be generally unsupported. It just might be hard as well, because, like Maciej said, it definitely affects the core API and the eviction API, and then, on top of that, if you modify the behavior of the eviction controller there are conformance implications as well. So it's just a larger conversation, but it's something that can be done.
It's just, typically, if I deploy an operator that's managing Cassandra or Kafka or other stateful workloads in the Apache ecosystem, I do deploy PDBs; I'm not saying PDBs aren't part of that solution. It's just that it provides an entry point inside the in-cluster controller that operates that specific workload, where I can already modify those PDBs in an intelligent way as necessary, right. So usually that's where I've captured the logic in the past, and where I've seen other people do that kind of successfully.
E
And you would put those guardrails in the operators or the workload controller, instead of...
B
Yeah, that's what I've done in the past, right, because the controller is aware of the specific considerations for the workload. And also, from their perspective: if I'm using this because I have teams that are turning up Cassandra rings on a regular basis, a lot of times they don't want to be concerned with configuring the PDB and the API machinery. They want some API where it's like...
...you know, Cassandra ring or Cassandra cluster. They put that in the YAML, and then the controller looks at that and decides what it needs to do in terms of creating PDBs, creating Services, rolling out pods, whether they're using StatefulSets or something else under the hood, in order to make sure that works; setting the naming conventions for the Cassandra nodes, all that stuff; maybe taints and tolerations to try to keep away things...
...that might not be good tenants to run alongside Cassandra, right. Because if you get a lot of things that hit the page cache at the same time and create memory pressure, Cassandra's not as performant as you'd like. So all of those considerations, usually. And the thing about it is, if you have a framework like that and then you open source the whole thing, that's another way to contribute it to the community in a big way.
That would help other people who have the same workloads that you're running, because it sounds like a lot of what you're running are primarily internal or external versions of open source workloads that are in common use; you're not running, like, this custom thing that, yep, we built, right? Yeah.
E
Yeah, most of them can be open sourced.
All right, let me do one thing: Deep and I will go back, and then we will see if we can make those changes within the operators. I mean, we already have those operators open sourced, but we wanted to make sure that we are not doing something that is Apple-specific, or, if there is a way to do it in a community way, we would like to do that.
So we'll see if those things can be put in the open source, and we will get back to you.
A
Cool, thanks a lot, Ravi. Then on to the next topic: Filip's healthy policy for PDB.
E
So I think I should give an update on this. I have taken the PR from Morten and I've started working on the changes, but I think I made the mistake of making the API changes and the implementation together, and I think I'll not have enough time to work on the implementation side of things. So what I'm thinking of doing is making those API changes in a separate PR and then opening that PR for reviews, and Filip is going to work on the implementation side of things.
A
Yeah, I can edit this description; not sure if, Morten, it's easier to ping me about updating this so that it matches the current state. The KEP will only require updating the version numbers so that it matches what we are actually doing, because I guess it is currently pinned to the versions described in this issue.
A
Okay, hearing none, the last topic for 1.26 is something that Matthew Cary has been working on for a little while. This basically adds a new field in a StatefulSet which allows you to express what should happen with the PVCs when the StatefulSet is either scaled down or removed: whether the PVCs can be safely removed, or they should stay.
If I remember correctly, there is an updated KEP. This feature is currently in alpha, that's probably the most important bit, and Matthew was pinging me about pushing this over to beta, so I think he has a PR open to address that. If you have a little bit of time and interest in StatefulSets, have a look at the issue and the attached PRs. I'll sync with him about updating the description as well.
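The field being discussed is the StatefulSet persistentVolumeClaimRetentionPolicy, alpha at the time of this meeting behind the StatefulSetAutoDeletePVC feature gate. A sketch, with placeholder names:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example            # placeholder
spec:
  serviceName: example
  replicas: 3
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete    # remove PVCs when the StatefulSet is deleted
    whenScaled: Retain     # keep PVCs for pods removed by a scale-down
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: app
        image: busybox     # placeholder image
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```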
A
Hearing none, so with that, I'm going to give you back 22 minutes of your time. Thank you very much, all folks, for a fruitful discussion as usual, and see you next time. Bye, all right. Thanks.