From YouTube: Kubernetes SIG Scheduling Meeting - 2019-08-15
A
All right, let's start. Hello, everyone. As you know, this meeting is recorded and will be uploaded to the public internet, so chances are whatever you say will remain there for a very, very long time. With that, let's start the meeting. I have a couple of items to talk about. Hopefully these are not going to take much time, and then I know that there are a few folks who have some issues they wanted to speak about, and, I believe, a demo about the descheduler.
A
The schedulers choose a leader among themselves by acquiring a leader lock, and once a leader loses the lock, a re-election happens. Sometimes, I mean, all the time, when a scheduler loses this leader lock, it must restart. But apparently, in some of the cases that one contributor had tested, it didn't restart itself, so he had sent a PR to fix this issue. Some other folks in the community believed that the restart does happen.
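For context, the leader lock being discussed is configured through the scheduler's leader-election settings. A minimal sketch of a KubeSchedulerConfiguration, assuming the v1alpha1 component-config API of that era and purely illustrative timing values:

```yaml
# Sketch only: leader-election settings for kube-scheduler.
# Field names follow the component config of that era; values are illustrative.
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true    # compete for the leader lock with other scheduler replicas
  leaseDuration: 15s   # how long a lease is valid before a non-leader may take it
  renewDeadline: 10s   # the leader must renew within this window or lose the lock
  retryPeriod: 2s      # how often candidates retry acquiring the lock
```

The restart question above is about what the process does after the renew deadline passes: whether it exits (and is restarted by its supervisor) or keeps running without the lock.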
A
So, while the PR is fine and we can actually merge it, there is a question whether this is actually a bug fix or just a cleanup. The reason it makes a difference is that if this is actually a bug fix, we need to backport it to all the versions that we support, basically all the way to 1.13, I believe, that we support today. I don't know if the author is in the meeting now, I don't see him. Mike, do you know? You probably don't know about the status of this change.
A
Okay, all right, yeah. So I will follow up with the author later.
A
With some agreement from the SIG, we decided to consider all the static pods as critical pods, and we do not need static pods to actually have a priority or have any annotation or anything to be considered critical. All of them are considered critical, no matter what their priority is or whether they have an annotation or not. So that fix has been merged.
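To illustrate the change being described: a static pod is one the kubelet runs directly from its manifest directory, and under the new behavior it is treated as critical with no priority class and no annotation. A minimal hypothetical example:

```yaml
# Hypothetical static pod manifest, e.g. placed in /etc/kubernetes/manifests/
# on a node. Under the behavior described above, the kubelet treats it as
# critical even though it sets no priorityClassName and no critical-pod annotation.
apiVersion: v1
kind: Pod
metadata:
  name: node-exporter
  namespace: kube-system
spec:
  containers:
  - name: node-exporter
    image: prom/node-exporter:v0.18.1
```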
A
We have now been able to re-merge the removal of the critical-pod annotation, and that happened yesterday. These are the two updates that I had. So, let's see, Swathi, if you have some... she's on the call, she has a demo for us, I believe, but I don't know if she's here. All right, in the meantime, is there any question or comment around these from other folks in the meeting?
B
I have one small thing. I don't know if this is the right place to bring it up, but we had a test that was flaking around taints and tolerations. Ravi tried to merge a fix which broke the test, so we reverted that fix. I've got another PR that I just opened today, which basically brings in Ravi's fix and tries to complete it so that it shouldn't break at this point. I just put a link to that in the doc, yeah.
A
Please do. If you can, please add it to the meeting notes. And another question I have for you guys: Ravi reached out to me and said that you guys planned to rewrite some of the preemption tests, because they are flaky in your environment in certain scenarios. I actually don't know, do you plan to convert them into integration tests, or do you plan to just rewrite them in a more reliable way as e2e tests?
B
From what it sounds like, talking to Ravi, we'd like to move as many as we can to integration, because in OpenShift we have a lot of really high-load tests, and we've got a bunch of other tests running in the background with all of our operators. So I think that our goal is to move as many of them as we can to integration and then rewrite the others where we can't move them. Yeah.
A
We definitely need to have at least one, probably more than one, end-to-end preemption test, because we want to make sure that preemption actually works, that it can actually delete pods and we can update the API server and everything. So we definitely want to have an e2e test for preemption, but generally I am okay with converting e2e tests to integration tests if we don't lose any functionality of the tests, yeah.
A
So, yes. Also, Chad, as I told you, I'm not fully aware of that change, so I would rather have someone who reviewed it respond. I know Abdullah is on vacation, but he's coming back, so he can probably provide a response early next week. I hope that's fine with you, because we still have some time before the code freeze, right? Code freeze is at the end of the month, so we still have some time. Hopefully we can merge it before that.
D
It has a toleration which corresponds to the taint on the node. However, in a scenario where that taint gets updated on the node, this particular pod is violating its placement intention, so it needs to be rescheduled. That's essentially the idea behind the strategy that we're proposing. Can you see the shell here? Yes? Okay, so I'll show you the nodes.
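As an illustration of the placement violation being described, consider a hypothetical taint and a matching toleration; if the node's taint is later changed, the toleration no longer matches and the pod is violating its placement intention:

```yaml
# Hypothetical example. A node carrying a taint:
apiVersion: v1
kind: Node
metadata:
  name: worker-1
spec:
  taints:
  - key: dedicated
    value: batch
    effect: NoSchedule
---
# A pod whose toleration matches the taint above, so it may run on worker-1.
# If the node's taint is later updated (say, the value changes to "web"),
# this toleration no longer matches and the pod violates its placement intention.
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
spec:
  containers:
  - name: worker
    image: busybox
  tolerations:
  - key: dedicated
    operator: Equal
    value: batch
    effect: NoSchedule
```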
D
I have a two-node cluster over here, and what I'm going to show you is that I'll create a deployment with ten replicas, and they'll be distributed across both nodes. Once we taint a specific node, the pods of this deployment should be moved off to the other node, because the corresponding pods will no longer be compatible with the taint that has been specified. So let me just show you the deployment that I'll be deploying.
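The demo deployment might look like the following sketch (names and image are hypothetical; the point is simply ten replicas that the scheduler will spread across the two nodes):

```yaml
# Hypothetical demo deployment: ten replicas spread across the two nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 10
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: demo
        image: nginx:1.17
```

Tainting one of the nodes afterwards, for example with `kubectl taint nodes worker-1 dedicated=batch:NoSchedule` (node name and taint hypothetical), is what makes the pods on that node violate their placement.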
D
So you see the pods of this deployment are being scheduled on both of the nodes, so we have a few on one node and the rest on the other. The scheduler tries to distribute the deployment across multiple nodes, so that's what it's essentially doing. And what I'll show you here is that if I assign a taint to the node... so, taint the node.
D
In this case, there are a bunch of pods being deleted. So if you go back to the previous shell, you're seeing that a few of the pods are getting terminated; subsequently, a few pods are getting recreated, and at the end of this descheduling process, all the pods are on the other node, which is not tainted.
D
So this is the strategy that we're proposing, and it also shows that the descheduler, which is running in the kube-system namespace, continues to run even though it was on the master node, because it's a special pod. So that kind of summarizes the demo. We have a PR up on the descheduler repo. Avesh is supposed to review it, but if we can get more feedback on that, that'll be great.
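For reference, descheduler strategies are enabled through its policy file, so the proposed strategy would presumably be switched on the same way. A sketch, assuming the strategy name used in the proposal is RemovePodsViolatingNodeTaints:

```yaml
# Sketch of a descheduler policy enabling the proposed strategy.
# The strategy name is an assumption based on the proposal discussed here.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeTaints":
    enabled: true   # as suggested later in the meeting, this could default to false
```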
A
So what this looks like is that a NoSchedule taint causes a node to drain, basically, with your change, right? If I wanted to achieve this, I could have used NoExecute, right, which would have the same effect. If they really wanted to drain a node, I think they would probably put a NoExecute taint. I don't know, probably this is going to be something unexpected to happen. I would say probably someone puts a NoSchedule taint, I don't know, precisely because they don't want the existing pods to terminate on the node.
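To make the distinction being drawn concrete: the two taint effects behave differently for already-running pods. A hypothetical node spec carrying both:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-1
spec:
  taints:
  # NoSchedule only blocks *new* pods from being placed here;
  # running pods without a matching toleration stay put.
  - key: dedicated
    value: batch
    effect: NoSchedule
  # NoExecute additionally evicts running pods that do not tolerate it,
  # which is the built-in "drain" behavior referred to above.
  - key: maintenance
    value: "true"
    effect: NoExecute
```

The concern raised here is that the proposed strategy effectively gives NoSchedule taints NoExecute-like consequences for running pods.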
D
Yeah, so I think you're right in that sense, that if they wanted this behavior they'd use a NoExecute taint. But the problem that we had was that, in a deployment that is up and running where you have NoSchedule, if you want to clean up your environment and you want to make sure that your running pods are consistent with the cluster that you have, you'd probably run this as a cleanup operation to ensure that everything is compatible and in a desired state.
A
So this definitely should be configurable, because it could potentially cause surprises as well. You know, people may think, oh, I know it's NoSchedule, the node is not going to get new pods, but suddenly they could see even the existing pods disappear, which is not exactly what the API says. So yeah. Yes.
D
Yeah, so at the time it was being scheduled, it was complying with the taints that were on the node; the toleration that the pod had was complying with the taints existing on the node. But in a certain scenario the taints are updated, or your cluster gets updated, and that compatibility is no longer correct, so it's no longer complying with your provided toleration. So that's the whole idea, yeah.
A
You know, one of the reasons that you were thinking about having a descheduler is that it tries to bring the cluster, at runtime, to a state that is specified by the API. For example, if a node has a NoSchedule taint, no pods should be scheduled on that node. Or we have other scenarios here, for example, anti-affinity could be violated, so it could bring the cluster to a state that is compatible with the specified API, and so on.
D
I think, yeah, what you said is correct, in the sense that where the pod is scheduled is only considered when that request comes in initially. But the descheduler project itself talks about scenarios where you would want to actually change the placement of the pod to achieve a desired state in your cluster. So the descheduler project talks about scenarios like violating pod affinity policies, like pod anti-affinity or node anti-affinity policies, so things like that are considered as triggers.
A
You know, since we have this change, I can agree that this could be a surprise for users, especially if they know about both of these two taints and they purposefully put a NoSchedule taint in order to save the existing pods. I don't know, some feature like this could cause surprises for users. So what I would suggest is that you add this strategy to the descheduler but keep it disabled by default, so only users that really want this feature can enable it, sure.
F
It might be useful, for this kind of behavior, because it violates the API, to have it as an audit instead: the system tells the operator that these pods are in violation, but doesn't actually evict them. Okay, maybe, I don't know if that's outside of the goals of the project, but just to be compliant with the API. That sounds better to me.
D
So far, the use of the descheduler, as I've seen it, has been exactly the way you just described, as a cleanup operation once in a while, and not as a cron job. But in the repo itself it is said that you'd run it as a cron job, like maybe running once a day or something like that, as a cleanup operation. But yeah, I've seen it being run once in a while, like whenever you need to achieve a desired state.
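A sketch of running the descheduler as a cron job inside the cluster, as described above; the image reference, service account, and ConfigMap name are assumptions:

```yaml
# Hypothetical sketch: run the descheduler once a day as a CronJob.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: descheduler
  namespace: kube-system
spec:
  schedule: "0 0 * * *"            # once a day, as mentioned above
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: descheduler   # assumed to have eviction permissions
          restartPolicy: Never
          containers:
          - name: descheduler
            image: descheduler:latest       # hypothetical image reference
            command:
            - /bin/descheduler
            - --policy-config-file=/policy/policy.yaml
            volumeMounts:
            - name: policy
              mountPath: /policy
          volumes:
          - name: policy
            configMap:
              name: descheduler-policy      # holds the DeschedulerPolicy shown earlier
```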
G
I'm wondering if you could share a use case that might explain why someone would want to automate this, put it into a cron job, for these descheduling activities, because I'm unclear how you would have changes in the taints, with the pods already running, coming up so often that you'd want to run a cron job to check for and remove those workloads.
D
Oh, like I mentioned, this was an issue that was already existing in the descheduler repo itself, so I kind of picked it up from there, and it was an attempt for me to get an understanding of how the descheduler works. So I thought, okay, this is a problem that exists in the repo, and I'll go ahead and try to solve it, and get an understanding of the architecture of the descheduler itself.
D
So that's where I came from. I don't have a specific use case myself, but the issue described it exactly the way I did. I can link the issue in the agenda item as well to describe it better. But the idea was that in scenarios where the taints on the node get updated, you want to ensure compliance, yeah.
G
I think she was addressing my question. I just wanted to make sure that, when we're submitting these types of changes, we know exactly, or we have some examples of, users who actually have this need, right? If we're just coding in response to gaps that we see in the API, or something to that effect, to me that doesn't seem very valuable, unless we're making contributions that are in response to needs that users are actually having, yeah.
A
I mean, some of these changes are useful, because our own scheduler does not really check anything for running pods, right? It only cares about pending pods which are not scheduled yet. So if cluster conditions change at runtime, of course, the scheduler will no longer care, and the descheduler helps with bringing the cluster to the desired state. Yeah, I understand.
G
And I think it's a valuable process, but, like one of the other people here mentioned, running it on a cron job is different from having an admin execute it on an as-needed basis. It scares me that somebody might set this up not knowing what types of jobs would be descheduled. Yeah, I know.
A
Once it becomes an established feature of the cluster, people will probably think about it and care about it, no matter, I mean, if it matters. We also have something similar: at Google, Borg has something like this as well, which checks the cluster and tries to bring the cluster up to the standard state, or the configured state. So it's not something completely uncommon in cluster management.
A
Yeah, that was my question as well. This is actually something that could probably cause surprises if a user is not fully aware. But, you know, as we discussed, we can leave it as disabled by default in the descheduler, and the users who really want to have this kind of behavior, with checks at runtime for NoSchedule taints as well, can enable it. But otherwise, you're right, I don't think it should do much more than what the API says.