From YouTube: Kubernetes SIG Scheduling Weekly Meeting for 20210923
A: Okay, hi everyone. We have a few topics. The first one is about one-to-one pod-to-node scheduling, and then I thought we could have a quick status update on both the v1beta3 component config and the simplified plugin configuration. I also added another topic for discussion, related to scoring for node resource utilization, so I'll start quickly and then leave a lot of time for that last one. Hopefully it's not too complicated.
A: What I'm proposing, and I'm going to create an issue for this for discussion, is to have a new scheduling feature, a new filter, basically, that says a pod can only schedule on a node if there are only DaemonSet pods on it. So basically we're saying that only one, quote unquote, workload pod can execute on a node.
A: We consider DaemonSet pods as, you know, agents, etc., so we don't care about them; they have to execute anyway. This is typically needed for HPC workloads, where these types of workers usually don't want to set requests. They don't want to calculate exactly how much the pod should have; they basically want the pod to use all the available CPU on the node. On the cloud you have all these types and shapes of VMs, and wherever they execute, they basically want to use the whole machine. So explicitly having a feature where they can say "I want only one, quote unquote, workload pod to execute on the node" is becoming more and more common, as we are seeing more batch and HPC workloads executing on Kubernetes. So that's my thought.
A: I'm going to create an issue to have a discussion around this, around what the exact API should look like. I don't want it to be, for example, an annotation; I don't want to go that route. I want to have a discussion around what the API at the spec level should look like. Maybe a scheduling mode, where you say something like one-to-one modulo DaemonSets. Any thoughts or comments?
B: So is the intention to just assign one single HPC pod, so to speak, to a node, or can you schedule multiple of these special HPC pods to a node?
A: It's exactly one-to-one. Basically, the filter would be really basic: when we examine a node, it would iterate over all the pods on the node. If any of those pods has an owning controller that is not a DaemonSet, we consider the node not eligible and filter it out. So we keep going until we find a node whose pods are all DaemonSet pods, or that has nothing on it, basically.
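(A minimal sketch of what such a filter might look like as a scheduler framework plugin, purely to illustrate the proposal being discussed; the plugin name and the owner-check helper are assumptions, not merged code.)

```go
// Hypothetical Filter plugin: reject any node that already runs a pod
// whose owning controller is not a DaemonSet.
package onetoone

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

type OneToOne struct{}

func (pl *OneToOne) Name() string { return "OneToOne" }

func (pl *OneToOne) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	for _, p := range nodeInfo.Pods {
		if !ownedByDaemonSet(p.Pod) {
			// A non-DaemonSet ("workload") pod already runs here, so the
			// node is not eligible for a pod that wants the whole machine.
			return framework.NewStatus(framework.Unschedulable, "node already runs a non-DaemonSet pod")
		}
	}
	return nil
}

// ownedByDaemonSet reports whether the pod's controller is a DaemonSet.
func ownedByDaemonSet(p *v1.Pod) bool {
	ref := metav1.GetControllerOf(p)
	return ref != nil && ref.Kind == "DaemonSet"
}
```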
B: Right, yeah. Another question: if a node is cordoned, cordoning essentially just means there is a taint on it, right? So how do we deal with the case where a regular pod that has the corresponding toleration gets scheduled on that cordoned node, so that it competes with the dedicated one-to-one mapping HPC pod?
A: If there is a pod on the node that requested the one-to-one mapping, then that node is not going to be a candidate.
A: So there are two cases, right? The first case is that the pod has this one-to-one scheduling mode; then the node is a candidate only if the pods that are already scheduled on it are all DaemonSet pods. The second case is that the incoming pod doesn't request it; then the node is a candidate only if it does not have a pod with this scheduling mode. If there's already a pod running on the node that is requesting one-to-one, then this node is not a candidate for an incoming pod that's not requesting it.
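(Continuing the sketch above, the symmetric check could look roughly like this; wantsOneToOne is a placeholder predicate, since the actual spec-level API is exactly what the proposed issue is meant to settle.)

```go
// nodeIsCandidate applies the two cases described above.
func nodeIsCandidate(incoming *v1.Pod, nodeInfo *framework.NodeInfo) bool {
	if wantsOneToOne(incoming) {
		// Case 1: the incoming pod requests 1:1, so every pod already
		// on the node must be owned by a DaemonSet.
		for _, p := range nodeInfo.Pods {
			if !ownedByDaemonSet(p.Pod) {
				return false
			}
		}
		return true
	}
	// Case 2: the incoming pod does not request 1:1, so the node must
	// not already host a pod that does.
	for _, p := range nodeInfo.Pods {
		if wantsOneToOne(p.Pod) {
			return false
		}
	}
	return true
}

// wantsOneToOne is a stand-in for reading whatever spec-level field
// (e.g. a scheduling mode) the API discussion lands on; it is left
// abstract here because an annotation is explicitly ruled out above.
func wantsOneToOne(p *v1.Pod) bool {
	// Placeholder implementation for the sketch.
	return false
}
```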
B: Okay, so it seems the logic of this kind of filtering needs to mix with some other filtering plugins' logic. It's kind of a combination that has to check a series of rules.
A: Sorry, not resources; it's the same thing here, you have two cases: one where the pod is requesting the one-to-one mapping and one where it's not. And it's going to be way more efficient than pod anti-affinity.
A: I believe it will also work way better for the cluster autoscaler, which does a ton of simulations, and pod affinity is always a thorn in the side of the cluster autoscaler, because it doesn't work well when you try to consider cluster-level status; it's too expensive for them. And I would argue that this is a cleaner, more explicit API for this type of use case.
A: Even if you wanted to use pod anti-affinity, you would need to have the labels on every node and on every pod, right, so they repel each other. Here you just specify that intent of having one-to-one on the pod that wants the one-to-one mapping.
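(For contrast, the anti-affinity workaround alluded to here would mean stamping something like the following on every workload pod; the label key and value are illustrative.)

```go
// workloadAntiAffinity returns roughly the anti-affinity each
// "workload" pod needs today to repel other workload pods; each pod
// must also carry the matching label.
func workloadAntiAffinity() *v1.Affinity {
	return &v1.Affinity{
		PodAntiAffinity: &v1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []v1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"workload": "true"}, // illustrative label
				},
				TopologyKey: "kubernetes.io/hostname",
			}},
		},
	}
}

// Assign the result to pod.Spec.Affinity on every workload pod.
```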
A: Okay, I'll create an issue to have more discussion there about the API, etc. I thought it would be a good idea to bring it to this SIG, just as a heads-up, and see if there's appetite for this. I think I've seen enough cases, really, from our customers internally and even outside; they feel like Kubernetes isn't giving them this simple scheduling primitive, which I kind of agree with.
A: Okay, next: a quick status update on the v1beta3 component config?
D: Yeah, last I knew, he was waiting for an API review from, I think, Jordan. So that's the status I know about for the PR introducing v1beta3. I think I'm going to work on some of the KEP and logistical things, you know, opening the PR for the docs, but I can poke him about that PR and see if there are any other review comments that he needs to get to.
A: Okay. Simplified plugin configuration status?
D: The good part is that I think during our KEP process we went through some really thorough review of the approaches, so hopefully that translates to a quicker code review process. Obviously we can't assume that, but it shouldn't take too long, and I'm also going to be doing the PRs for docs and updating the KEP along with that.
A: Okay, and last but not least...
C: Hello, yeah, I can speak to this one. So the original problem here was that when we are doing the scoring of a node, we assume certain requests on the pods if they don't specify any, and this causes a problem of underutilization on small nodes, because of how the existing logic did the calculations.
C: With this estimated request, we could get to more than 100% allocated resources in the scoring, and the existing scoring would give zero to such a node, so that could lead to underutilization.
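(A back-of-the-envelope sketch of the mechanics described here; the per-pod defaults are the scheduler's assumed requests for request-less pods, 100m CPU and 200MB memory if memory serves, and the node size is made up for illustration.)

```go
package main

import "fmt"

func main() {
	const (
		assumedMilliCPU = 100               // assumed request per request-less pod, millicores
		assumedMemory   = 200 * 1024 * 1024 // assumed request per request-less pod, bytes
	)
	// A small node, 2 cores and 4GB, running 25 request-less pods:
	pods := 25
	cpuFraction := float64(pods*assumedMilliCPU) / 2000.0
	memFraction := float64(pods*assumedMemory) / float64(4*1024*1024*1024)
	fmt.Printf("cpu=%.2f mem=%.2f\n", cpuFraction, memFraction)
	// Prints cpu=1.25 mem=1.22: over 100% "allocated", which the
	// pre-fix scoring turned into a score of zero for the node.
}
```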
C: That's what this PR was fixing. Now, there is the opposite problem that this PR brings, in the case where you want spreading. If you have the same setup, no requests, you might hit 100% utilization for both CPU and memory, and then you basically have a perfect balance, or the scheduler thinks it's a perfect balance, so it starts giving a 100 score to that node.
C: So what happens is that this node starts being overutilized, in the case where you don't want that. So there are these two competing problems. This PR was merged first in 1.22 and then backported to all the supported versions, and it brings this breakage of existing clusters, or existing tests.
C: On the other hand, Dave sent another PR, let me put it in the notes, also in 1.22.
C: In this PR (I don't know, could you open it, Abdullah? It's in the notes now), Dave changed the scoring of balanced resource allocation from an absolute difference to a standard deviation. With the absolute difference, when you have perfect balance, you get a 100 score; with the standard deviation, you would get 50. So in 1.22 this problem should be diminished.
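(A generic sketch of the two scoring shapes being contrasted, not the exact upstream code. For two resources, the population standard deviation of the usage fractions is |f1 - f2| / 2, i.e. half the absolute difference; the merged PR's exact normalization, including the score quoted here for the degenerate all-at-100% case, should be checked against the PR itself.)

```go
import "math"

// Old shape: penalize the absolute difference between the CPU and
// memory usage fractions.
func absDiffScore(cpuFraction, memFraction float64) float64 {
	return (1 - math.Abs(cpuFraction-memFraction)) * 100
}

// New shape: penalize the standard deviation of the usage fractions
// (generalizes to more than two resources).
func stdDevScore(fractions ...float64) float64 {
	mean := 0.0
	for _, f := range fractions {
		mean += f
	}
	mean /= float64(len(fractions))
	variance := 0.0
	for _, f := range fractions {
		variance += (f - mean) * (f - mean)
	}
	return (1 - math.Sqrt(variance/float64(len(fractions)))) * 100
}
```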
C: With this PR we would get a score of 50, so the problem should be diminished, but on all the previous versions, maybe we should revert the backport, to avoid breaking existing users. That would be 1.21 and 1.20, since 1.19 is already closed; those two would be the ones where we backport a revert.
D: Hi, hello. So this was first brought to me by, I guess, the perf and scale team here at OpenShift, and when we were trying to figure out where exactly the new behavior started to come from, we found that first PR, your PR there, and so we're reverting that in a 1.21.4 base. That fixed the problem.
D: But as of the last I know, we're still seeing this issue in 1.22, so I don't know if this did address it or not, like you were saying. We're still trying to gather some information on that, and I think some of the people we were talking to are going to open a GitHub issue to track it.
D: I definitely think that at least the older-version backports could get reverted to maintain consistent behavior, like you were saying, but we should probably look a little more into what could be causing this, or whether we can reproduce it in a vanilla Kubernetes install, to totally eliminate any concern that it could be OpenShift causing it.
D: Yeah, so we actually first noticed it through a 1.22 version.
D: I don't know exactly the cluster state that they're running it on, and I don't know the size of the nodes. I know that it's being run in something like a 25-node, couple-thousand-pod setup; I don't know the exact size of those nodes or their testing setup, but I've asked them to try to write down the reproduction test steps as much as they can.
C: I see. Yeah, I thought it would be much better in 1.22.
C: Having no requests is problematic in general, yeah. I don't know.
C: The balanced allocation score uses the difference between the usage fractions for CPU and memory.
C: When you don't have requests, we estimate, we guess, what the request of a pod could be.
C: And with this estimation, because it's a constant, if the node is small, we can get over 100% utilization. That causes the balance score to basically conclude that the node is balanced, because everything is at 100% utilization, so that's a perfect balance.
C: That is causing overutilization of nodes. In the previous behavior we would give a score of zero, which would cause underutilization of the node. So both are problematic.
A: But what is it, Mike, what are you observing exactly?
D: What we started to see from this is that up to a certain point, all the nodes are filled pretty evenly, and then, at a pretty consistent level across tests, around 144 pods per node, it would suddenly start to just prefer one node and dump the next 100 or so pods onto that node, and then I think it would go back to spreading between them. That behavior was the odd distribution we were seeing.
A: So if we revert the PR, what would be the behavior? It would basically not consider the node balanced in terms of CPU and memory utilization, even though the pods are not requesting CPU nor memory, right? Yeah, it would give zero.
D: Right, but, you know, even if we get all of the pods in our cluster to set requests, it's a pretty big ask to push that onto users and tell them that now they need to set requests on all of their pods, or else they could end up seeing this, especially with a large cluster.
C: So another solution could be to lower our assumption. We have these constants, right, that give a request to a pod, which we use for scoring. If we reduce those, then we would hit this with less frequency.
C: I don't know if that's something we could consider.
C: But for the record, I think it's better to just revert the change for 1.21 and 1.20. I think we need a better solution for 1.22 and 1.23, though.
C: Mike, could you craft the reverts? And if you can open the issue, we can start brainstorming some ideas.
D: Yeah, I'll open up the reverts, and, since that team has more test information than I do, I'm going to push the people we were talking to to open up the bug with all of the information they have, and then we can get the discussion going offline.