From YouTube: Kubernetes SIG Scheduling Meeting - 2019-01-24
A: Renee is here today. Let's start with you, since you may not be interested in the rest of the meeting. I'll let you start; actually, I can give a quick update about what this topic is, then we can discuss it, and after that you can feel free to disconnect if you want. So, for other folks: I have put a link in our meeting notes to the issue, or the PR I sent, which is a proposal for in-place updates of pods.
A: The idea behind in-place update of pods is that sometimes pods want to change their resource requirements; for example, a pod may need more CPU or memory. In those cases you would like to have a system which automatically changes those resource requirements. This is especially needed because a lot of users don't know how much their pods require, and even if they know, these requirements often change during the course of execution of many pods. For example, a server may suddenly get more traffic and, as a result, need more CPU. So we would like to build a system that automatically changes these resource requirements based on the actual usage of pods. There is a proposal; I just sent a link to it. We already have a mechanism for changing resource requirements of pods, but the existing mechanism requires pods to be restarted after these changes are applied. What we are pursuing is another approach that does not require, or doesn't necessarily require, a restart, and that's the subject of our discussion today. There are some design decisions that we would like to make, and Rene is here for that.
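The difference between the two mechanisms described here can be sketched in a few lines (illustrative Python only, not Kubernetes API code; all names and fields are made up for the example):

```python
# Minimal sketch contrasting the restart-based resize, which replaces the
# pod, with an in-place resize, which patches the running pod's requests.

def resize_with_restart(pod, new_requests):
    """Current mechanism: the pod is killed and recreated with new requests."""
    return {
        "name": pod["name"],
        "requests": dict(new_requests),
        "restart_count": pod["restart_count"] + 1,  # workload is disrupted
    }

def resize_in_place(pod, new_requests):
    """Proposed mechanism: mutate requests on the live pod, no restart."""
    pod["requests"].update(new_requests)            # restart_count untouched
    return pod

pod = {"name": "web-1", "requests": {"cpu": 0.5, "memory": 256}, "restart_count": 0}
restarted = resize_with_restart(pod, {"cpu": 1.0})
updated = resize_in_place(pod, {"cpu": 1.0})
```

The point of the proposal is the second path: the CPU request changes while the restart count, and hence the running workload, is left alone.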
B: Thank you. The backstory of the proposal that we were looking at is that there have been a couple of design proposals over the past year. In our design proposal, we looked at how an in-place update should happen and what the control flow should be: who the initiators are. The initiating actor in our case is a job controller; our customer requirement stems from a long-running job. Having max resources allocated for the peak usage is expensive, and they wanted to see if they could get something that can scale up and scale down as the resource needs increase and decrease. The other main use case here is VPA, which is currently using a restart mechanism and is looking into an admission controller to update the resource requirements on a pod that's being created, before it's scheduled. The design that we worked on involves keeping the scheduler in the loop.
B: What we have today is: when you create a new pod, the controller goes and creates the pod, and then the scheduler sees that a pod has been created that's not bound. It goes and runs its predicates on the pod, one of which is a resource check, finds the best node for it to be assigned to, picks a node, and assigns the pod to it. The flow that we want to use is similar.
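The resource-check predicate described here amounts to comparing the pod's requests against what is left on each node; roughly (a simplification, not the real scheduler code):

```python
# Toy version of the scheduler flow: run a resource-fit predicate over the
# nodes and bind the pod to the first node where it fits.

def fits(node, pod):
    used_cpu = sum(p["cpu"] for p in node["pods"])
    used_mem = sum(p["mem"] for p in node["pods"])
    return (used_cpu + pod["cpu"] <= node["cpu"] and
            used_mem + pod["mem"] <= node["mem"])

def schedule(nodes, pod):
    for node in nodes:
        if fits(node, pod):
            node["pods"].append(pod)   # bind the pod to this node
            return node["name"]
    return None                        # pod stays pending

nodes = [
    {"name": "node-a", "cpu": 2.0, "mem": 4096, "pods": [{"cpu": 1.5, "mem": 1024}]},
    {"name": "node-b", "cpu": 4.0, "mem": 8192, "pods": []},
]
assigned = schedule(nodes, {"cpu": 1.0, "mem": 2048})
```

Here node-a fails the CPU check, so the pod lands on node-b; the in-place-update proposal wants the same predicate consulted for a resize, not just for a new pod.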
B: So, as part of this, the flow that we were looking to do is have the scheduler pick up the resource update first and then either say yes or no. The proposal that Karol has here, which is what I'm planning to merge, accounts for one other case, where the scheduler can preempt lower-priority pods and get things going; that case was not caught in our proposal. Here is the issue with the current flow.
B: Okay, going into the details of this particular comment: it looks like Bobby already commented, and thank you, Bobby, for looking at this particular issue. The issue that I was looking at is that in the current proposal we update the resource requirements after validation; it's updated in the pod spec, then the scheduler checks the pod first and preempts. At the same time, the kubelet tries to apply it, and this is where I saw the problem.
B: The scheduler is at that time working on scheduling another pod to the node, against the capacity which it sees as available, and it schedules that pod. The scheduled pod goes to the kubelet, the kubelet sees it doesn't fit and rejects it, and then the kubelet processes the update request and fails it, saying: okay, no.
B: This node is full, I cannot do this update now, and the initiating actor has to take that pod out. The kubelet's work is kind of doubled, because it has to schedule pod P2 again and then P1 gets rescheduled, because the initiating actor most likely will kill and then recreate the pod, which increases the kubelet's workload overall. So this is the point that I felt was an issue, and I'm working on a flow proposal.
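The race described here can be reproduced in a toy model: the scheduler admits a new pod P2 against the capacity it currently sees, while an in-place increase for P1 is still in flight to the kubelet, and whichever change lands second is rejected (illustrative numbers, nothing here is real Kubernetes code):

```python
# Toy model of the scheduler/kubelet race around an in-place resize.

def kubelet_admit(node_capacity, running, request):
    """Admit a request only if it still fits on the node."""
    used = sum(running.values())
    return used + request <= node_capacity

capacity = 4.0
running = {"p1": 2.0}                  # kubelet's view of the node (CPU cores)

# 1. The initiating actor asks to grow p1 by 1.5 cores (request in flight).
p1_increase = 1.5

# 2. The scheduler, unaware of the pending increase, still sees 2.0 cores
#    free and places a new 2.0-core pod p2 on the node.
scheduler_view_free = capacity - sum(running.values())
p2_admitted = kubelet_admit(capacity, running, 2.0)
if p2_admitted:
    running["p2"] = 2.0

# 3. The resize request now reaches the kubelet and no longer fits.
resize_ok = kubelet_admit(capacity, running, p1_increase)
```

One of the two operations had to lose; in the flow being criticized, that loss turns into a failed update, a rescheduled pod, and doubled kubelet work.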
B: It's more in line with what's being proposed right now. There are some nice things to Karol's approach, where he's reusing existing pod conditions and container statuses. So I'm writing some quick-and-dirty code to test it out and see if there are any major gotchas in applying our approach as used by the controllers.
B: The controller would have some smart retry mechanisms: okay, if the in-place update failed and it has to be done in place, then what we're going to do is look at pods leaving that node. When pods leave that node, there is a potential that capacity has opened up, and then we retry. That way we are not burdening the scheduler with random retries, and the retry is done when there is a certain expectation that it should succeed.
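The "smart retry" policy sketched here is essentially event-driven: park the failed resize and only resubmit it when a pod leaves the same node. A minimal sketch of that idea (the class and method names are made up for illustration, not real controller APIs):

```python
# Sketch of a retry policy keyed on pod departures rather than timers.

class ResizeRetrier:
    def __init__(self):
        self.pending = {}                 # node name -> failed resize requests

    def resize_failed(self, node, resize):
        """An in-place resize was rejected for lack of capacity: park it."""
        self.pending.setdefault(node, []).append(resize)

    def on_pod_deleted(self, node):
        """A pod left `node`: capacity may have opened up, retry resizes there."""
        return self.pending.pop(node, []) # caller re-submits these requests

r = ResizeRetrier()
r.resize_failed("node-a", {"pod": "p1", "cpu": 1.5})
nothing = r.on_pod_deleted("node-b")      # unrelated node: nothing to retry
retried = r.on_pod_deleted("node-a")      # departure on node-a triggers retry
```

The payoff is exactly what the speaker says: no random retries hitting the scheduler, and each retry happens when there is some expectation it will succeed.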
A: I actually went and read the whole thing, and I agree that it's not a good idea to have racy sorts of algorithms between two components. As you mentioned, it causes extra activity in various modules, including the scheduler, which is sort of a precious resource for us, especially in larger clusters. If this happens a lot, it could negatively impact the scheduling throughput, so we prefer to remove the races as much as possible.
B: Another question that I was looking into, with Karol as well: Derek had raised a concern about the "gamification" of the resources, and Karol had suggested using the max of resources allocated versus resources requested (the desired resources). I'm trying to understand that a little bit more closely, mainly from this perspective: it does seem to act more on the conservative side, which is good; you will end up in a situation where you're not over-provisioning any node.
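The max-based accounting being discussed can be stated very compactly: while a resize is in flight, charge the node the larger of the allocated and desired values, so neither a pending grow nor a pending shrink can cause over-provisioning. A sketch under that reading (illustrative only):

```python
# Sketch of max(allocated, desired) accounting for in-flight resizes.

def accounted(allocated, desired):
    """Charge the conservative (larger) value while a resize is pending."""
    return max(allocated, desired)

def node_used(pods):
    return sum(accounted(p["allocated"], p["desired"]) for p in pods)

pods = [
    {"name": "p1", "allocated": 2.0, "desired": 1.0},  # shrink in flight
    {"name": "p2", "allocated": 1.0, "desired": 1.5},  # grow in flight
]
used = node_used(pods)   # charges 2.0 + 1.5, not 1.0 + 1.0 or 2.0 + 1.0
```

The shrink of p1 is not credited until it actually lands, and the grow of p2 is reserved up front, which is exactly the conservative behavior described above.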
A: I don't remember this exact problem; my memory is not that good, so maybe I had this conversation, but I don't recall exactly. Do you know what exactly this max of resources is used for? I mean, alternatively, if you're okay with having different QoS classes, we could use limits for setting the maximum, to indicate the maximum resources that VPA could go for, for a particular pod. So I don't exactly remember what the max of resources is used for.
B: During a reduction it might also work out, in the sense that: okay, your desired resources are now lower than the actual resources, and when you're accounting for it you account for the max, so that you're not over-provisioning. To take a simple example: let's say there is one node with one pod which is using all the capacity, and then the scheduler reduces it to half, and then another new pod comes for scheduling which can fit on that node.
B: The one potential problem I see is: the scheduler goes and reduces the capacity and tells the kubelet, okay, go ahead and reduce its capacity. The kubelet hasn't seen that update yet, and then this new pod comes in and gets scheduled to that same node. Now, if these two updates are reordered, then the scheduled pod will get rejected. That is one potential problem I see. Do you see that happening? Could that happen? Yeah.
B: That's what I want to wrap my head around today and understand more closely what it exactly means; maybe that's what the "gamification" term that Derek used refers to. I'll follow up with Derek and see if this is what he meant, and whether this particular scenario is what Karol and Derek were worried about. As far as the rest of the document goes, I believe you had a couple of questions in there.
B: Yeah, okay. I will take these new parameters that Karol has, and probably tomorrow I'll be able to finish my quick prototype and verify. The only outstanding issue in this, for which there is no good solution, is when you have two schedulers acting independently. In that case the best thing to do is: the scheduler sees that the kubelet has failed the update request and it deducts it; let's say it's an increase, and then the scheduler can detect a transition.
A: Basically, the basis of the idea is to accept the fact that there is a race condition and try to deal with it. If it happens, we should not insist on the same decision that we have made already, because it has failed. The scheduler can try a different node, although we don't have any of that logic currently; basically, the scheduler does not keep any history, so maybe we need to add that. That's one approach, but there could be other options as well. Yeah.
B: The approach we have taken is to keep the scheduler simple. The controller, the initiating actor (in this case it could be VPA or the job controller), would see that it has failed and then retry it, and we were looking at a couple of different policies. In the case of the job controller, when it sees that it has failed, it goes and retries it at a later point, when pods leave, and we use the reason field in the pod condition to specify why it failed: it failed because of capacity.
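The reason-field idea can be sketched as follows. Note that the condition type and reason strings below are hypothetical stand-ins for whatever the proposal settles on; only the shape (type/status/reason, as in a Kubernetes pod condition) follows the discussion:

```python
# Sketch: surface the resize failure through a pod condition, and let the
# controller choose a retry policy from the recorded reason.

def make_resize_condition(ok, reason=None):
    return {
        "type": "ResourcesResized",        # hypothetical condition type
        "status": "True" if ok else "False",
        "reason": reason,                  # e.g. "InsufficientNodeCapacity"
    }

def retry_policy(condition):
    if condition["status"] == "True":
        return "done"
    if condition["reason"] == "InsufficientNodeCapacity":
        return "retry-when-pods-leave"     # capacity may open up later
    return "give-up"

cond = make_resize_condition(False, "InsufficientNodeCapacity")
policy = retry_policy(cond)
```

Because the failure reason is machine-readable, the initiating actor can tell "retry later when capacity frees up" apart from failures that are not worth retrying.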
B: Okay, the node doesn't have capacity now, but it will have capacity when pods leave, and this is something the controller has a view of, so the controller can decide: when that burstable pod leaves the node, now is a good time to retry. The other cases are when you have deployment controllers, where you're resizing all the instances of the pod, and if the resize requires restarting, then you could violate the pod disruption budget.
B: VPA might have other ideas of how to handle it; I think they were looking at a flow where the scheduler takes the task of kicking out low-priority pods, which makes sense, because outside of the Kubernetes cluster nobody has control. The control is with the scheduler and the controllers, to see if they can move low-priority pods off the nodes so that the higher-priority pods can get the resources that they desire.
A: In the past, we tried to change the kubelet and add logic to taint the nodes at startup, but that had upgrade issues. So we are now pursuing a different approach, basically changing the API server to taint nodes at creation time. It looks like the PR is now ready to get merged, and we will cherry-pick this change to all the releases, basically since 1.12.
A: Hopefully this will resolve some of those issues. Another issue we have faced recently, which Wei has been working on, is that the scheduler sometimes leaves some pods in the pending state and does not retry those pods. As some of you may already be aware, we have logic in the scheduler to not retry unschedulable pods until there is a change in the cluster that makes the pods more schedulable, for example a node being added, and things like that.

A: In the past, the scheduler was reacting and retrying all these unscheduled pods at every node heartbeat, which arrived roughly every 10 seconds from each node, so in a larger cluster it would happen very frequently. Nowadays the scheduler is more efficient, but we know that there could be some races where the scheduler could possibly miss some of these events while it's trying a pod, and if that pod, which is in flight, is determined to be unschedulable, then sometimes this pod may not be retried. So we've had a mechanism to retry some of these pending pods, but that mechanism is only in master; we may need to cherry-pick it into all the releases to solve this issue. I will follow up with Wei on this.
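The event-driven behavior described here, parking unschedulable pods and only retrying them when a relevant cluster event arrives, can be sketched as a two-queue structure (this mirrors the idea only, not the real scheduler's queue implementation):

```python
# Sketch: unschedulable pods wait in a separate pool and move back to the
# active queue only when a cluster event (e.g. node added) could help them.

class SchedulingQueue:
    def __init__(self):
        self.active = []
        self.unschedulable = []

    def add(self, pod):
        self.active.append(pod)

    def mark_unschedulable(self, pod):
        self.unschedulable.append(pod)    # no blind periodic retry

    def on_cluster_event(self):
        """E.g. a node was added: flush parked pods back for another attempt."""
        self.active.extend(self.unschedulable)
        self.unschedulable = []

q = SchedulingQueue()
q.add("p1")
q.mark_unschedulable(q.active.pop())      # scheduling attempt failed
before = list(q.active)                   # empty: p1 is parked
q.on_cluster_event()                      # node event triggers the retry
after = list(q.active)
```

The race being discussed is a pod whose failure is recorded just after the event that would have flushed it, so it sits parked; the retry mechanism mentioned above exists to unstick such pods.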
A: So, Valerie, I know that you have raised interest in one of our issues, which is non-preempting priority classes. I do support adding that feature. I'm not so sure that I will have enough time to help with fixing the problems in the existing PR for this feature, but if you think you can help with that feature, I would really appreciate your help.

C: [inaudible]
A: So yeah, for helping with the code base and sort of mentoring people, I don't know if I will find enough time, to be honest with you. I will be happy to answer maybe some questions which are a little bit quicker, but for the ones which need more time, it's really hard for me at this point.
D: SIG ContribEx has been working on the mentoring program for a long time to try to meet needs like this, whether it's something like a code-base tour or help on a particular PR. So I think it's quite likely that the knowledge that's needed is not just in Bobby's head but somewhere in this SIG,
D: among a few of us. So I think, you know, asking this SIG who has time for, for example, a code-base tour is one step, but another would be finding the landing page for the mentoring program, to see if putting in a request is the right path for this particular issue. Yeah.
A: Thanks, these are all great points. You can seek help there, and in fact the failure that we are seeing is not in the scheduler code; it's actually more on the API side, so it falls mostly in the API machinery and how the API should be added. And since this change is touching the API, there is some amount of work to be done there.
A: Some amount of code should be automatically generated, and we should make sure that you're touching the right places to ensure that all this new code is generated properly. I'm not sure whether the failure you're seeing is because of that. So yeah, seeking help from, say, SIG ContribEx is probably the best approach at this point. Okay, yeah, thank you. Any other questions or comments, or updates from projects you guys are working on?
A: Okay, one quick thing: Harry, I know that you and one of your colleagues, I believe, have been working on the equivalence cache, or equivalence class; at this point, I guess, it's not quite a cache, it's more like a class. I'm sorry that I haven't had the chance to take a look at the PR, but it's on my to-do list; I will definitely take a look, and hopefully we can get that going.