From YouTube: Kubernetes SIG Scheduling Meeting - 2019-09-12
A
A
B
Sure. So it's been some time since I've talked about this, so here is a short overview of what we intend to do here. Today we cannot resize a pod, as in change the resources field in the pod spec, without restarting the pod; it's disallowed by the validation check in most cases. The intent of this KEP is to make that field mutable, and then have a mechanism in core Kubernetes to handle the change and drive towards the new resources that the user specifies as the desired state.
B
A little history: our first design kept the scheduler in the loop, where the scheduler would first look at the resize request and approve or reject it based on its view of the node capacity, and then the kubelet would act on what the scheduler approved. SIG Node felt this was overly complicated to manage, and Derek felt that once the pod is bound to a node, the resizing should purely be a conversation between the API server and the kubelet; the scheduler can potentially help by preempting lower-priority pods.
B
But that is something we plan to take up in the future; it's not part of this KEP at this time. So once the pod is bound and a resize is requested, the kubelet looks at it and decides whether it can accept it or not, based on the available capacity in the node. If it does, it indicates that by updating a new field that's been added to the pod spec, called resourcesAllocated, which is the requests part of the resources, and says: okay, yes, I have room.
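The acceptance step described above can be sketched roughly as follows. This is a minimal illustration, not the real kubelet code: the `resourcesAllocated` field is from the KEP, but all type, field, and function names here are hypothetical simplifications (CPU millicores only).

```go
package main

import "fmt"

// PodResources is a simplified stand-in for the relevant pod spec fields:
// Requested mirrors resources.requests (the user's desired state) and
// Allocated mirrors the new resourcesAllocated field (what the kubelet
// has agreed to). Values are CPU millicores for brevity.
type PodResources struct {
	Requested int64
	Allocated int64
}

// acceptResize is the kubelet-side decision sketched in the discussion:
// if the node has room for the delta, record the new value in Allocated
// ("okay, yes, I have room"); otherwise leave Allocated unchanged.
func acceptResize(p *PodResources, nodeFreeMilli int64) bool {
	delta := p.Requested - p.Allocated
	if delta > nodeFreeMilli {
		return false // not enough room; the resize stays pending
	}
	p.Allocated = p.Requested
	return true
}

func main() {
	p := PodResources{Requested: 1500, Allocated: 1000} // user asked to grow by 500m
	fmt.Println(acceptResize(&p, 400), p.Allocated)     // false 1000
	fmt.Println(acceptResize(&p, 600), p.Allocated)     // true 1500
}
```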
B
Essentially, pod updates are taken in, and the kubelet updates the resourcesAllocated section when a resize has been accepted. The change to the scheduler here would be to start using resourcesAllocated instead of resources to compute the pods' resource requirements: when a new pod is being scheduled, it looks at the existing pods, looks at their resourcesAllocated field, and then computes.
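The scheduler-side accounting change described above might look roughly like this. It is a sketch under the assumptions of this discussion, not the real scheduler code; types and names are illustrative, with only CPU millicores tracked:

```go
package main

import "fmt"

// Pod is a simplified stand-in for the two per-pod values discussed above:
// Requests mirrors resources.requests (what the user desires) and Allocated
// mirrors the new resourcesAllocated field (what the kubelet has agreed to).
type Pod struct {
	Name      string
	Requests  int64 // CPU millicores
	Allocated int64 // CPU millicores
}

// nodeUsed sums usage over pods already bound to the node using Allocated,
// per the proposal: a resize the kubelet has not accepted yet must not
// count against node capacity.
func nodeUsed(bound []Pod) int64 {
	var used int64
	for _, p := range bound {
		used += p.Allocated
	}
	return used
}

// fits judges a brand-new pod by its Requests against the remaining room,
// which is exactly what the scheduler does today for new pods.
func fits(newPod Pod, bound []Pod, capacity int64) bool {
	return nodeUsed(bound)+newPod.Requests <= capacity
}

func main() {
	bound := []Pod{
		// Asked to grow to 2000m, but the kubelet has only accepted
		// 1000m so far: only 1000m counts toward node usage.
		{Name: "a", Requests: 2000, Allocated: 1000},
		{Name: "b", Requests: 500, Allocated: 500},
	}
	fmt.Println(nodeUsed(bound))                                  // 1500
	fmt.Println(fits(Pod{Name: "c", Requests: 400}, bound, 2000)) // true
}
```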
B
A
To reiterate and make sure that I understand: the scheduler will actually look at exactly the same thing we are doing right now when we admit the pod, which is the pod's resource requests, like resources.requests — I guess that's the field? Yes, for the new pod. Yes, only for a new pod.
B
A
And my point is, for pods that are not bound — the pods that we are trying to schedule — so for any pod that we are trying to schedule, we will always look at its resources.requests to decide where to place it, correct? But to compute how much resources a node is using, we will iterate over the pods on that node and look at the resourcesAllocated field? Correct, right. Okay, so this is what you meant by bound pods: for bound pods, we look at resourcesAllocated.
A
B
Today, all it does is update the cache: when something like annotations changes, it just updates the pod. I believe the change here would be in its accounting. The resources.requests field would be something that the user desires, so it may not necessarily be what the kubelet has agreed to; the computation should be based on what the kubelet has agreed to. For a brand-new pod, the API server sets this early, at pod creation time.
B
We'd set this field if it is empty, and the current thought process is that if the user sets it at creation, we validate that it is equal to requests; if not, we fail the validation. I'm bringing this up because David Ashpole suggested there might be a use case — he can potentially see a use case where the user might have them differ.
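The defaulting-and-validation rule just described can be sketched as below. This is an illustration of the stated rule only — the function name and signature are hypothetical, not the real API server validation code:

```go
package main

import "fmt"

// defaultAndValidate sketches the creation-time behavior discussed above:
// if resourcesAllocated is unset, the API server defaults it to requests;
// if the user did set it, it must equal requests or validation fails.
// A single CPU-millicore value stands in for the full resource map.
func defaultAndValidate(requests int64, allocated *int64) (int64, error) {
	if allocated == nil {
		return requests, nil // empty: default to requests at pod creation
	}
	if *allocated != requests {
		return 0, fmt.Errorf(
			"resourcesAllocated (%d) must equal requests (%d) at creation",
			*allocated, requests)
	}
	return *allocated, nil
}

func main() {
	got, err := defaultAndValidate(500, nil)
	fmt.Println(got, err) // defaulted to 500, no error

	bad := int64(200)
	_, err = defaultAndValidate(500, &bad)
	fmt.Println(err != nil) // true: mismatch fails validation
}
```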
B
A
Right, and actually the case that David Ashpole proposed could also — I don't know — could also be useful for use cases like, again, the second item on the agenda: approaches to how to inform the scheduler of other compute workloads. You could use that as a hack, basically, to reserve more resources on the node than you actually use.
B
I think that requires more thought. It's a nice-to-have. I kind of see it as: this gives the user a way to say, hey, this is my ideal resource requirement, but this is the minimum for a brand-new pod — schedule me at the minimum and, if possible, drive towards the ideal. That could potentially be a nice-to-have; once this is in and we see how well this works, we can take that up.
A
B
I don't know yet. I realized a couple of weeks ago, just before we kind of got consensus with SIG Node that this is the way to go, that we need to do a review with SIG Architecture — an API review — and I have requested the API review, and we're planning to try and schedule that in the next few weeks. Before I went to them, I wanted to make sure that we cross our T's and dot our I's, kind of.
B
So the key impact here is to the kubelet — SIG Node — which is kind of looking like they're going to agree to it; I already got one LGTM, from David, but Derek and Dawn have to approve this. SIG Autoscaling should have no issues; I'm going to talk to them on Monday at their weekly meeting. And SIG Scheduling, which is you: we have to agree that this is not going to be a problem for the scheduler. Once we have all these, it'll be a strong case for me to talk to them. So, Bob?
A
C
So once a pod is preempted, the node may look like it has some free resources, and at the same time the scheduler wants to use those resources — so the scheduler or the kubelet may think that resource is available. Similarly — and I know what I'm talking about right now is the regular scheduling flow — a certain amount of resources is available, and VPA and the scheduler both decide at the same time to use it: one for increasing the resource requirements
C
for a pod, the other one for scheduling a new pod. So there are these concerns. I guess at some point we decided that these kinds of race conditions are unavoidable in large distributed systems; instead of trying to solve them, we should accept their existence and try to resolve them after they happen.
C
B
B
However, the thought process is that this should be a fairly unlikely, rare event given the overall system load, and there's no way to know this until we have something in production — it's a chicken-and-egg problem. Once we have something that's in there and is used at large scale, we will have the data to determine: okay, maybe we should update our implementation to have the scheduler in the loop. That's something I could not justify without having the data, because the only way for me to get the data is to have something in the system.
B
C
Yeah, well, one thing to note, though, is that these kinds of race conditions are somewhat rare in the Kubernetes world, because most of the time Kubernetes clusters are created in cloud environments, and the assumption is that there are plenty of nodes available for all the workloads in the cluster. As a result, most Kubernetes clusters do not have a ton of pending pods waiting to be scheduled. In on-prem clusters,
C
the story is completely different: for example, in Borg they almost always have a ton of pending tasks in every cluster. When there are so many pending tasks or pending pods in a cluster, these race conditions can potentially happen more often. So, yeah, we should see in practice, and then, based on feedback, we can of course adjust.
B
This feature actually helps there. One of the cases: there's a company called JD.com, which used our first implementation, and they found that if they see a lot of pending pods, they go in and see which pods are not really using their capacity — their allocations — and they size them down, and that way they get more work running. So this feature can help.
A
B
A
D
B
D
A
E
So my ideas are not very formalized — I didn't go through a SIG first. The problem that we have is: we have an existing VM scheduler, and it's scheduling VMs on a set of physical nodes, and we were looking to add Kubernetes to that management plane. But I think it'll be a long time before we switch to the Kubernetes scheduler to schedule the VMs, and I was wondering how we can have these two systems coexist.
A
E
If you look at, like, KubeVirt — KubeVirt schedules, I think it was, pods to schedule the VMs, but they create the VMs inside the same namespace as the pod. Okay, but if we scheduled the VMs outside of Kubernetes and we watched Kubernetes resource allocations, we'd also want to be able to inform Kubernetes about our resource allocations, and I was curious how to accomplish this. There are a few things I saw; one was that I could create pods that didn't do anything and that were specifically bound to nodes.
E
C
E
If you look at KubeVirt — KubeVirt solved this problem by scheduling the VMs by scheduling pods. They created a VM CRD; they scheduled a pod, and that pod represents the resource allocation of the CRD. And then, when the pod was scheduled, they actually created a VM in its place. I see. So
E
from the scheduler's point of view, those VMs are pods? Yes. I was trying to understand: if I didn't do all the scheduling through Kubernetes, how could I back-channel inform Kubernetes about these? One idea I had was to create pods that were assigned to specific nodes, that represented the resource utilization, and that were empty. Yeah, I wasn't sure if you had any other suggestions, potentially, on how to approach this problem. Yeah.
C
That's one way of doing it. In fact, we have a concept similar to what you just described, called mirror pods. Those are used to represent node resource usage: there are static pods that are created directly on nodes, and then these mirror pods are created to represent them — they're sort of like the logical object — and they are stored on the API server.
C
We actually recently felt like these are not great, because there are some problems with respect to, like, race conditions — when these pods are created on the node versus when the mirror pods are created on the API server, and whatnot. But generally they work. They are not, of course, as great as just regular, natural pods, but that's one way you can solve this problem.
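The placeholder-pod idea discussed above can be sketched as follows. This is a hypothetical illustration only — the type, fields, and naming convention are simplified stand-ins for a real pre-bound PodSpec, not an existing API:

```go
package main

import "fmt"

// ReservationPod sketches the "empty pod bound to a node" idea: a pod with
// nodeName already set (so the scheduler never has to place it) whose
// requests mirror an externally scheduled VM's footprint.
type ReservationPod struct {
	Name     string
	NodeName string // pre-bound; the scheduler only accounts for it
	CPUMilli int64  // mirrors resources.requests for the VM, in millicores
}

// reservationFor builds such a placeholder for one externally managed VM.
func reservationFor(vm, node string, cpuMilli int64) ReservationPod {
	return ReservationPod{
		Name:     "vm-reservation-" + vm,
		NodeName: node,
		CPUMilli: cpuMilli,
	}
}

// freeCPUMilli shows why this informs Kubernetes: the scheduler would
// subtract these requests from node capacity exactly as it does for
// real pods, so the externally used resources stay unavailable.
func freeCPUMilli(capacity int64, reservations []ReservationPod) int64 {
	for _, r := range reservations {
		capacity -= r.CPUMilli
	}
	return capacity
}

func main() {
	res := []ReservationPod{reservationFor("db-1", "host-07", 2000)}
	fmt.Println(freeCPUMilli(8000, res)) // 6000m left for Kubernetes pods
}
```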
C
C
A
Is there a downside to using what you just proposed? Because a pod, from the scheduler's perspective, represents the resources that we are reserving on the node, so to me it sounds like a logical solution. But how are you going to do it — are you going to create the VM not from the pod? You're going to create this VM via, like, an external — so.
E
I was creating the VM externally. I wasn't sure if there was an easy way to potentially create the reservation before I realize it, and flush that cache all the way into the scheduler, such that the scheduler could approve or reject it. The only way I thought of approving or rejecting was probably just scheduling a pod, and we do have preemption and priority levels.
F
E
A
I mean, the node is an abstraction, right — of how we represent the amount of resources of a single entity? So to me, the most logical way is to define what that node represents: does it represent half of the resources of that physical host, or all of it? If you represent, for example, all of it, then you are practically saying that Kubernetes can use all of those resources to schedule pods.
A
So yes, if you don't want to do that, then you really need to configure the node object that represents that physical host to consume half of it, or whatever amount of resources, such that you keep enough for the other system to schedule workloads on that node. So from my perspective, it is more of, like, a flag or some configuration to the kubelet that configures how much resource you want to give it, rather than the other way around. Well.
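The partitioning described above — advertising only part of the physical host to Kubernetes — can be sketched numerically. This mirrors, in very simplified form, the kubelet's real node-allocatable computation (allocatable = capacity minus reserved amounts, as with the kubelet's `--system-reserved` style flags); the function itself is illustrative:

```go
package main

import "fmt"

// allocatable computes how much of the physical host's CPU the node object
// should advertise to Kubernetes, given how much is kept back for the
// external VM scheduler. Values are CPU millicores.
func allocatable(capacityMilli, reservedForVMsMilli int64) int64 {
	a := capacityMilli - reservedForVMsMilli
	if a < 0 {
		a = 0 // never advertise negative capacity
	}
	return a
}

func main() {
	// A 32-core host where half is kept for the external VM scheduler:
	// Kubernetes sees a 16-core node and schedules only within that.
	fmt.Println(allocatable(32000, 16000)) // 16000
}
```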
E
A
A
So the idea we discussed with Aldo, and shown here, is that we would like to have a new feature flag that will allow, while we're migrating predicates and priorities into plugins — for a specific predicate, we will have its implementation as a predicate and its implementation as a plugin, and similarly for priorities. And it's a single flag.
A
We will either say I'm going to use predicates, or I'm going to use plugins. The reason is that some of the logic will be just copy/paste, but for others it might not be, because we're going to have to split it into different extension points. So, just for a sense of safety, the idea is to have a feature flag and graduate it alpha, beta, etc., until it is GA.
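The single-flag switch described above might look roughly like this. It is a sketch of the stated idea, not the real scheduler API — the types and functions are hypothetical stand-ins for a legacy predicate and its framework-plugin counterpart:

```go
package main

import "fmt"

// Filter stands in for a node-filtering check; both implementations below
// encode the same logic, which is exactly the copy/paste case mentioned.
type Filter func(podCPUMilli, nodeFreeCPUMilli int64) bool

func legacyPredicate(podCPU, nodeFree int64) bool { return podCPU <= nodeFree }
func frameworkPlugin(podCPU, nodeFree int64) bool { return podCPU <= nodeFree }

// selectFilters picks one implementation set based on a single feature
// gate: all predicates, or all plugins — never a mix. This lets the two
// code paths coexist while the flag graduates alpha -> beta -> GA.
func selectFilters(pluginsEnabled bool) []Filter {
	if pluginsEnabled {
		return []Filter{frameworkPlugin}
	}
	return []Filter{legacyPredicate}
}

func main() {
	for _, f := range selectFilters(true) {
		fmt.Println(f(500, 1000)) // true: the pod passes the plugin filter
	}
}
```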
A
C
When we designed the scheduling framework, my impression was that these are all internal logic of the scheduler. At the same time, we still need to keep the mechanism to disable some of these via the existing policy config of the scheduler — basically, since that policy config existed, we need to be backward compatible, and if a particular predicate is disabled, or if a particular priority function is removed or its weight is changed, we need to apply those changes to the plugins. But other than that.
G
C
A
C
I understand. To be honest with you, the very early plan was to build a second scheduler — call it scheduler 2.0 — keep it in its own repo, and build the scheduling framework in parallel. But later on I felt like it would be almost impossible to roll out a scheduler 2.0 if we didn't build the framework into the existing scheduler, so we decided to basically build the framework into the existing one.
C
I understand that there are a lot of concerns. Similarly, there were a lot of concerns with preemption — I remember that I got quite a bit of a hard time when we were rolling that out; everybody was freaked out, people were thinking our billions of dollars of workloads are in danger. So I understand your concern. It does make sense if it really makes things much more reliable eventually — then, yeah.
C
A
So, as part of the migration plan, we will be moving the predicates into plugins, but I have a section in my doc to present a reasonable way for things like even the DaemonSet controller to consume those filters. We're not just going to cut it and tell you guys, you know, deal with it; we will collaborate on that part, and we're not going to remove those functions until we're sure that everybody has moved into a reasonable state.
G
A
F
A
It's a doc that will probably convert into a KEP. The core part of that KEP is going to be deprecating the policy config that the scheduler takes — right, exactly — and if we deprecate that, we are practically deprecating access to predicates and priorities, which means everybody has to use plugins, etc. As part of that, we have to make those available as plugins via the component config, right.
F
So again — I'm not very familiar with it — plugins are basically going to be core, like compiled into the scheduler in core Kubernetes, right? Correct, yeah. Okay, and then, when we disable predicates, is there going to be a one-to-one flag for each plugin, so that enabling it will disable the corresponding predicate? No.
A
So, as we mentioned, this is practically internal logic. As long as we are able to support the policy config API while it is being deprecated, then we're fine, and the flag that I was mentioning is just basically a generic way of switching between the old implementation and the new one, right.