From YouTube: Kubernetes SIG Scheduling Weekly Meeting for 20210520
A
Hi everyone, today is May 20, 2021. Welcome to this week's SIG Scheduling meeting. This meeting is being recorded and will be uploaded to the YouTube channel. Alright, I suppose you can all see my screen, right?
A
Okay, that's for your information. Secondly: last week Dave and I introduced the idea of adding extended resource support to the BalancedAllocation plugin, and during the review, both from me and Abdullah, we found that there are some design flaws in how extended resources are supported by the existing plugins, like LeastRequested and MostRequested.
A
So basically, the problem we found is that in a heterogeneous environment — which means some machines have a particular resource but some don't — we are not scoring the machines properly. For example, we have three nodes, and not all of them have a GPU.
A
Suppose a pod comes in. This is the current usage, in terms of requested resources against capacity, and in this case the pod doesn't request any GPU resource, right? We calculate the accumulated requested resources divided by the capacity to get the final score for each node. For LeastRequested, node one will score highest, node two will be second, and node three will have the lowest score.
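To make the flaw concrete, here is a minimal sketch of least-requested-style scoring (illustrative only, not the actual scheduler code; the node sizes are invented): each resource contributes `(capacity - requested) / capacity`, the per-resource fractions are averaged, and so a node whose GPUs nobody requested gets a perfect score on the GPU dimension and floats to the top.

```python
MAX_NODE_SCORE = 100

def least_requested_score(requested: dict, capacity: dict) -> float:
    """Average (capacity - requested) / capacity over every resource the
    node advertises, scaled to 0..100 -- higher means more free capacity."""
    fractions = [
        (capacity[r] - requested.get(r, 0)) / capacity[r] * MAX_NODE_SCORE
        for r in capacity
    ]
    return sum(fractions) / len(fractions)

# Node 1 advertises GPUs; node 2 does not. The incoming pod requests no
# GPU, yet node 1 ranks higher purely because its untouched GPU capacity
# contributes a perfect per-resource score.
gpu_node = least_requested_score({"cpu": 2, "memory": 4},
                                 {"cpu": 8, "memory": 16, "gpu": 8})
plain_node = least_requested_score({"cpu": 2, "memory": 4},
                                   {"cpu": 8, "memory": 16})
print(gpu_node > plain_node)  # True: the GPU node wins for a non-GPU pod
```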
A
This doesn't quite make sense, because the GPU is a scarce resource, right? If a pod doesn't request it — and in this case you can see how the other resources compare — it probably makes more sense to choose node three, so that we don't occupy the scarce resource. But this is not the current behavior. So we want to get a consensus on how, and whether, we should improve this, and you can leave any opinion on these issues. After we reach a consensus — say, okay, we need to adjust these algorithms —
A
then we can go into the algorithm details, like what kind of algorithm we should use to score the nodes more fairly. So basically, yeah, there's some discussion on which kind of algorithm we should choose, and also, as I mentioned, maybe we can run the different algorithms and show the differences in the results.
A
So that's the issue around reviewing BalancedAllocation and adding extended resource support to it. I think we may need to resolve this first, if it doesn't take a long time, since it blocks this issue: the problem exists in the extended resource support, and this proposal is intended to add extended resource support, so it basically depends on this.
B
Yeah — I agree that maybe we should improve it, but typically you wouldn't want to schedule a pod that doesn't request a GPU on a GPU node in the first place.
B
Usually — for example, on GKE — what we do is put a taint on these nodes by default, and only a pod that tolerates these nodes, which is typically a pod that requests a GPU, will get scheduled there. But the general point, I think, is still valid: we need to take care of this in a probably better way. I just don't think it's as problematic as you might think.
B
Right — those scarce resources are expensive; you don't want pods that won't use them hogging them in the first place. So a filter here is probably better in general, not just a preference.
A
The idea is to have a single plugin as the scoring plugin and make those behaviors options in the form of plugin arguments, as Abdullah proposed, so that we can reuse the existing NodeResourcesFit. Right now it is a filter plugin, but we can make a score variant of it as well, so that least-allocated, most-allocated, and the others become plugin arguments. This is good, but we also need to keep supporting v1beta1, which uses the separate scoring plugins as they are.
A
So once we migrate to v1beta2 — once we have the unified NodeResourcesFit plugin as our scoring plugin — we may need to handle the conversion as well, among other things. Thanks to Abdullah for doing a breakdown of the issues, so we can tackle these items one by one. Right now the first item is being worked on, and then the other items.
B
Because the default set of plugins right now is set outside the defaulting logic for component config, which is kind of problematic: we want it to be versioned, and if we want it to be properly versioned, we need to set the default set of plugins in the defaulting logic. That way, in v1beta1, for example, we don't change the behavior — we continue to use LeastAllocated — and in v1beta2 we change the behavior.
B
This is important because, if you remember, you can disable and enable plugins, right? Disabling a plugin removes it from the default set. So if, for example, a provider disables LeastAllocated, and LeastAllocated is no longer part of the default set, that's going to break their setup. We want a way for them to do this properly, and the proper way is to say: if you are using v1beta1, then you can disable LeastAllocated, because it exists there.
B
But if you are on v1beta2, you don't disable it — you change the configuration of the default plugin that already exists, which is the Fit plugin. So, yeah, I created another issue to remove the algorithm provider, and I think that's a good cleanup: it moves things toward really making all the defaulting logic for the scheduler more and more part of component config.
A
Okay, thanks Abdullah for working on this. The next item is from Talor. Do you want to take over and give a demo and an introduction of the scoring plugin proposal for the topology-aware scheduler?
D
I can. Yes — can you hear me?
D
Great, okay. I'll try to keep the mic a bit farther away, because it's better now.
D
Hi everyone, my name is Talor and I'm a member of the telco engineering team at Red Hat. As part of our ongoing effort to improve the way Kubernetes handles latency-sensitive workloads, I'm going to introduce you to the proposal for a NUMA-aware score plugin.
D
Okay. So this is today's agenda, and with no further ado, let me jump into the motivation section. Currently there is the NodeResourceTopology filter plugin.
D
Kubelet has more knowledge and more information about how the node's resources are spread among NUMA nodes, and the Kubernetes scheduler doesn't. So we would like to add this score plugin, which is NUMA-aware, in order to reduce that gap. Eventually we will end up with fewer of the issues I mentioned before — pods stuck in the Pending state — and in the long term it will allow more optimal utilization of the system resources.
A
Right — I just want to give some background. Topology-aware plugin support has been available in scheduler-plugins. In the background we pursued adding it upstream, but because the NodeResourceTopology API is not yet mature and is served as a CRD, adding CRD support upstream is against the community rules.
D
So this score plugin offers three different strategies for score calculation, which are basically imitations of already-existing in-tree Kubernetes plugins.
D
So we have the most-allocated strategy, which is basically a way to bin-pack as many pods as possible onto a given node.
D
So this is an example of the kind of problem the new score plugin is intended to solve. I'll try to very quickly provide a comparison between two plugins that have the same logic, but one of them is NUMA-aware and the other is not.
D
There is the Kubernetes in-tree MostAllocated score plugin, which tries to bin-pack as many pod requests as possible onto a given node, and we have the new score plugin configured with the most-allocated strategy. So they both try to bin-pack as many pod requests as possible onto a given node.
D
This is the current cluster. We have two hosts, each with two NUMA nodes: one host has four cores on each NUMA node, and the other has two cores on each NUMA node.
D
The brown cores are reserved cores and the blue ones are available for allocation. We have a set of three requests — three pods — to be deployed on this cluster. I've attached a link to a live demo, but due to time constraints I won't present it right now; I'll just run very quickly through the problem itself, and later you can take a look and see it in a real environment.
D
So, as I said, scenario one uses the in-tree MostAllocated score plugin, and we try to deploy pod number one on host one. Another important thing to mention here is that both nodes are configured with the single-numa-node topology manager policy.
D
For the people here who don't know what the single-numa-node policy is: it means the pod will be admitted only if all the CPUs it asks for come from the same NUMA node.
D
So this is a good example of a pod that can be accepted, because all of its cores, as you can see here, come from the same NUMA node. The first pod asks for three cores, and it will be deployed successfully.
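The admission rule just described can be sketched as a tiny check (a hypothetical helper for illustration, not kubelet code): under the single-numa-node policy, a request is admissible only if a single NUMA node can supply all of the exclusive CPUs by itself.

```python
def fits_single_numa(cpu_request: int, free_cpus_per_numa: list) -> bool:
    """single-numa-node policy: admit only if one NUMA node alone can
    satisfy the entire exclusive-CPU request."""
    return any(free >= cpu_request for free in free_cpus_per_numa)

# A host with two NUMA nodes of 4 free cores each: a 3-CPU pod fits,
# but a 5-CPU pod is rejected even though 8 cores are free in total.
print(fits_single_numa(3, [4, 4]))  # True
print(fits_single_numa(5, [4, 4]))  # False
```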
D
Okay — so this pod will get stuck in the Pending state. Let's go back and try to do the same with the new score plugin. Again, the same setup, the same cluster — everything remains the same; we just configured the pods to be deployed with this scheduler.
D
Request number one is exactly the same: pod one, three CPUs, on host one, NUMA node zero. Then here comes the key difference: pod number two will be deployed on host number two, because since this plugin is NUMA-aware, it says it's better to take two cores out of the two available at the NUMA level than two cores out of the three available at the NUMA level.
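That preference can be modelled as a most-allocated-style score per NUMA node — the fraction of that NUMA node's free CPUs the request would consume (an illustrative model, not the plugin's exact formula):

```python
def numa_bin_pack_score(cpu_request: int, free_cpus: int) -> float:
    """Score one NUMA node for an exclusive-CPU request: the fraction of
    its free CPUs the request would consume; 0 if it cannot fit at all."""
    if cpu_request > free_cpus:
        return 0.0
    return cpu_request / free_cpus

# Pod 2 asks for 2 exclusive CPUs. Host 2's NUMA node has 2 free cores,
# host 1's has 3: consuming 2 of 2 scores higher than 2 of 3, so the
# NUMA-aware plugin prefers host 2.
print(numa_bin_pack_score(2, 2) > numa_bin_pack_score(2, 3))  # True
```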
D
So this is a tighter fit than host one, NUMA node one, and of course the third request will eventually be deployed successfully as well.
D
So that's basically the explanation of the problem. Since I don't have much time, I'll skip the algorithm itself and just give you a quick look at the manifest and what it looks like. This is a standard scheduler configuration manifest.
A
Yeah — you mentioned that you do the scoring per NUMA node, right? So in the final normalized scoring — for example, for node one you just mentioned there are two candidate NUMA nodes — will you choose the higher of the NUMA node scores, or do you do an aggregation over the two NUMA nodes?
D
You're asking how we decide the final score for the node? That's the question? — Yeah, correct. — Okay. So basically, assuming we have two NUMA nodes, as you said, we score each one of them independently and return the NUMA node with the minimal score as the node's final score. We won't aggregate them together or anything like that. Okay.
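In other words, the per-NUMA-node scores are reduced with a minimum, so one poorly fitting NUMA node drags the whole node's score down (an illustrative sketch of the aggregation just described):

```python
def node_score(per_numa_scores: list) -> float:
    """The node's final score is the minimum of its per-NUMA-node scores;
    the NUMA nodes are never averaged together."""
    return min(per_numa_scores)

# A node whose two NUMA nodes score 80 and 30 is reported as 30:
print(node_score([80.0, 30.0]))  # 30.0
```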
A
Is that decision consistent with the kubelet topology manager's algorithm? Because, for example, for node one — yes, we decide on NUMA node one in the scheduler — but when it comes to the kubelet's execution, it also has two choices there, right? So how can we ensure the result is consistent?
E
I can take that. The filter plugin that we have is almost a simplified version of the topology manager logic, so it does consider how alignment would happen, from a resource point of view, on a NUMA node.
E
And then the scoring plugin essentially scores the nodes based on their NUMA nodes and the resources that have been requested, and eventually you obviously select the node with the maximum score.
E
Yeah — that is a gap that is essentially known right now, because there's no way for the scheduler to convey exactly which NUMA node the resources should be allocated from. The topology manager still executes its own logic, so it is kind of best-effort that the topology manager will make the same decision the scheduler made; there could be a scenario where its decision differs from what the scheduler evaluated.
A
Yeah, and another potential risk is that you have to ensure the policies in both the kubelet and the scheduler are consistent. If you specify least-allocated on the scheduler side while the kubelet topology manager side is configured with most-allocated, that will also cause unexpected behavior.
E
Yeah, that's a good point. Where would you see this kind of hint? Obviously the scheduler is doing its evaluation, and we could figure out a way — maybe adding an annotation to the pod spec or something like that — for the kubelet, or an agent, to consider, to maybe act as a hint provider.
E
Yeah, that was actually me. I had proposed the KEP, as well as the implementation, upstream, but as I think we mentioned initially, the gap was that the CRD API itself is not mature yet. That's why we moved toward a scheduler plugin — to gain some attention and allow people to use it there — and then eventually move toward an in-tree plugin.
F
And what is missing — I think I would rather like to see the CRD maturing, yeah.
E
Yeah, so the CRD exists. And I want to clarify: is the question about the filter plugin that was proposed previously, or the scoring plugin that Talor just presented?
A
We got a lot of pushback on getting CRDs into the upstream, so I think once the NodeResourceTopology API is mature and we can get it incorporated into the core APIs, then we can move to the in-tree approach, yeah.
E
They are kind of related, and basically I'm the owner of this piece of work — it's me. I was initially working on the in-tree enablement, then started working on the out-of-tree enablement, and once things are in a reasonable state and we have people using it — and maybe we get some more feedback on the CRD API and think it's in a reasonable state — we'll go back to having those conversations with SIG Architecture, SIG Node, SIG Scheduling, everyone.
F
I see — okay, all right, I don't have any more concerns then, cool. But I would suggest you continue pursuing the graduation of the CRD into an API, and that should definitely involve SIG Scheduling when you do it. Even if it's only node-related, I think it's better for us to be involved from the beginning.
E
Totally — you can have a look at the current state, how it looks at the moment. I did open it in the staging section in kubernetes as well, but then I didn't want to bypass everyone agreeing on how the API should look. So yeah, just have a look and maybe give us some feedback if there are things you think should be done differently, or anything that needs to change.
E
Okay, so just to wrap this up, I would like to make sure we are all on the same page. We were thinking that, for the time being — given that the topology-aware scheduler plugin is in the scheduler-plugins repo — is it okay to go ahead and maybe create a separate KEP and push an implementation PR to scheduler-plugins for this change?
A
I think we should first raise a KEP to the kubernetes repo, and that KEP should focus on the kubelet policies first. So basically we have two options: one is to make the scheduler's NUMA node decision known to the kubelet; the other is to add scoring policies to the kubelet that have to stay consistent with the scoring plugin we later introduce on the scheduler side.
E
Sounds good — okay, that sounds like a good direction. We'll follow up on that then, and I'll keep you in the loop as we go along.
E
That is, a proposal for the hint and the coordination between kubelet and scheduler.
A
So it's not a good fit for first-time contributors; it does need some understanding of the internals. So yeah, that's it. Before we end this meeting, does anyone have any questions?