Kubernetes SIG Scheduling, 26 Jan 2023

From YouTube: Kubernetes SIG Scheduling Weekly Meeting for 20230126

Description

In this meeting, we brainstormed the idea of preferred topology affinity.

A

This meeting is being recorded. Yeah, okay.

B

So I've been thinking about scoring resources from a topology point of view, for groups of nodes.

B

Let's say a rack, for instance: doing scheduling while looking at the resources of the whole rack, and letting that influence the scheduling decision. I've actually made some experiments related to this already. I based the experiment on the NodeResourcesFit plugin, which already beautifully calculates resource scores for individual nodes. On top of that I put a little bit of hacky code into its normalize function and did some calculation there for topology key-value pairs shared among nodes, and then used that as a sort of adjustment to the node score.
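A minimal sketch of the kind of adjustment B describes here, written in Python rather than as a real scheduler plugin. The function name, the `weight` parameter, the averaging rule, and the `example.com/rack` label key are all assumptions made for illustration; the recording does not say how the experiment actually combined the numbers.

```python
from collections import defaultdict

def adjust_scores_by_topology(node_scores, node_labels, topology_key, weight=0.5):
    """Blend each node's score with the mean score of its topology group.

    Hypothetical helper: nodes sharing the same value for `topology_key`
    form one group (for example, one rack).
    """
    groups = defaultdict(list)
    for node, score in node_scores.items():
        value = node_labels.get(node, {}).get(topology_key)
        if value is not None:
            groups[value].append(score)
    group_mean = {value: sum(s) / len(s) for value, s in groups.items()}

    adjusted = {}
    for node, score in node_scores.items():
        value = node_labels.get(node, {}).get(topology_key)
        if value is None:
            adjusted[node] = score  # nodes without the label keep their own score
        else:
            adjusted[node] = (1 - weight) * score + weight * group_mean[value]
    return adjusted

# Per-node scores as an existing per-node scorer might produce them (made-up values).
node_scores = {"n1": 90, "n2": 10, "n3": 60, "n4": 60}
node_labels = {
    "n1": {"example.com/rack": "rack-a"},
    "n2": {"example.com/rack": "rack-a"},
    "n3": {"example.com/rack": "rack-b"},
    "n4": {"example.com/rack": "rack-b"},
}
adjusted = adjust_scores_by_topology(node_scores, node_labels, "example.com/rack")
```

With these made-up numbers, node `n1` still ranks first, but its lead over the nodes in `rack-b` shrinks because the rest of its rack scores poorly.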

B

I mean, there are many different ways you could do that, but I just wanted to see if I could do this, and it worked pretty well. Do you happen to know if there's any other way of getting this kind of score for a topology group of nodes done for resources?

B

At least I couldn't figure out anything else. The use case is a little bit connected to pod groups. Suppose you have a rack which has very fast interconnects between GPUs, but those interconnects don't span to other racks. So you basically want a bunch of pods to be scheduled into some rack. You don't want to say which rack it is, just some rack, as long as it has resources, and inter-pod affinity basically allows for doing that.

B

But there's this corner case where the resource status of the whole rack should be considered for the first pod, so that it lands in a good rack, and then the following pods would use inter-pod affinity to go to the same place.

B

So that's basically the use case that I have in mind.

B

What do you think?

A

So, first of all, let me clarify the requirements. If I understand correctly, you want to say: I don't want particular kinds of resources to be used by regular workloads that may not need them, so that those expensive resources are left for the workloads which really need them. Is that the requirement?

B

Not exactly. It's just any kind of resource scoring that is being done.

B

There are many ways of configuring that: you can do least allocated, and so forth. So I would want to have some sort of topology angle to it. Say, take the least-used rack, and prefer the best node in the least-used rack. That's basically the goal. As for what counts as the resource, that's configurable at the moment and I wouldn't propose changing that. I would prefer that to stay configurable.
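The "least-used rack, then the best node in it" goal can be sketched as a two-level choice. This is only an illustration: the node dictionaries, the `rack` field, and the choice of a single configurable `resource` are assumptions, and a real scheduler would express this through scores rather than a direct pick.

```python
def pick_node_in_least_used_rack(nodes, resource="cpu"):
    """Pick the least-utilized rack, then the least-utilized node inside it.

    `nodes` is a list of dicts with hypothetical fields: name, rack,
    requested and allocatable (both mapping resource name to amount).
    Allocatable amounts are assumed to be positive.
    """
    totals = {}
    for n in nodes:
        used, cap = totals.get(n["rack"], (0, 0))
        totals[n["rack"]] = (used + n["requested"][resource],
                             cap + n["allocatable"][resource])

    # Rack-level utilization of the configured resource.
    least_used = min(totals, key=lambda r: totals[r][0] / totals[r][1])

    candidates = [n for n in nodes if n["rack"] == least_used]
    best = min(candidates,
               key=lambda n: n["requested"][resource] / n["allocatable"][resource])
    return least_used, best["name"]

nodes = [
    {"name": "a1", "rack": "rack-a", "requested": {"cpu": 6}, "allocatable": {"cpu": 8}},
    {"name": "a2", "rack": "rack-a", "requested": {"cpu": 2}, "allocatable": {"cpu": 8}},
    {"name": "b1", "rack": "rack-b", "requested": {"cpu": 3}, "allocatable": {"cpu": 8}},
    {"name": "b2", "rack": "rack-b", "requested": {"cpu": 1}, "allocatable": {"cpu": 8}},
]
choice = pick_node_in_least_used_rack(nodes)
```

Here `rack-b` is half as utilized as `rack-a`, so the pick lands on `b2`, the emptiest node in `rack-b`.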

B

What the most important resource is, well, that depends.

B

You can configure that quite nicely.

A

Okay, I got your idea. It's more topology-based. It's like pod affinity, not node affinity.

B

Yeah, yeah, so...

A

It's a preferred topology affinity thing.

C

If I understand the use case correctly, this is basically: if there were all-or-nothing scheduling, you could use inter-pod affinity to solve this problem.

B

Not exactly. If you have this kind of all-or-nothing coscheduling thing, this is orthogonal to that. If you just use the current coscheduling plugin, for instance, it just tells you whether we can deploy these pods to the cluster, the whole cluster. But would inter-pod affinity help there? I mean, it's basically a preferred mechanism,

B

isn't it, if you start defining it for topology keys?

C

So I guess, let's put it another way: all-or-nothing scheduling is kind of the equivalent of a filter. It says yes or no, I can fit in this topology, but it cannot say which topology is preferred

C

if there are multiple topologies that would satisfy the requirement.

B

Right. And actually, can you define this sort of all-or-nothing mechanism to use a single topology, so that it fails entirely if it doesn't end up in a single topology?

C

Do you think the coscheduling plugin can do that, Wei?

A

I think it is technically possible, but the challenge here is that, basically, we are treating an arbitrary set of nodes as an entity, and we treat that entity as the minimal unit when we make a scheduling decision. That is the purpose, right? It's similar to what we talked about before, where we wanted to see whether we can take the physical node, the bare-metal machine underneath when nodes run as VMs, into account in the scheduling decision.

A

I think this shares the same principle: we want to use a logical unit, no matter whether it's a physical node or a rack or whatever you define.

A

The challenge here is that a topology is nothing but a label on a node. That means there can be as many combinations as you have labels, like foo=bar; any label can form a topology. A node, by contrast, is a physical unit: if you have 5,000 nodes, it's just 5,000. But for topology key combinations,

A

it's arbitrary; there can be a lot of them. You may spend a lot of resources computing during the scheduling cycle. So that's the challenge I see. But technically I think you can explore using a plugin first. And I know that in production environments there are a lot of labels, which means a lot of topologies.

A

Maybe you can come up with some plugin arguments that say: okay, I limit the topology key to these values. That would cut out a large amount of computation.
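To make the cost concern concrete, here is a small sketch that counts how many topology domains (distinct label values) a set of nodes produces, with and without an allow-list of keys. The function name and the `team` label are assumptions for illustration; `kubernetes.io/hostname` and `topology.kubernetes.io/zone` are standard well-known labels.

```python
from collections import defaultdict

def topology_domains(node_labels, allowed_keys=None):
    """Count distinct values per label key, optionally limited to allowed keys."""
    domains = defaultdict(set)
    for labels in node_labels.values():
        for key, value in labels.items():
            if allowed_keys is None or key in allowed_keys:
                domains[key].add(value)
    return {key: len(values) for key, values in domains.items()}

node_labels = {
    "n1": {"kubernetes.io/hostname": "n1", "topology.kubernetes.io/zone": "z1", "team": "ml"},
    "n2": {"kubernetes.io/hostname": "n2", "topology.kubernetes.io/zone": "z1", "team": "web"},
    "n3": {"kubernetes.io/hostname": "n3", "topology.kubernetes.io/zone": "z2", "team": "ml"},
}

# Every label key is a potential topology: 3 + 2 + 2 = 7 domains to aggregate.
all_domains = topology_domains(node_labels)

# Restricting to one configured key leaves only the 2 zone domains.
limited = topology_domains(node_labels, allowed_keys={"topology.kubernetes.io/zone"})
```

Even in this three-node toy cluster, the unrestricted case tracks more groups than there are nodes, which is the effect A is warning about.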

B

I actually went a little further than that: I limited it to a single one, but.

C

Yeah, that was yeah.

B

That was just the proof of concept, so I wanted to limit it, yeah.

A

Yeah, basically the input to scheduling is that you have to tell the scheduler what rules define the logical scheduling unit, right? Then, in the scheduling plugin's computation cycle, it can calculate this kind of information instead of just treating the node as the minimal scheduling unit. I think technically it's possible. But to be honest, I don't see a popular, broad requirement for this.

A

Others, do you see any similar requirement? If not, I would say maybe start with an experimental scheduler plugin, and if that works well and we see common requirements from other companies and other users, we can see how to proceed.

C

I think one question I would have is this: the scheduler currently takes decisions per pod, right?

C

So let's say the first pod comes in. Even if we have information about the topology in the resource scoring, sure, you could score the topologies based on the incoming pod, but that's not necessarily the best location, because we actually should have considered all the pods from this group.

C

So, in a way...

B

Yeah, ideally you would want to consider the whole group. However, going for, let's say, the least-allocated topology,

B

I think it's your best bet. It's a best-effort thing.

A

I think, ideally...

A

I think maybe a real-world use case is that I want to bin-pack this rack as much as possible and then go to the next rack.

B

That's the other alternative.

B

So suppose you have not a group of pods but a single pod, and you want to do this bin packing. So why not do the scoring for most allocated and run the same logic? Scoring is just a number, so it would end up in the most-allocated rack where it fits.
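B's point that "scoring is just a number" can be sketched by running the same rack-level logic under two strategies. The strategy names mirror the `LeastAllocated` and `MostAllocated` scoring strategies of NodeResourcesFit; everything else (function names, the per-rack list of `(requested, allocatable)` pairs, the single-resource fit check) is an assumption for illustration.

```python
def rack_score(requested, allocatable, strategy):
    """Score one rack on a 0-100 scale from its aggregate utilization."""
    utilization = requested / allocatable
    if strategy == "MostAllocated":
        return 100 * utilization
    if strategy == "LeastAllocated":
        return 100 * (1 - utilization)
    raise ValueError(f"unknown strategy: {strategy}")

def best_rack(racks, pod_request, strategy):
    """Return the highest-scoring rack that has at least one node the pod fits on."""
    scores = {}
    for name, nodes in racks.items():
        # Skip racks where no single node has enough free capacity.
        if not any(alloc - req >= pod_request for req, alloc in nodes):
            continue
        total_req = sum(req for req, _ in nodes)
        total_alloc = sum(alloc for _, alloc in nodes)
        scores[name] = rack_score(total_req, total_alloc, strategy)
    return max(scores, key=scores.get) if scores else None

# rack name -> list of (requested, allocatable) per node, made-up values
racks = {
    "rack-a": [(7, 8), (8, 8)],  # nearly full, no node has 2 units free
    "rack-b": [(4, 8), (6, 8)],
    "rack-c": [(0, 8), (1, 8)],
}
packed = best_rack(racks, pod_request=2, strategy="MostAllocated")
spread = best_rack(racks, pod_request=2, strategy="LeastAllocated")
```

With bin packing the pod goes to `rack-b`, the fullest rack where it still fits; with spreading it goes to the nearly empty `rack-c`. Only the scoring function changes.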

B

It could be done both ways, and I was actually thinking about, you know...

C

But the problem is that we take the decisions for each pod. So if, let's say, it's a big job with multiple pods, then the first pod might choose a particular rack, which then gets full, and then the entire job doesn't fit. I don't think the coscheduling plugin would be able to retry in a different rack,

C

if I'm correct.
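C's concern can be shown with a toy model in which every pod needs one free slot and all pods of a job must share a rack. Both functions and the slot model are assumptions for illustration only; neither reflects how the coscheduling plugin is implemented.

```python
def place_per_pod(free_slots, job_size):
    """Greedy bin packing decided by the first pod alone.

    The first pod picks the fullest rack that still has room; the remaining
    pods must follow it. Returns the rack, or None if the job gets stuck.
    """
    candidates = [rack for rack, free in free_slots.items() if free >= 1]
    if not candidates:
        return None
    rack = min(candidates, key=lambda r: free_slots[r])
    return rack if free_slots[rack] >= job_size else None

def place_group_aware(free_slots, job_size):
    """Same preference, but only among racks that can hold the whole job."""
    candidates = [rack for rack, free in free_slots.items() if free >= job_size]
    return min(candidates, key=lambda r: free_slots[r]) if candidates else None

free_slots = {"rack-a": 2, "rack-b": 8}
stuck = place_per_pod(free_slots, job_size=4)       # first pod commits to rack-a
fits = place_group_aware(free_slots, job_size=4)    # whole job considered up front
```

The per-pod decision commits to `rack-a`, which has room for only two of the four pods, while a group-aware choice would have gone to `rack-b`.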

A

I think that depends on the use case. Bin packing is just one kind of scenario. There are also others.

B

Yeah, and I was thinking of building on top of the existing NodeResourcesFit plugin, because it already has all this beautiful math and these strategies for least allocated, most allocated, and so forth. I wouldn't want to recalculate the resources; I would want to reuse the calculations of the existing plugin.

B

So that kind of argues against writing a new plugin, unless I create some sort of hack that gets the resources from one plugin to another. I don't know how to do that yet, but...

C

I have another question, though. Okay, it's true that this plugin already has all the information, but is this feature useful if you're just talking about single pods? I feel like it's not; it's only useful if we're talking about groups of pods.

B

I kind of disagree, in the sense that if you have multiple different kinds of, let's say, AI workloads, and for some reason some of them require just a single pod, you could push those into the topologies in the cluster that are not entirely unused.

B

There are lots of resources already used there, but you want to do the bin packing because it's better to do it there: you fill that rack up and keep your other racks free for bigger jobs. You get the idea.

A

Yeah, I agree. To me it doesn't seem to be necessarily related to whether it's a job; it can totally fit a single pod. It's just like what we do for preference among nodes: we switch to another angle and express preference for a rack, or for any arbitrary topology you define.

C

I think, yeah, if you still see value for single pods, I guess it makes sense to start with a plugin. I would start with a different plugin just so that you can get some feedback and iterate first, before going upstream. I have one more thought,

C

but I'll leave it there for now.

B

Okay. Well, I shouldn't have problems writing an entire plugin out of it instead of just piggybacking on top of the existing NodeResourcesFit. It should be doable. I should be able to show something. I'm not going to give you any date or anything like that, but yeah.

C

Yes, one more thing. Wei was talking about how topologies are just labels. So the way you should probably start is by having this as a plugin configuration, where you specify either just one topology key or a list of topology keys that the plugin will pay attention to, set at scheduler startup, that is, before you start the scheduler.

C

And of course there are a couple of well-known topology keys that you could use as defaults, because those are in the standard labels, and those are...

C

I don't remember the names, but they are referred to as zones and regions. In a cloud environment they refer to zones and regions of the particular cloud provider, but a lot of users that run on-prem, on their own servers, already choose to map racks onto zones. That's basically what some people do: they just treat a zone as a rack.

C

So that's where you could start.
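A sketch of what such plugin arguments could look like, modeled as a plain Python dataclass rather than a real scheduler configuration type. `TopologyScoringArgs` and `domains_for_node` are hypothetical names; the two default keys are the standard well-known zone and region labels, which is presumably what C is referring to.

```python
from dataclasses import dataclass

# Standard well-known node labels for zone and region.
WELL_KNOWN_TOPOLOGY_KEYS = (
    "topology.kubernetes.io/zone",
    "topology.kubernetes.io/region",
)

@dataclass(frozen=True)
class TopologyScoringArgs:
    """Hypothetical plugin arguments, fixed when the scheduler starts."""
    topology_keys: tuple = WELL_KNOWN_TOPOLOGY_KEYS

    def __post_init__(self):
        if not self.topology_keys:
            raise ValueError("at least one topology key is required")
        if len(set(self.topology_keys)) != len(self.topology_keys):
            raise ValueError("topology keys must be unique")

def domains_for_node(args, labels):
    """Return only the configured topology labels that a node actually carries."""
    return {key: labels[key] for key in args.topology_keys if key in labels}

# An on-prem cluster that maps racks onto zones can rely on the defaults.
defaults = TopologyScoringArgs()
node = {"topology.kubernetes.io/zone": "rack-12", "team": "ml"}
node_domains = domains_for_node(defaults, node)
```

Because the keys are fixed up front, the plugin only ever groups nodes by the configured labels and ignores every other label on the node.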

B

Yeah, I was initially thinking about just using the scheduler config. I see some of the plugins actually use information in the pods, like this inter-pod affinity thing, but I wasn't initially planning on touching the pod specs. That would be complicated.

C

Yes, you'd need API changes, and that's a long process. But also, not every cluster needs support for topology scoring, I guess.

B

And you can adjust things using profiles in the scheduler config. So a lot of things can be done quite dynamically, or sort of dynamically, even with those. You don't always have to have it in the pods.

A

I think that's the only item we have for today's meeting. If there's nothing else, we can give you a few minutes back. See you in two weeks.

A

Thank you. Thank you. Bye.