From YouTube: Kubernetes SIG Scheduling Weekly Meeting for 20221215
B
All right, welcome everybody to this session of SIG Scheduling. Today is December 15th and, as you may know, this meeting is being recorded. We actually have kind of a packed agenda.
B
We wanted to start today by mentioning a few of the things we are planning to do for the 1.27 release.
C
Yeah, can you give me back co-host? I think... yeah, can you make me a co-host?
B
Okay, so in the meantime, as I was saying: with the help of SIG Scalability, we were running some performance tests and we discovered a few low-hanging fruits in the scheduler that we can address to improve the throughput.
B
As you might know, for several releases we've been working on performance, and we are really getting down to the very low-level details, but there is still room for improvement. So here we opened a list of issues that will actually improve the performance of the scheduler. There are a few. This one is about some calculations, if I remember correctly.
B
The second one, right, that's also calculations. Some of them are even closed already. Some of them require API changes: for example, the third one might require adding some configuration to the kube-scheduler config, which is the component config for the scheduler. And there are more optimizations here and there. Maybe more interesting is the last topic in the scheduler.
B
So, for example, if a new node is created, we retry: we put the pods back into the active queue. But we even have extra checks: for example, if the pod has a node affinity, we immediately check just the affinity against that new node, and if it doesn't satisfy it, we skip. So in general, we ended up doing a lot of quick calculations to reduce the amount of retries for pods. But there is still an unconditional retry of pods, which we call flushing.
B
So after a certain period, the pods are put back into the active queue regardless of why they were unschedulable. We've been experimenting with increasing this timeout, this flushing period, but keeping the ability to revert it just in case we still have some bugs. So this issue is mostly about increasing the flush period even further. We are currently at five minutes by default.
B
We want to make it 15, but we still want to keep the option of reverting it back, and we want to do that through the kube-scheduler config API. So this is what the issue is about. If you have any concerns: I think we've been doing this carefully through other releases and there haven't been issues lately, so that's why we're proposing to increase the period even further, but please communicate any concerns.
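To make the flushing behavior concrete, here is a minimal Go sketch, assuming simplified stand-in types; the queue type and names are illustrative, not the actual kube-scheduler implementation.

```go
package main

import (
	"fmt"
	"time"
)

// Pod stands in for the scheduler's queued pod info.
type Pod struct{ Name string }

// queue holds active and unschedulable pods; flushInterval mirrors the
// flushing period discussed above (five minutes by default today).
type queue struct {
	active        []Pod
	unschedulable []Pod
	flushInterval time.Duration
}

// flush unconditionally moves every unschedulable pod back into the
// active queue, regardless of why scheduling failed earlier.
func (q *queue) flush() {
	q.active = append(q.active, q.unschedulable...)
	fmt.Printf("flushed %d pods back to the active queue\n", len(q.unschedulable))
	q.unschedulable = nil
}

func main() {
	q := &queue{
		unschedulable: []Pod{{Name: "web-0"}, {Name: "web-1"}},
		flushInterval: 15 * time.Minute, // the proposed new default
	}
	// In the real scheduler this would run on a timer firing every
	// q.flushInterval; here we trigger it once for illustration.
	q.flush()
}
```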
B
So currently, when a filter runs, we don't know if the filter actually made sense for a particular pod. For example, if we have the node affinity filter and the pod didn't define any node affinity, we want to be able to tell in metrics that this filter didn't make any meaningful decision for this pod. That way, for any new feature that we add, we can more clearly tell whether there are pods that are using the feature.
B
So
this
is
what
the
the
this
issue
is
about.
The
first
step
is
actually
determining
some
form
of
status
that
clearly
says
that
the
the
plugin
didn't
do
anything
meaningful.
So
that's
that's
the
current
War,
the
current
work
that
cancer
is
doing,
but
then
we
have
to
migrate.
We
have
to
do
a
pass
through
all
the
all
the
plugins
to
make
sure
we
are
using
this,
this
new
distinction,
and
with
that
once
that
is
that
is
finished.
B
We
would
also
like
to
include
that
into
the
logs.
So
it's
it's
more
easy
to
debug,
and
there
is
some
we
could
generalize
also
for
score
plugins,
but
we
haven't
really
thought
of
all
the
implications,
but
once
once
we
finish
this
and
there's,
if
there
is
still
time
in
the
release,
we
can
definitely
look
into
improving
metrics
and
logs
for
for
score
for
scoring.
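As a rough illustration of the idea, here is a hedged Go sketch, assuming a simplified status enum and flat label maps rather than the real scheduler framework API:

```go
package main

import "fmt"

// Code classifies a filter result. Skip is the new status discussed above:
// the plugin had nothing meaningful to decide for this pod.
type Code int

const (
	Success Code = iota
	Unschedulable
	Skip
)

// Pod and Node are simplified stand-ins for the real API objects; a real
// node affinity is much richer than a flat label map.
type Pod struct{ RequiredNodeLabels map[string]string }
type Node struct{ Labels map[string]string }

// filterNodeAffinity returns Skip when the pod declares no node affinity,
// so metrics (and later logs) can tell "passed" apart from "not applicable".
func filterNodeAffinity(p Pod, n Node) Code {
	if len(p.RequiredNodeLabels) == 0 {
		return Skip // nothing to evaluate; not a meaningful decision
	}
	for k, v := range p.RequiredNodeLabels {
		if n.Labels[k] != v {
			return Unschedulable
		}
	}
	return Success
}

func main() {
	node := Node{Labels: map[string]string{"zone": "a"}}
	fmt.Println(filterNodeAffinity(Pod{}, node) == Skip) // true: filter not applicable
	matching := Pod{RequiredNodeLabels: map[string]string{"zone": "a"}}
	fmt.Println(filterNodeAffinity(matching, node) == Success) // true
}
```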
B
So yes, that's kind of the small side of things that we want to achieve during this release, things that don't require KEPs. Abdullah, over to you to talk about the bigger features, the KEPs.
C
Okay, so yeah, we've got like one, two, three, four, five KEPs that we would like to graduate.
C
We have only one that is proposed as new, which is the first one. So, if you recall, we introduced the idea of matchLabelKeys in pod topology spread to solve the problem where, when you update a ReplicaSet, it goes through a rolling update.
C
You wanted a way to basically apply the constraints to the new ReplicaSet, not the old one, because while you're doing the update we know that at some point the pods of the old ReplicaSet will be scaled down and removed, and so those old pods from the older ReplicaSets should not be taken into account when doing the calculations of skew and whatnot.
C
So there is a proposal to introduce the same idea to pod affinity and pod anti-affinity. I think it's reasonable, making the API symmetric, although I expect that anti-affinity is the one where it will be used more often, rather than pod affinity. In general, I'm not sure how much usage we have for pod affinity, but it's nice to have; I wouldn't say we should block it.
C
So that's the first one. The others are, again... this is the pod topology spread one; it's the same idea as I discussed for pod topology spread. We want to graduate this to GA, to stable; it's in 1.25 now. I think this is a really nice feature. It actually makes it easier in general to use pod topology spread: instead of specifying the label by both key and value, you only need to specify the label key, and the value will be detected by the scheduler.
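For reference, such a constraint looks roughly like this in the corev1 Go types; a sketch assuming the matchLabelKeys field as it shipped in alpha behind a feature gate, with pod-template-hash being the label a Deployment stamps on each ReplicaSet's pods:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Spread "app: web" pods across zones. The scheduler resolves the
	// value of pod-template-hash from the incoming pod's own labels, so
	// each ReplicaSet of a Deployment is spread independently during a
	// rolling update, without users spelling out the hash value.
	c := corev1.TopologySpreadConstraint{
		MaxSkew:           1,
		TopologyKey:       "topology.kubernetes.io/zone",
		WhenUnsatisfiable: corev1.DoNotSchedule,
		LabelSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{"app": "web"},
		},
		// Only the key is listed; the value is detected by the scheduler.
		MatchLabelKeys: []string{"pod-template-hash"},
	}
	fmt.Printf("%+v\n", c)
}
```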
C
The third one is the mutable scheduling directives. This is in beta as well, and the hope is to graduate it to GA in 1.27.
C
The next one is an enhancement related to the scheduler having more guarantees around respecting the PodDisruptionBudget, I guess, on preemption. I don't recall the exact details of this one, like what is it that we're trying to...
E
Okay, I can mention a little bit. So the background is that PDB was sort of a first-class citizen in terms of preemption since day one, but in a best-effort manner. That means, if for example two candidate pods are both PDB-protected and they are tied, then in this case we will preempt the PDB-protected pods anyway. But in the other case, what I mean by best effort is:
E
If there is a pod not protected by a PDB and another pod protected by a PDB, then we will definitely choose the pod without the PDB protection. So that is why I call it a best-effort manner. But some users say: PDB represents the disruption semantics in all dimensions, not only preemption in the scheduler, but also in the eviction API and others. So it's reasonable for us to at least give an option to the user that says: don't preempt pods protected by a PDB.
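A minimal sketch of the two policies, assuming simplified types; the strict flag here is hypothetical and just stands in for the opt-in being discussed:

```go
package main

import "fmt"

// victim is a simplified candidate pod for preemption.
type victim struct {
	name         string
	pdbProtected bool // would preempting this pod violate its PDB?
}

// pickVictims contrasts the two policies: today's best-effort behavior
// prefers pods whose PDBs are not violated but will still preempt
// protected ones when nothing else is available; the strict opt-in
// discussed above would refuse to preempt PDB-protected pods at all.
func pickVictims(candidates []victim, strict bool) []victim {
	var unprotected, protected []victim
	for _, v := range candidates {
		if v.pdbProtected {
			protected = append(protected, v)
		} else {
			unprotected = append(unprotected, v)
		}
	}
	if len(unprotected) > 0 {
		return unprotected // always preferred today
	}
	if strict {
		return nil // opt-in: leave PDB-protected pods alone
	}
	return protected // best effort: preempt them anyway
}

func main() {
	pods := []victim{{name: "a", pdbProtected: true}, {name: "b", pdbProtected: true}}
	fmt.Println(len(pickVictims(pods, false))) // 2: preempted anyway
	fmt.Println(len(pickVictims(pods, true)))  // 0: protected
}
```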
B
Yes, you linked this one, which needs to graduate to GA: minDomains, right. But we also have another KEP for topology spreading, which is the one about matchLabelKeys, right. That one is in alpha, so we need to make it beta in 1.27.
C
That makes sense, okay, right: so we're graduating that one from alpha to beta, and the other one is from beta to GA, correct? Yes, sounds good. Okay, yeah, so this is an overview of the features that we will probably focus on in 1.27 for the scheduler. Please, if you have any other things in mind:
C
Please add them here, because we need to go through the opt-in process again, like last time. You need a lead to tag the issue so that it gets tracked. So yeah, back to you.
B
Okay, well, any questions? Or we can go to the next topic.
B
Okay, so, Sergey, thank you.
F
Hi, thank you for having me. I came here from SIG Node for an early announcement of the sidecar KEP that we plan to run in 1.27. If you don't know about sidecars: sidecars are containers that run inside the pod and do some infrastructure work. It may be a logging container, a metrics container, or it may be a service mesh proxy that runs inside the pod and handles the networking, so all the networking goes through this proxy instead of the regular network.
F
The problem with sidecars is that if we implement them as regular containers, they affect the pod lifecycle, and that is a big problem. So we have been trying to address this problem for a long time, to enable this scenario for our customers, and in 1.27 I think we got to the point where we know what the API will look like.
F
So, if you switch to this email: there is a proposal that is currently on the table, and everybody seems to agree that this is what the desired state would be if we designed from scratch. We want to extend init containers with a new type of container with restartPolicy: Always. Those init containers will run in the same order as regular init containers would typically run.
F
The
only
difference
is
that
we
will
wait
for
Readiness
of
these
containers
and
then
we'll
move
on
to
next
one
leaving
this
one
alone,
so
it
will
leave
through
the
installation
stage
through
the
regular
containers
life
cycle
and
they
will
not
block
determination
of
a
port.
So
if
all
other
containers
in
a
job
for
instance
completed
then
this
contains
will
be
terminated
by
coblet.
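A sketch of the proposed shape, written with the corev1 Go types; at the time of this meeting the container-level restartPolicy field was still a proposal, so treat the field and its Always value as illustrative:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	always := corev1.ContainerRestartPolicyAlways
	spec := corev1.PodSpec{
		InitContainers: []corev1.Container{
			// A normal init container: runs to completion first.
			{Name: "fetch-certs", Image: "example/vault-agent"},
			// The proposed sidecar: started in init order, kept running
			// (and restarted on crash) until the pod terminates, and it
			// does not block pod termination once all other containers
			// in a Job have completed.
			{Name: "proxy", Image: "example/mesh-proxy", RestartPolicy: &always},
		},
		Containers: []corev1.Container{
			{Name: "app", Image: "example/app"},
		},
	}
	fmt.Println(*spec.InitContainers[1].RestartPolicy) // Always
}
```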
F
So this is an overview of the change that we propose, and it has a lot of issues for customers and issues for implementation. One obvious one is backward compatibility: before, people didn't expect that init containers would survive through the whole lifecycle, and now some of them will. And from a scheduling perspective, this change means that the resource calculation needs to change.
F
So if you want to understand whether the pod will fit onto a specific node, the formula before was: the maximum across all init containers, and the sum of all other containers, and you take the maximum of those. Now you need to be more elaborate, because these sidecar containers will run during the initialization stage and will survive through the containers stage.
F
You have to understand the order of startup and take the maximum over specific chunks of sub-lists. I can explain more, but the key point here is that the resource calculation will change without a major change of the pod structure: ideally, we extend init containers with at most this one extra field for the new containers.
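A minimal Go sketch of the old and new formulas, assuming simplified containers with only a CPU request; this is illustrative, not the shared helper code itself:

```go
package main

import "fmt"

// container is a simplified view of a container's CPU request.
type container struct {
	cpuMilli int
	sidecar  bool // restartPolicy: Always under the proposal
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// effectiveCPURequest sketches the change described above. The old formula
// was max( max over init containers, sum over regular containers ). With
// sidecars, each already-started sidecar keeps running, so its request is
// added to every init container that starts after it and to the sum for
// the main stage.
func effectiveCPURequest(inits, regulars []container) int {
	peak, running := 0, 0 // running = requests of sidecars started so far
	for _, c := range inits {
		peak = maxInt(peak, running+c.cpuMilli)
		if c.sidecar {
			running += c.cpuMilli
		}
	}
	sum := running // sidecars stay up alongside the regular containers
	for _, c := range regulars {
		sum += c.cpuMilli
	}
	return maxInt(peak, sum)
}

func main() {
	inits := []container{
		{cpuMilli: 500},                // one-shot init container
		{cpuMilli: 100, sidecar: true}, // proxy sidecar
	}
	regulars := []container{{cpuMilli: 200}}
	fmt.Println(effectiveCPURequest(inits, regulars)) // 500: init peak wins over 100+200
}
```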
F
So now I want to open it up for major questions and maybe some recommendations, some suggestions. I can talk about some ideas we've been throwing around for backward compatibility, how to work around this problem, but yeah, I want to hear questions first.
C
About the restart policy here: so you go through the init containers one by one. The current semantic is that if one fails, you continue to restart it, but you don't go down the list until it succeeds, correct? And here, when you get to this one, the one with restartPolicy: Always, what happens?
F
For this one, we will wait for readiness of this container. So the container needs to go through the startup stage and get into a ready state: if it defines a readiness probe, we will wait for the probe to succeed, and then we'll move on to the next one. And restartPolicy: Always indicates that whenever it crashes, even if it crashes after the initialization stage, we will keep restarting this container.
C
So in this case, basically, the difference is that this container will be allowed to continue to run after it finishes initialization. This is the new semantic as well, right? Which means you need to change the calculations to take into account that it's not just the maximum: you need to add it on top of whatever the normal containers in the pod spec have. Exactly, okay.
F
Yeah, it looks like a regular container, but we need it during the initialization stage, because some of the service meshes provide TLS networking for everything, including the init containers themselves; but then there are some init containers that initialize sidecars. So in this example, we download a certificate from some vault, and then, once we have the certificate, we can enable TLS networking for the entire pod, and all containers after this istio-proxy will use istio-proxy for all the network communications, and nobody is allowed to use the regular network after that. Okay.
B
I have a request: when you go through these changes, we should probably have a single library, probably in component-helpers or in the component-helpers staging repo, because we have had issues before where there was a slight difference in implementation between the kubelet and the scheduler, or the cluster autoscaler, or even, I guess, the resource quota calculations. So yeah, we really should have a single implementation that is shared among all of these systems. Yeah, and the cluster autoscaler.
D
Not a problem; if you have it sorted out in the scheduler and resource quota, the cluster autoscaler will be fine.
B
Right, so it's more about the kubelet, the API server and the scheduler having a shared implementation, so we don't have problems. So if there is some intermediate work needed to first do the cleanup or the refactoring, let's do that first and then change the implementation.
F
Pod spec changes have a weird lifecycle: they are really hard to implement, and rolling them out takes a long time because of the version skew policy, right. Okay, so one question I had: once you have this library to share the resource calculation, I know there are custom schedulers, and I know there are some plugins. How much will this backward-compatibility change hit us from the perspective of third-party tooling? Is it something big enough that we need to worry about?
F
So
we
there
is
no
way
we
implement
it
and
we
need
to
go
different
route
and
rename
something
significantly
so
it
will
be
like
breaking
change
ish
like
so
people
wouldn't
just
list
any
containers
any
longer
or
we
it's
manageable
and
with
enough
communication
we
probably
can
get
away
with
just
adding
things
to
this
collection.
B
So when people implement custom schedulers, they are encouraged to use the existing plugins, yeah. So once we fix the scheduler, they should be able to benefit from the existing plugin, the NodeResourcesFit plugin. But they might have extra plugins, I don't know, for quota calculations or things like that, which might break; but I don't think that, as the Kubernetes project, we can offer guarantees in that regard.
B
And given that we would only be able to graduate, let's say, to beta in two releases at the earliest, that gives some period where they can get up to date with the implementation.
E
I want to add one point, thinking about custom schedulers. I looked at the code of some third-party schedulers, like YuniKorn. In their official documentation they claim they support the latest release, 1.25 or 1.24.
E
But that's not really the case, because I looked at, for example, YuniKorn: their dependency is only upgraded to Kubernetes 1.22. That means if we introduce a new field in 1.23 or later, they are not even aware of the new field, so how can they claim to be compatible and honor the new APIs?
F
One proposal we had was to completely duplicate the init section: have a new section with the same properties as the init section has today, for the new containers. But it doesn't make things much better, because people wouldn't know about this new section, and they would probably have the same problem with the new section as they have with a new flag on the existing section. Another proposal was to have fake containers in the containers section that duplicate the sidecar containers and are ignored by the kubelet when the pod actually runs.
F
I don't know how much we want to invest in that workaround; it sounds like a little bit too much to worry about. But if you think it's something we need to think about, I would really appreciate the feedback.
F
By early January we will have a KEP out. It's kind of a big change, and we already have like three or four previous attempts to write this KEP. This time there's a big working group, so hopefully... and we already got API approval, I mean, head nods from the API reviewers on the API, so I think we are in much better shape this time around. So yeah, I will send the KEP link to this group as well once we have it.
B
Okay, so one more question: this is about startup...
F
They will be terminated once the job completes; so this will be implemented. One big problem we are trying to wrap our heads around is what we do during graceful termination of regular pods or regular containers, because if it's Istio providing the networking, the network may be needed during termination. So you need to be really careful with ordering and restarting: if this proxy crashes, we need to restart it even though we are in the termination stage.
F
Yeah, and also they will not block the termination of the pod.
F
Good sidecars today already have this mechanism: they just ignore SIGTERM for some extra duration of time and try to clean up all the buffers. So this will still apply.
B
Excellent, this is very exciting. So, with that: Michelle.
A
Hello! For folks that don't know me, I am one of the SIG leads for SIG Storage. I came in here because I wanted to let folks know about this user community called Data on Kubernetes. It's a community full of people who are trying to run stateful workloads on Kubernetes.
A
So there are a lot of database vendors there that are writing operators, and there are also just users of those operators as well, and I'm trying to organize sort of a regular round-table session.
A
With this group, between Kubernetes maintainers and people from the DoK community. My goal is that we can get direct feedback from users about any sorts of friction or problems or things they would like to see us enhance in Kubernetes to make their lives easier.
A
So we are going to schedule a first round-table session in January, and I'm just kind of going around to all the SIGs to gather interest, and when that session is actually scheduled, I can reach out to everyone here. And I think eventually, if from that first session there are enough topics to talk about to be worth creating a working group or something like that, like a stateful working group, I think that is also a potential option on the table.
A
I think the idea is that running a stateful workload is a lot more than just storage; storage is just a small piece of it. They might have specific scheduling problems, or maybe specific node problems, or networking problems that they need to deal with. So that would be the purpose of the working group: to put this together across SIGs, because the problems might be cross-SIG.
B
Sorry, does this have any relationship to a user group from the CNCF, or is this a separate effort, an ad hoc group?
A
I think just add your name here, and then when the first meeting is scheduled, I will be reaching out to everybody. I guess I should probably find a better way, maybe like a mailing list or something.
B
I guess you could still consider reaching back to the CNCF and figuring out whether starting a user group is a good idea as well.
B
Well, okay, we'll see you in the next meeting... see you in two weeks? We don't have a meeting in two weeks; it's December 29th, so let's cancel that one. We'll see you after New Year's. Happy holidays, happy New Year!