From YouTube: Kubernetes SIG Scheduling Weekly Meeting for 20220811
A: All right, hi everyone. Today is August 11th, and this is the SIG Scheduling meeting. As you know, this meeting is recorded and will be uploaded to YouTube, so please adhere to the Kubernetes code of conduct.
A: Let's see... okay, the problem now is that every time I scroll I have to go back again, which is strange; maybe I need to update Zoom. But I hope that you can see the agenda for today. Now, I guess we can start with the memory leak issue. So this is an FYI.
A: So this is the issue that we have at hand, with the memory leak. Yeah, so thanks to Amy Wayne (I hope I'm pronouncing the name correctly), who discovered the memory leak; it's related to, basically, leaking goroutines. We were creating, basically, a sub-context from the main context of the scheduler and we were not canceling it. We were creating threads, or goroutines, to process the nodes in parallel in the preemption phase, during PostFilter.
A: So the fix is basically to explicitly cancel that sub-context in the function. The bigger issue is that the general context that the scheduler creates at the very beginning shouldn't really be used directly; we should have a sub-context created for the scheduling cycle, and that one gets, basically, cancelled on every scheduling cycle, I believe. So even if we had those issues downstream, they would be caught by the scheduling-cycle-level context getting cancelled. But for the hotfix we just fixed that specific thing, and then we have a follow-up PR that fixes things upstream in the code.
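To make the pattern concrete, here is a minimal Go sketch of the bug and the fix being described; the function and helper names are hypothetical, not the actual scheduler code.

```go
package main

import (
	"context"
	"fmt"
)

// processNodesInParallel stands in for the parallel node evaluation done
// during preemption in PostFilter (hypothetical helper, simplified to a loop).
func processNodesInParallel(ctx context.Context, nodes []string) {
	for _, n := range nodes {
		select {
		case <-ctx.Done():
			return
		default:
			fmt.Println("evaluating node", n)
		}
	}
}

// runPreemptionCycle derives a per-cycle child context from the scheduler's
// long-lived context and always cancels it when the cycle ends. The bug was
// equivalent to dropping the returned cancel function:
//
//	ctx, _ := context.WithCancel(parentCtx) // leaks: child stays linked to parent
func runPreemptionCycle(parentCtx context.Context, nodes []string) {
	ctx, cancel := context.WithCancel(parentCtx)
	defer cancel() // releases the child so it detaches from the long-lived parent
	processNodesInParallel(ctx, nodes)
}

func main() {
	runPreemptionCycle(context.Background(), []string{"node-1", "node-2"})
}
```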
A: We are also discussing ways of how we can prevent these mistakes in the future; maybe we can figure out a way to basically future-proof it. The problem I see here is that I think this has been going on for many releases, not just the past two or three, if I read the code history correctly, and I'm surprised that it got flagged only today, Aldo.
C: Yeah, I think the thing is that this is not leaking the goroutine itself, because the goroutine finishes. What is leaking is the context. And the context is a very small amount of memory; that's why it takes a few days for the scheduler's memory to grow.
A: That's a good point, right, and that's why these scalability tests that we have didn't catch it; they only take a few hours, even though they create so many unschedulable pods at some point, which get retried while the cluster is scaling up. Those kinds of memory leaks didn't show up on scale tests like this. So, as you mentioned, it needs to run for a much longer time for it to manifest itself.
D: Yeah, so according to the analysis: at the beginning I was under the impression that if, say, all the potential nodes have been run, that means the cancel inside of the checkNode function will not be called. That is the current status; the cancel function will not be called. So my impression was that the child context would be garbage collected, but that doesn't seem to be the case. That also applies to the cases where preemption doesn't help.
D: Once the candidate quota has been reached, that means the internal cancel hasn't been called, and then we have those issues: the child context seems to leave some memory out there. So I think that's the problem. And another minor issue is that we didn't pass the child context into checkNode, which is incorrect. We run checkNode in batches, like 16 (I think that's the default number), so that means we still have some in-flight goroutines running even after the cancel is called, because we're passing the parent context instead of the child context. That is fine, because they all finish eventually, but fixing that minor bug will also, you know, gracefully terminate the unnecessary goroutines on the fly.
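A rough sketch of that minor bug, again with hypothetical names: the workers have to receive the child context, not the parent, for the cancel to actually reach them.

```go
package main

import (
	"context"
	"sync"
)

// checkNode stands in for the per-node preemption check (hypothetical). It
// must receive the child context so that cancel() can stop it early.
func checkNode(ctx context.Context, node string) {
	select {
	case <-ctx.Done():
		return // stop early once enough candidates were found
	default:
		// ... evaluate preemption candidates on this node ...
	}
}

func dryRunPreemption(parentCtx context.Context, nodes []string) {
	ctx, cancel := context.WithCancel(parentCtx)
	defer cancel()

	// The real code runs these in batches (16 by default, per the discussion);
	// plain goroutines are used here to keep the sketch short.
	var wg sync.WaitGroup
	for _, node := range nodes {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			// The bug: passing parentCtx here instead of ctx meant cancel()
			// never reached the in-flight workers.
			checkNode(ctx, n)
		}(node)
	}
	wg.Wait()
}

func main() {
	dryRunPreemption(context.Background(), []string{"node-a", "node-b"})
}
```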
E: Also, this fix is quite a one-liner, and it should be easy to spot in the code. Has anyone tried to just grep, just to see all the places where this WithCancel context is created and whether there's a corresponding deferred call of the cancel?
A: But that goroutine might not finish by the time we call cancel, so there's still a lot of nuance around it, and that to me is the scary part. So it doesn't seem like there is a general pattern that we can follow; most of the time, we think, it's case by case. You need to look at it and see: okay, does it make sense to call, and when to call, cancel, basically.
E: Yeah, that makes sense. I just wonder if there is any easy condition to spot; for example, the cancel is called inside a function which is then processed in parallel.
A: Then you might be calling cancel before that goroutine finishes, right? So that pattern basically invalidates the idea of proposing a rule where, for every creation, you right away call a deferred cancel. It doesn't apply in general, because of that case where, basically, a goroutine gets created in the middle and doesn't finish in time by the defer.
D: So in that case: say we have a function that creates the scheduler and starts everything, and inside that we create the informers and start the informer factory, because inside the informer factory it actually spins up the long-running goroutines. If, in that function, we defer the cancel, that means we would sort of close the child context passed into the informer factory builder functions, so that doesn't work. So yeah, just a reminder.
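A sketch of why the blanket deferred cancel fails for this case (hypothetical names): the context handed to the informer factory must outlive the function that creates it.

```go
package main

import (
	"context"
	"time"
)

// startInformers stands in for the informer factory startup (hypothetical):
// it spins up a long-running goroutine that must keep working until the
// whole scheduler shuts down, not until the calling function returns.
func startInformers(ctx context.Context) {
	go func() {
		for {
			select {
			case <-ctx.Done():
				return // only on scheduler shutdown
			case <-time.After(10 * time.Millisecond):
				// ... watch/list work ...
			}
		}
	}()
}

// setup shows why a blanket "defer cancel()" rule breaks here: cancelling on
// return would kill the informer goroutine that is meant to outlive setup.
func setup(parentCtx context.Context) context.CancelFunc {
	ctx, cancel := context.WithCancel(parentCtx)
	startInformers(ctx)
	return cancel // must be kept and called at shutdown instead
}

func main() {
	shutdown := setup(context.Background())
	time.Sleep(50 * time.Millisecond) // informers keep running meanwhile
	shutdown()
}
```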
A: I think, in general, as we discussed, for pod scheduling you have two main goroutines: the main one, right, which is the one executing the scheduling cycle, and the binding one. So if those two have their own, you know, context that gets defer-cancelled, then that should be fine, right?
D: Yeah, we should do a thorough check. I think you mentioned that the static analysis tooling can be helpful.

A: Maybe that's why they're not using it, or they're not using that static analysis, in k8s.
E: Just one last note on this: you might check, just locally for example, by replacing all the context.WithCancel calls with some wrapper, to track whether each call to WithCancel has a corresponding cancel invocation. Every time WithCancel is invoked, we log that this specific new context was created, and then every time the cancel is invoked, we log another message saying that this specific context was closed, and maybe try to do a one-to-one mapping to see if at least that helps to discover missing invocations of the cancel function. Just a suggestion, maybe.
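A minimal sketch of the suggested local debugging wrapper; withCancelTracked is a hypothetical helper, and all it does is pair "created" and "cancelled" log lines.

```go
package main

import (
	"context"
	"log"
	"sync/atomic"
)

var nextID atomic.Int64 // requires Go 1.19+

// withCancelTracked is a hypothetical local debugging wrapper around
// context.WithCancel: it logs one line when a context is created and one
// when its cancel runs, so the two can be matched one-to-one to find
// contexts whose cancel is never invoked.
func withCancelTracked(parent context.Context) (context.Context, context.CancelFunc) {
	id := nextID.Add(1)
	log.Printf("context %d created", id)
	ctx, cancel := context.WithCancel(parent)
	return ctx, func() {
		log.Printf("context %d cancelled", id)
		cancel()
	}
}

func main() {
	_, cancel := withCancelTracked(context.Background())
	cancel() // any "created" line with no matching "cancelled" line is a suspect
}
```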
A: In terms of metrics, I guess it's about monitoring; the memory leak probably takes a long time to show up. Or maybe a metric: I don't know if there's a metric we can expose related to the number of contexts that still exist. Maybe not. We have one for goroutines, but I don't know if we can have one for contexts.
A: So for the discussions, we have one related to recording plugin metrics. I think that's you, John?
D: So basically, it's about recording the metrics for each plugin's execution. That is good, but when you see some scheduling spike, if it happens that that cycle doesn't fall into the sampling window, then you don't have enough information to correlate the scheduling spike with a particular scheduling plugin's performance.
C: When the scheduler config goes to GA, it doesn't mean that you cannot add new fields. It just means that those fields will exist until there is a v2.
C: This configuration doesn't seem particularly worrisome to me, if it's needed. I think it's completely legal to have this flag, this configuration, in the scheduler configuration.
D: And in our component config we actually embed some specs from the component-base repository. One particular field is called debugging, but that debugging spec is associated with some profiling config. So I'm wondering, if we want to support some debugging-related fields, whether we can make it a dedicated struct so that it doesn't get confused with the other, regular specs.
A: Yeah. What I'm saying is you'd need to duplicate them; it's true it's experimental, but when you duplicate, then you duplicate them as v1, right?
A: If we just create a struct with a reasonable name, maybe something related to metrics, then all metric configurations slowly start to go there.
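A hypothetical sketch of that idea (not the actual KubeSchedulerConfiguration API): group the metric-related knobs under a dedicated struct instead of adding top-level fields.

```go
package config

// MetricsConfiguration is a hypothetical dedicated struct for grouping
// metric-related knobs, so they do not mix with the regular fields or with
// the embedded debugging spec from component-base.
type MetricsConfiguration struct {
	// RecordPluginMetrics would record per-plugin execution metrics on
	// every scheduling cycle rather than only a sampled subset.
	RecordPluginMetrics bool
}

// SchedulerConfiguration sketches where the struct would hang off the
// existing config; field names here are illustrative only.
type SchedulerConfiguration struct {
	// ... regular fields ...

	// Metrics groups metric-related options; new knobs land here over time.
	Metrics MetricsConfiguration
}
```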
C: Yeah, hi, sorry. I was listening, so what's the conclusion: to go ahead and write a PR for this?
C: Yes, I would say go ahead, and we'll just figure out during the review whether the name makes sense or whether we want a struct. I think I would just start with that.
D: I think another requirement says: okay, I don't want to shuffle the nodes for the percentage of nodes to score; just start with index zero every time instead of shuffling the index. That is sort of another one, but basically there's been no follow-up on those requirements, so we don't know how many users are demanding that.
E: Sure. So a few weeks back, Lukas started implementing the contextualized logging for the kube-scheduler; the kube-scheduler is actually the first component that is getting this nice improvement. The reason to mention this is to raise awareness in the SIG Scheduling group and also to discuss whether you agree on the approach. Right now the changes he has made are more or less just propagating the new logger and replacing klog with the new logger, more or less keeping the same key-value pairs, and he is avoiding using the WithName and WithValues methods, because there's been some performance degradation.
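A minimal sketch of the migration pattern being described, using klog.FromContext and klog.NewContext from k8s.io/klog/v2; the function here is hypothetical, not taken from the actual PR.

```go
package main

import (
	"context"

	"k8s.io/klog/v2"
)

// scheduleOne is a hypothetical function showing the pattern: the logger is
// pulled from the context instead of calling the global klog functions,
// keeping the same key/value pairs. WithName and WithValues are deliberately
// not used, per the performance concern mentioned above.
func scheduleOne(ctx context.Context, podName string) {
	logger := klog.FromContext(ctx)
	logger.Info("Attempting to schedule pod", "pod", podName)
}

func main() {
	// The caller attaches a logger to the context once, near the top.
	ctx := klog.NewContext(context.Background(), klog.Background())
	scheduleOne(ctx, "example-pod")
}
```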
E: So if you have some time to take a look at the PR, that would be great. Lukas provided the performance measurements for the changes, but those were most likely performed locally. I think Aldo already checked those performance measurements, and he didn't have any disagreements with them. So the idea was, or is, to go ahead and see if there is any performance degradation once the PR merges, and if there is, we can always undo the changes.
A: So that issue is quite long; it seems there are so many different discussions that happened. I was just going back through the ones related to introducing APIs on the pod itself to control debugging, and I'm guessing we are not pursuing those ideas, and at this point this is only about the logger changes.
C: No, the overhead comes from getting a logger from the context, because if you have a chain of contexts, then each time you say get-key-from-context, which is what you would use for getting the logger, if it's not in the first context, it goes to the parent, and so on and so forth. So if your context hierarchy is very deep, then you're always going all the way up.
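A small sketch of that lookup cost: context.Value walks from the leaf context toward the root, one layer at a time, so deep hierarchies make every logger lookup slower.

```go
package main

import (
	"context"
	"fmt"
)

type loggerKey struct{}

func main() {
	// The value (e.g. the logger) is stored near the root of the hierarchy.
	ctx := context.WithValue(context.Background(), loggerKey{}, "logger lives here")

	// Each derived context adds one link to the chain. The dropped cancels
	// are for brevity only; see the leak discussion earlier in the meeting.
	for i := 0; i < 10; i++ {
		ctx, _ = context.WithCancel(ctx)
	}

	// Value() walks from the leaf toward the root until it finds the key,
	// so every lookup pays for the full depth of the chain.
	fmt.Println(ctx.Value(loggerKey{})) // traverses all 10 layers
}
```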
C: I think there is no general guidance.
E: No, unfortunately not, because Lukas was mainly involved in this effort, so I'm just talking on his behalf, more or less. But what you just mentioned about going deeper into the contexts, that is a valid concern, and a reason for waiting until we resolve this, and maybe to redesign, if there's a way to redesign it, the searching in the context; maybe to introduce some sort of a cache.
E: Instead of going up and up and up. But first we need some performance measurements, or at least to find a dedicated machine where we can measure the differences. I'm not sure how deep the context actually goes. But, you know, right. Yeah.
A: And is this a k8s library, or something we are importing from the outside? Is it still in the Kubernetes project, or completely external?
C: Right, we just have to be careful, and we have to start small.
D: Yeah, I have one comment on this. If we think about logging, usually the most useful scenario is that we have some issues, or we have some unschedulable pods; we check the logs and we want to get more information. So in terms of this practical usage, I would look at the benchmark tests, like the preemption test, which, although it doesn't produce a lot of unschedulable pods, will fail the first attempt of the scheduling.
D: I would say to craft some tests that produce some unschedulable pods, so that those unschedulable pods can go through thousands of retries, like 3,000 or 2,000, you know, to generate more logs. That may be more helpful, because if you're running the happy path of scheduling pods, by default I think it's 500 nodes, so that doesn't generate a large enough volume of data to look into potential performance issues of the contextual logging.
A: Right, like the log file: I don't know how you process it, and I guess different cloud providers do different things, right? Like they tail it and send it somewhere else.
A: But how many logs you're printing is one issue, and using this new approach for logging is probably a different one. Am I understanding that distinction correctly?
A: Yeah, I get it, right. So there could be paths where we have deep contexts, but those paths are not regularly hit, at least in our benchmarks.
E: I don't think that the amount of logs will increase; I haven't seen the log lines produced by the contextualized logging, but I think it will be the same amount.
A: Or to confirm that there are no recommendations for additional testing, and just basically that we're moving forward.
E: Great. So we have Mike on the call as well; Mike, still feel free to interrupt me at any point. So right now we are in the process of migrating the strategies into plugins.
E: We don't have any specific deadlines, but the idea is that, at least before we cut a new descheduler release, we want all the strategies to be migrated to plugins and the framework, or its first implementation, to be in place.
E: We are happy to accept new suggestions on how to improve the code base. So, Mike, do you have anything else to share?
B: Yeah, I think that was a pretty good summary, thanks. Jan has been taking a lot of the lead on, you know, getting the reviews done and making sure that these strategy migrations are going well.
B: This is really just a big refactor of the descheduler code base. And although the framework approach and the plugin approach are going to be great for people who want to, you know, extend it or customize it (maybe there are use cases out there like that), our goal is also to create a much more maintainable code base, so that we can build this more towards a stable tool that people can use and, with that, a bit of a framework that they can develop their own eviction logic with. So yeah, like Jan said, this shouldn't impact any of our upcoming descheduler releases.
B: The way that we're migrating them is really similar to how the scheduler migration went, in that we're kind of wrapping the migrated plugins in functions that treat them as the new plugin type while we work on the other strategies. So this should be really seamless to anyone that's using the descheduler currently, or who just doesn't care about this at all.
B: So there shouldn't be any effects from that, and if there's anything that seems like it would delay our descheduler release significantly, we should be able to push through and create a release without having to hold anything up on any migrations. But they've all been going really smoothly; thanks to everyone that has been contributing.
B: If there's any feedback, feel free to add it on the Slack, or bring it up at a SIG meeting or in GitHub. But, like Jan said, we're trying to sort of focus on this refactoring effort right now before we take on big new improvements. So yeah, thanks for bringing it up, and thanks for all of your work and for everyone else's contributions.
A: All right, thanks. And Mike, are there any questions or comments?