From YouTube: Kubernetes SIG Scheduling Weekly Meeting for 20220310
A
All right, hi everyone, thanks for joining in. As you may all know, this meeting is recorded and will be uploaded to YouTube, so please adhere to the Kubernetes guidelines. We've got three issues. I added the first one because it's relatively urgent: a scalability bug that I think is worth quickly discussing because of the code freeze coming in the next couple of weeks, so we might want to fix it before that. So the issue is this.
B
Yes, it's a little small, but maybe you can reduce the size of your window. Yeah.
A
So in the scalability tests, they're testing, I think, on some 5000 nodes, but there are a lot of DaemonSets being deployed in that test, and the thing they noticed is that regular pods get scheduled quite fast; they can get upwards of 300 pods per second, which is great. But for DaemonSet pods they are getting below 100 pods per second scheduling throughput, and the issue is most likely because of this, if you can see here.
A
So for a normal pod, we don't really need to evaluate all 5000 nodes; we just try to find 500 eligible ones and score them. But for DaemonSet pods there is only one feasible node in the cluster.
A
The attempt to find 500 is basically going to fail, so we have to go through all 5000 nodes and eventually end up with the one anyway. Compared to a normal pod, that is a lot more expensive. It's not the scoring phase itself that is more expensive.
A
It's the fact that we have to go through all 5000 nodes just to pick the one node that is eligible. In an ideal situation the scheduler would be aware that this is a DaemonSet pod and would just pick the node and move on, rather than actually trying to find it by evaluating the pod's node affinity.
A
So, to give you some background, the way DaemonSets work is that the DaemonSet controller creates the pod and injects an affinity to the specific node where the pod should be scheduled, like on node x or node y, the exact node where the DaemonSet pod should go. So the scheduler is not aware that the pod is a DaemonSet pod; it just applies the node affinity rules to it, and that's basically the issue. Do you have any questions?
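For reference, the injected affinity looks roughly like the following; this is a minimal Go sketch using the k8s.io/api/core/v1 types (the helper name is illustrative, not the actual DaemonSet controller code):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// daemonSetNodeAffinity builds the kind of node affinity the DaemonSet
// controller injects into each pod it creates: a required matchFields
// term on metadata.name, pinning the pod to exactly one node.
func daemonSetNodeAffinity(nodeName string) *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					// MatchFields (as opposed to MatchExpressions) selects on
					// node fields; metadata.name is unique per node.
					MatchFields: []corev1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", daemonSetNodeAffinity("node-x"))
}
```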
C
Thanks for raising this issue; it's a very interesting symptom. I definitely think we should add some specific logic in the node affinity filter, because, if I remember correctly, the DaemonSet controller injects a very special field, called metadata.name or something like that, and right now that is exclusively used by DaemonSet pods.
A
Yeah. What we have proposed here, and it's not special to DaemonSets, is basically that in node affinity you can do selection based on labels or on fields, right? So you have two types of node selection.
A
If you look at the node selector spec, it's not really anything special to DaemonSets, but it should be reliable to basically say: if we find this affinity on metadata.name...
A
We can assume that there is only one node with a matching metadata name, right? We can't do that for labels; labels are more like free text. But for metadata.name this is already enforced by the API. Yes.
A
Mapping to the node, yes. So this should be reliable, and we can apply it as a pre-filter in the node affinity plugin. The other thing we could potentially explore is the DaemonSet controller setting the nominated node name, since we already optimize for that; I think we've done that in the past couple of releases, right? I don't think that would work, because...
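A minimal sketch of the pre-filter shortcut being discussed, assuming the pod carries the DaemonSet-style matchFields term; the function name and placement are illustrative, not the actual NodeAffinity plugin code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// singleNodeFromAffinity returns the one node name a pod is pinned to via a
// required matchFields term on metadata.name (the shape the DaemonSet
// controller injects), or "" if there is no such term. A pre-filter could use
// this to hand the framework a single-node candidate set instead of running
// the filter over all 5000 nodes.
func singleNodeFromAffinity(pod *corev1.Pod) string {
	aff := pod.Spec.Affinity
	if aff == nil || aff.NodeAffinity == nil ||
		aff.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution == nil {
		return ""
	}
	terms := aff.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms
	// Terms are ORed, so the shortcut is only safe when a single term pins
	// the pod to a single node; that lone term is the DaemonSet case.
	if len(terms) != 1 {
		return ""
	}
	for _, req := range terms[0].MatchFields {
		if req.Key == "metadata.name" &&
			req.Operator == corev1.NodeSelectorOpIn &&
			len(req.Values) == 1 {
			return req.Values[0]
		}
	}
	return ""
}

func main() {
	pod := &corev1.Pod{Spec: corev1.PodSpec{Affinity: &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchFields: []corev1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"node-x"},
					}},
				}},
			},
		},
	}}}
	fmt.Println(singleNodeFromAffinity(pod)) // prints "node-x"
}
```

Because metadata.name is unique, that one node can go straight to filtering and scoring on its own.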
B
In the next cycle, you mean? Yeah, in the next cycle. Yes, we try not to delete more pods until all the ones that are marked for deletion are deleted. Sure, but then there is nothing, so, right. But there might be another pod that preempted a pod, and then you're only preempting that one; you're going to wait for that to finish before you start preempting more, right?
A
I guess in general this is a hacky situation that, yes, is not really the cleanest.
A
The other issue here is that this is one specific case for DaemonSet pods, but in general, when there are a lot of pods getting scheduled, we may not really need to always try to find 500 nodes, or whatever the number is at that scale. So, to make the scheduler more efficient, I was suggesting here as well that we could adapt the number of nodes that we need to find for scoring based on the length of the pod queue.
A
So if we have a large queue, the scheduler figures out: I have a lot of work to do, let me not try to be extremely optimal and find the best node possible; I will just try fewer nodes, basically, so that I make progress and get through the queue faster.
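A rough sketch of that adaptation, with hypothetical names and thresholds (this is the idea being floated, not actual scheduler code):

```go
package main

import "fmt"

// adaptiveNodesToScore shrinks the number of feasible nodes the scheduler
// looks for as the pending-pod queue grows, trading placement optimality
// for throughput when there is a backlog. Thresholds are made up.
func adaptiveNodesToScore(baseNodesToFind, queueLen int) int {
	const (
		busyQueue = 1000 // above this, be far less picky
		minNodes  = 10   // never score fewer than this many nodes
	)
	n := baseNodesToFind
	switch {
	case queueLen > busyQueue:
		n = baseNodesToFind / 10
	case queueLen > busyQueue/10:
		n = baseNodesToFind / 2
	}
	if n < minNodes {
		n = minNodes
	}
	return n
}

func main() {
	fmt.Println(adaptiveNodesToScore(500, 50))   // calm queue: 500
	fmt.Println(adaptiveNodesToScore(500, 5000)) // busy queue: 50
}
```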
C
Aggressive, yeah. Okay, here's another concern, I think from some serverless users: right now we evaluate the nodes according to the percentageOfNodesToScore field, right? But internally, the serverless users want to do the bin packing in a very fixed manner. I mean, internally we sort of do this kind of random node selection, right, in a round-robin way or whatever.
A
But that should be easy to solve; I don't even think we need to continue from exactly where the previous round stopped. You can just select a random starting point in the set of nodes and start from there. And the question here is how many nodes I should look for from that point on: whether I should look for 500, or whether maybe 10 is enough since I have a ton of work to do. Does that answer your concern?
C
No, I don't mean it's directly related to your proposal to evaluate the nodes according to the name; it's just a related concern from users about the behavior that we evaluate the nodes in a random way. For now that's the default and you cannot change it; it's totally random if you're running a large cluster, right?
A
Okay, please comment on the issue if you have any questions about it or if you have an opinion, or, if you want, you can speak up now.
A
Okay, we've got a couple more topics here; I just want to give them enough time. Mike?
D
...whether this is a good set of goals for the project or not. The basic summary is that I'm proposing that we fundamentally refactor the descheduler repo as it is into more of a framework that lets people build their own deschedulers, something that's similar to the scheduler framework, or really exactly like it.
D
From a concept standpoint, this idea of a descheduler framework has been kicked around for a while, and I think now is a good time to start organizing it a bit better, because we have a lot of different proposals and bigger projects that I've tried to outline in this doc, in the ways that they relate to the idea of a descheduler framework.
D
The project got to that point, but it's starting to hit some scaling issues with its growth, and we're reaching a point where not only do we not have enough bandwidth to really effectively review and merge new proposals and changes, but some of them are just not feasible without conflicting with things we have already built into the project as assumptions.
D
Our solution to that so far has been to add new feature gates, flags, and settings to our API, and this is also causing a scaling issue where our API keeps growing and never reaches any kind of stable state where people can start reliably trusting it and using it as a stable project.
D
The goal here is basically to shift feature development out of the hands of the maintainers of the descheduler project and to enable, but also encourage, third-party developers to do it, so that for people who want really custom pod eviction logic, it can be idiomatic for them to build that themselves. That frees up our maintainers to work on building a stable platform that other people can reliably build on.
D
So that's basically the summary of what I'm talking about here. I really just want to get a lot of feedback from anyone that's interested in how this would work and what it would look like. I tried to put a couple of questions at the bottom of the document that I posted to get the conversation started: what would the descheduler framework look like? How do the existing features fit into it?
D
That's really what I'm looking for: input at this point. When we get enough input, and I'm imagining that this will take a little while to sort out and organize, I would like to start on some concrete work for this in the next release, like the Kubernetes 1.25 timeline, if we decide to go ahead with this kind of approach. But as part of the discussion I'm also open to any completely alternative ideas.
D
My main goal here is just to alleviate some of the pain points that we've had and really steer the descheduler project towards a more stable platform for these kinds of eviction controllers. So I'm open to feedback; I'm just putting this out there right now, and we have an issue on the GitHub repo. I'm sure there are probably people on this call who have had pull requests to the descheduler that have sat for a long time, or gone through a lot of review, or gotten blocked because of conflicts with everything.
D
So my desire here is to alleviate that and get a better workflow for the project. If anyone has the time and is interested, please leave some feedback, answer some of the questions that we have in there, and work on this proposal. I want this to be a collaborative proposal from the people that are invested in it.
D
So yeah, that's basically it. I don't want to discuss any of these questions right now; I want to let people have some time to think about them and work on them offline. Maybe in the next meeting, or, if this is a big enough task, maybe this becomes its own regular meeting among the people working on it, I don't know. That's what I'm looking for from people. So that's it in a nutshell.
D
If you have the time, please take a look at this and feel free to help out; we're looking for people who can contribute.
B
Sorry, is the objective primarily to be able to review better and have some separation of concerns within the code, or is it also about starting a new descheduler plugins repository that is separate from the descheduler? Yeah.
D
I think that's one of the open questions that I would really like to get some feedback from people on. I don't want to bias the discussion too much, but my thinking is to alleviate the reviews, like your first point there: make reviews easier for the descheduler and give our maintainers the time to work on building a more stable component that can eventually graduate to a v1 API.
D
It could follow the same pattern as scheduler-plugins, where we have a more open process for people to contribute strategies and things like that, but that's part of the design that we'll have to figure out and see if people are interested in.
C
A question worth exploring is whether you want to do it the same way that the scheduling framework and scheduler plugins are wired: right now we have to recompile the whole binary, right? In the case of the descheduler I'm not sure, because it's not that central a component; maybe you can leverage some other interaction mechanism, like gRPC or something else, so you don't need to recompile the whole binary.
D
Yeah, I think that's a good idea and also part of what I'm looking for, because the scheduler framework has really been pretty successful to this point, and I'm hoping there are maybe some things we can learn from it to apply to this, points like that, where making it so you don't have to recompile a new descheduler would be useful.
F
Are there any thoughts on adding descheduling to the inbuilt scheduler, so that after scheduling, if certain constraints are no longer being maintained, there's some kind of descheduling that can happen in the inbuilt scheduler itself?
D
That's been asked a lot; people have requested it a lot, but there isn't any plan for that right now, and I can tell it's...
A
There are similar questions about autoscaling as well: the autoscaler is also a separate controller, not, as you might have expected from some other systems, where scheduling, descheduling, and scaling are all a single component.
F
Yeah, I think it's a good point. I mean, yeah, I agree; I think it's essentially more of a scheduling decision from a scheduler point of view versus rearranging pods. I don't know if the descheduler is the right place. The main issue I see is probably the fact that you need to write two different configurations for scheduling and descheduling, and maybe there is scope for one single spec that can basically do both the schedule-time and the run-time configuration.
D
I don't want to take up too much time with the descheduler, because I think there was one more thing on the agenda, but just to wrap that part up: I think this could potentially relate to the framework idea, because...
D
If this is an easy enough and portable framework, maybe we do revisit that, but I think the idea of merging the descheduler into the scheduler as a whole is its own pretty big topic that I don't want to take up too much time discussing; we could definitely open another issue offline to start the discussion about that.
A
Thank you, Mike. I'm happy to stay for another 15 minutes if you want to present the video anyway.
G
Right, so, hello everyone; I'm not sure how much time I have, but I'll try to be quick. This is a proposal for a scheduler plugin that deals with overcommitment. Chen is also on the call and can chime in any time; I believe you have to go in a couple of minutes. In any case, why do we need to look at overcommitment? As we all know, the scheduler just places pods based on what they request, and that's what's guaranteed by the Kubernetes scheduler.
G
You're guaranteed to get your requests, not your limits. A limit is a nicety that a container could use and grow into, and spike load into, but there's no guarantee from the scheduler that you're going to get that many resources, which makes a lot of sense. The problem is: if the scheduler is not aware of this overcommitment and not aware of the limits of these pods, then they could get scheduled on one node, for example, and then they compete with each other, leading to performance degradation because of CPU throttling.
G
You could even have OOM kinds of things if the resource is memory. So this particular plugin is proposed to look at limits in addition to requests, and to consider them, not guarantee them, but consider the limits, so that you place the burstable and best-effort pods on nodes where they can grow.
G
Right, so, as I said, burstable and best effort: that's what this plugin is targeting. For the guaranteed ones there's nothing else you can do, because this is about limits, right? So there are these containers that are burstable or best effort; they are telling us that they may grow beyond what they requested. So it would be nice if the scheduler could consider that and give them room to grow, and that room should not be congested.
G
So that's basically the idea; it's a very simple idea. If we move on to the next slide: what this plugin is going to do is evaluate two risk factors, and these two risk factors make up a risk value that is evaluated into a score. So as such it is a score plugin; it will score the nodes based on the risk, and the risk has two factors, what I call the limit risk and the load risk.
G
The limit risk is based on the values of the requests and limits of all the pods in the cluster, as well as the pod that is to be scheduled, and the load risk has to do with actual load measurements. Luckily, in the Trimaran plugins we already have two plugins that use a load watcher, and the load watcher provides, through Prometheus and other providers, data about the actual load on those nodes.
G
So, next slide, please. The suggestion is to add this plugin as a third plugin to the Trimaran family, because they all use the same load watcher that provides the loads. As such, this is a load-aware plugin, as well as limit-aware. How does it differ from the other two, at a very high level? The Trimaran plugins are all load-aware, so they all do something with the load. The first one looks at the average and basically considers all pods.
G
The second one looks at not only the average but also the variation in the load, and the proposed one, Trimaran 3, would also look at the limits in addition to the load. And as far as load goes, it looks at more than just the average and the variation: it has to compute the tail of a distribution, which is the probability that, if you place a pod on a particular node, it will compete with others.
G
Okay, next one, please. So I show here two use cases: one that illustrates the limit risk, the other the load risk. As far as limit is concerned, let's imagine you have the two nodes in this picture. Node one on the left has two pods, a and b; node two on the right has two pods, d and e, and they have these requests.
G
Both nodes have the same total request values, so a default scheduler would pick either one to place the new pod, pod x. Now, if you consider limits, you can see that on node one, if you add up the limits of pods a and b, they go beyond allocatable, beyond capacity, whereas on node two they don't. So, based on that, it's probably better to place the new pod x on node two.
G
Next slide, please: how the risk itself is computed. This is a very simple formula that says: I'm going to add up the allocated, which is the requests on the node, and I'm going to add up all the limits of all the pods on the node. Then I evaluate something called the excess, which is the gap between the allocated and the total limit, including the pod to be scheduled, and the allowed, which is the available room on the node, and I do this for all resources.
G
So
if
cpu
memory-
let's
say
so
I'll,
do
this
for
cpu
I'll
do
this
for
memory
and
then
this
risk
limit
is
is
nothing
other
than
a
value
between
0
and
1.
That
says,
if
the
risk
is
0,
then
then
there's
no
risk,
and
that
would
happen
if,
for
example,
excess
is
the
same
as
allowed,
then
there's
no
risk
of
of
going
beyond
what
you
want.
You're
gonna
get
what
you
want
if
you
wanted
to
and
the
risk
of
one
when
you
really
don't
have
room
at
all
to
grow.
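One plausible reading of the formula as described (hedged; the exact definition is in the plugin proposal), per resource r on a node:

$$\mathrm{excess}_r = \sum_{p \in \text{pods}} \mathrm{limit}_{p,r} - \mathrm{allocated}_r, \qquad \mathrm{allowed}_r = \mathrm{allocatable}_r - \mathrm{allocated}_r$$

$$\mathrm{risk}^{\mathrm{limit}}_r = \min\Big(1,\ \max\Big(0,\ \frac{\mathrm{excess}_r - \mathrm{allowed}_r}{\mathrm{excess}_r}\Big)\Big)$$

defined for positive excess (with no excess there is no overcommitment and the risk is 0), so that the risk is 0 when excess equals allowed (total limits fit within allocatable) and 1 when there is no room left to grow. With hypothetical numbers matching the earlier two-node picture (allocatable 4 CPU, requests 2 CPU on both nodes): node one with total limits of 5 CPU gives excess 3, allowed 2, risk 1/3; node two with total limits of 4 CPU gives excess 2, equal to allowed, risk 0.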
G
So that's the formula to compute the limit risk; that's the first factor. The next slide shows use case two, which is about the load, the actual load.
G
So here is the same arrangement of node one and node two again, but on node one, if you look at the load distribution there, there is some chance that the load is between two and four (four is the capacity of both nodes), whereas on node two, and it's an extreme case, all of the load is between two and four.
G
So if I place a new pod x on node two, it's really going to be competing with the others, more likely than on node one. So in this case I would favor node one based on usage. Okay, so that's the idea; move on to the next slide. So how do we compute the risk of competing with this load? We look at the distribution.
G
We look at the probability that the usage, or load, is beyond the allocated; we compute that and we say that's the risk, and again it's between 0 and 1. If all the load is below the allocated, then there's no risk at all; if all the load is above the allocated, then there is a lot of risk; and anywhere in between, okay.
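In symbols, under the same hedged reading, the load risk is the tail probability of the measured usage distribution beyond what is allocated:

$$\mathrm{risk}^{\mathrm{load}}_r = \Pr\big[\mathrm{usage}_r > \mathrm{allocated}_r\big]$$

which is 0 when the whole distribution sits below the allocated amount and 1 when it sits entirely above it.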
G
So, next slide, please: what do we do with these two risk factors? Very simply, we just take a weighted sum. We have the risk limit and the risk load, and we multiply by some weights; the default is a half each, so they contribute equally to the total risk. That's configurable. And so the total risk for that resource on that node is one value between zero and one.
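So the combination, as described, is a weighted sum per resource, with configurable weights defaulting to one half each:

$$\mathrm{risk}_r = w_{\mathrm{limit}}\,\mathrm{risk}^{\mathrm{limit}}_r + w_{\mathrm{load}}\,\mathrm{risk}^{\mathrm{load}}_r, \qquad w_{\mathrm{limit}} = w_{\mathrm{load}} = \tfrac{1}{2}\ \text{(default)}$$

and, as a score plugin, it would presumably map lower total risk to a higher node score.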
C
Yeah, thanks, Asser, for raising this; I think it's a very practical challenge in the real world. So I have one question. I think you mentioned two factors that serve as the input of this score plugin. In terms of the first factor, you look at the limits, and I think that depends on how correctly the user sets their limits. So what if they just set the request at, say, one CPU and the limit at a very large number, but they never reach that limit?
C
Does that mean the first factor will be totally useless, and then you have to rely entirely on the second factor, which looks at the real load of the node?
G
That's a good point, yeah. Of these two factors, the first one, as you noted, makes its decision based on what users specify as limits, and they could be wrong; that's when the second factor fixes that, by looking at the actual load. So on one hand you could say, well, why do I need the first factor? On the other hand, what if the load is the one that is not right?
G
Judging only by the actual usage may not be enough to assess what the limit is. In other words, if a container once in a while goes above its requests, then based on how much it goes above its requests I cannot say, okay, the limit is going to be the maximum of that, or twice the maximum, or something like that.
G
Using that as an estimate may not be accurate, because maybe that's what happened in the past, but in the future, all of a sudden, we might see a big spike from that container.
A
Right, I think this is what I understood: one factor is basically trying to take the future into account, and the other is doing things based on the past.
G
Right, one is based on the past measurements and the other one is based on potential, what's in the specs, yeah.
G
They don't have to use all three, of course. I'm abbreviating them as Trimaran one, two, and three, and in the table I've contrasted them. Typically one would choose only one, because you don't get any benefit from using all three. First of all, Trimaran three here doesn't do anything for the guaranteed pods.
G
With guaranteed, you're going to get what you're asking for, so I believe that, unless there is an indirect effect, it won't help or hurt the guaranteed ones. So what about burstable and best effort? What Trimaran one and two do is base their placement on the actual load on the nodes, whether the average or the average plus variation, but they don't look at limits at all.
G
So if a user is concerned about pods, or containers within the pods, that are going to grow beyond their requests, then definitely Trimaran 3 would be the right one to choose.
B
Yeah, I guess I wonder what the best way is to converge them: is it possible to have one plugin that you simply configure differently to behave differently, instead of having three plugins, just to simplify the user experience? Because someone that is not very familiar has to choose, and maybe they decide that using the three at the same time is best, and it might not be. If you just give them one with different configurations, it might be easier to use.
G
Yeah, I see what you mean; I think it's a good suggestion. My only feedback is that the common thing about these three Trimarans is that they use a load watcher; they use data about the measured metrics of resource usage on the nodes. That's the common thing, but they have different objectives.
G
Each one of them has a different objective; the only thing in common, and why they belong to the same family, is the fact that they are load-aware. I can imagine many more scheduler plugins are going to be proposed saying: oh, we also want to be load-aware, but we have this other objective in mind.
B
You can clarify the documentation, and maybe that's enough, yeah. But that's my only feedback.
C
Any other questions? Yeah, no question, but a suggestion: maybe Trimaran 3 can also apply to all pods, because the limits themselves are not accurate, right? It depends on how the users benchmark their workloads and set their limits.
G
The experiment would be something like: I would have a cluster with some pods that are running and going beyond their requests under heavy load, maybe once in a while, and then I have this new pod that is also going to have a lot of demand, and I would schedule it with the default scheduler and with Trimaran 3 and see, for example, how long it took to run under one scheduler versus the other.
A
I mean, this is not a scheduling latency or scheduling throughput improvement; this is an improvement in how many resources a container is able to burst into. So basically what you're saying is that you want to measure the performance of the workload, right, not the scheduler performance? Yes, yes. Did you do that?
A
And my follow-up question here is: if we are to suggest these plugins to customers, to users, what would we tell them? What types of workloads, example workloads? It's not enough to say, okay, I have this synthetic workload that basically maximizes the success of the plugin that I implemented and you try to match into it.
A
You know, you can always manufacture a workload that works well for the plugin, so it would be really, really interesting if you could have a real workload that shows it actually did better. And it's not just the application per se; it's also the mix of workloads running on the cluster that would contribute to whether this is actually going to be useful or not in the bigger scheme of things.
G
Right, yeah, I agree with you 100%. Let me add to that: I recently saw a scenario, I believe it was on SIG Scheduling somewhere, where someone was saying that for this kind of workload, when the pods start, the init container takes a lot of resources, and after that it's very normal. So there's a lot of demand in the beginning, and if you have a bunch of these all of a sudden, there's a surge in demand, in usage, right?
G
So I was thinking maybe this would help in this case, by just detecting that these pods have a limit that they grow into, so they're not going to be scheduled on the same node; they're going to be scheduled basically in a kind of round-robin (this is an extreme case) across the nodes in the cluster. That's why they get separated when they do their surges in the beginning, and after that they're okay. I don't know if I've clarified that.
C
For this kind of Trimaran 3, I think the correct procedure is: with a normal default scheduler with the default config, you load up the cluster with specific requests versus limits, and maybe put some CPU-intensive workloads there, and you can notice some OOMs or CPU throttling; and then with Trimaran 3 you can see this symptom disappear in a very consistent way. That could be a very good evaluation procedure to prove this plugin helps a lot, and it would be super convincing to others. But we do...
C
We are lacking this kind of simulation tooling, I think. Internally I'm building some kind of simulation to sort of fill the gap, but yeah, in some time maybe I can share it.
H
This is a really valid example, and we're seeing this problem a lot: not overloading the node, but, okay, I dedicated a very huge node for the Java workload, but after the startup time I don't need it; I don't need this huge node. So this would really benefit that scenario, and I believe we can use this example as a validation.
A
All right, thank you, guys; this is really interesting. I love the direction of this project and the fact that there is traction, and I was just trying to play devil's advocate and poke at the methodology. I'm really interested to see this deployed at some point in an actual, real-life cluster, and I'm wondering if this is something you are planning to deploy at IBM; I don't know if IBM actually uses Kubernetes in some form.
C
Okay, I think next time maybe you, Asser, or Chen, or Abdul from PayPal can demonstrate this kind of Trimaran end-to-end workflow: which metrics provider you choose, and then, with versus without Trimaran, how the cluster looks. That could be a super interesting demo.