From YouTube: Kubernetes Resource Management WG 20170711
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
C
So, as we know, resource efficiency is a concern when we have primary and best-effort jobs which need to be co-located on the same machine: how can we best utilize the resources by sharing them across both categories of applications without affecting the SLAs of the primary applications? We have tried to address this concern by introducing some logic within different components of the Kubernetes system. This is the overall architecture diagram that we have.
C
We have the slave node where the kubelet runs, and we have the master node where the scheduler and API server run. We introduced two new components within the kubelet. Currently they are part of the kubelet code; they are essentially goroutines, and one of them is called the resource estimator. The purpose of the estimator is to report back what the available and reclaimable resources are.
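A minimal sketch of the estimator loop described here: a goroutine that periodically samples node metrics and reports how much capacity is reclaimable. All names (Estimate, sampleAndEstimate, runEstimator) are invented for illustration; this is not the POC's actual code, and the metrics pipeline is stubbed out.

```go
package main

import (
	"fmt"
	"time"
)

// Estimate is the estimator's report of borrowable capacity on this node.
type Estimate struct {
	ReclaimableCPUMillis int64
	ReclaimableMemBytes  int64
}

// sampleAndEstimate stands in for the real metrics pipeline (e.g. cAdvisor
// stats plus the smoothing/forecasting mentioned later in the talk).
func sampleAndEstimate() Estimate {
	return Estimate{ReclaimableCPUMillis: 1750, ReclaimableMemBytes: 2 << 30}
}

// runEstimator periodically reports estimates to whoever publishes them
// (e.g. a component that updates node status for the scheduler to read).
func runEstimator(interval time.Duration, out chan<- Estimate, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			out <- sampleAndEstimate()
		case <-stop:
			return
		}
	}
}

func main() {
	out := make(chan Estimate)
	stop := make(chan struct{})
	go runEstimator(1*time.Second, out, stop)
	fmt.Printf("%+v\n", <-out) // one sample report
	close(stop)
}
```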
C
What this will achieve is that a scheduler which tries to schedule a best-effort pod that would like to use the reclaimable resources will be able to make a better decision about whether to place that pod, which can accept reclaimable resources, on a particular node or not. So the scheduler will have this information about all the different types of resources on node X, and then it will try to schedule a best-effort pod onto the reclaimable resources when necessary. A best-effort pod is a pod which does not require any guarantee to run.
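A hedged sketch of the scheduling decision just described: place a best-effort pod on a node only if the pod opted in to reclaimable resources and the node's estimator reports enough of them. The field and function names here are assumptions, not the POC's API.

```go
package main

import "fmt"

type Pod struct {
	Name               string
	AcceptsReclaimable bool  // opt-in flag (see the API-change discussion later)
	EstimatedCPU       int64 // rough CPU the pod is expected to need (millicores)
}

type Node struct {
	Name           string
	ReclaimableCPU int64 // reported by the node's resource estimator
}

// fitsOnReclaimable is the predicate described in the talk: the scheduler
// only counts reclaimable capacity for pods that agreed to run on it.
func fitsOnReclaimable(p Pod, n Node) bool {
	return p.AcceptsReclaimable && n.ReclaimableCPU >= p.EstimatedCPU
}

func main() {
	p := Pod{Name: "batch-job", AcceptsReclaimable: true, EstimatedCPU: 500}
	n := Node{Name: "node-x", ReclaimableCPU: 1200}
	fmt.Println(fitsOnReclaimable(p, n)) // true
}
```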
A
Again, I want to make sure we have the same terminology when we're talking about best-effort. Traditionally, when I hear best-effort pod, I think of a pod that makes no resource request or resource limit. It sounded like you said you also included a pod that has no limit but may have a request; am I mistaken?
C
No.
C
So it will try to kill a best-effort pod, or it will try to freeze it. I will explain why we need a freeze action, but in the case of memory, if you want to release the resources, we typically kill a best-effort pod so that the memory is given back to the regular pods.
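One plausible implementation of the "freeze" action mentioned here, assuming a cgroup v1 freezer hierarchy; the talk does not show the POC's actual mechanism, and the cgroup path below is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const freezerRoot = "/sys/fs/cgroup/freezer" // cgroup v1 freezer mount

// setFrozen writes FROZEN or THAWED to a pod-level freezer cgroup, pausing
// or resuming every process in the pod without killing it. Freezing frees
// no memory, which is why memory pressure is handled by killing instead.
func setFrozen(podCgroup string, freeze bool) error {
	state := "THAWED"
	if freeze {
		state = "FROZEN"
	}
	path := filepath.Join(freezerRoot, podCgroup, "freezer.state")
	return os.WriteFile(path, []byte(state), 0644)
}

func main() {
	// Freeze a hypothetical best-effort pod's cgroup, e.g. when a primary
	// pod's usage spikes; a corrective action later thaws it.
	if err := setFrozen("kubepods/besteffort/pod1234", true); err != nil {
		fmt.Fprintln(os.Stderr, "freeze failed:", err)
	}
}
```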
So this is the overall interaction diagram that we have; I will go into the details of each of these components in the later slides. And this is the scheduling example.
C
Say we have pod 1 and pod 2; what we have is typically three sections here. The used resources is, say, what a guaranteed pod requested: some X amount of CPU. It might not be using the whole X amount of CPU; it might be using X/2 or X/3 of it. We also have some headroom, which we use as a buffer, and whatever is left is what we call reclaimable.
C
This reclaimable resource can typically be shared with some other best-effort pod to run, and similarly there is another pod, which will again have a similar structure. If you see the blue section highlighted here, this is the total reclaimable resource available across pod 1 and pod 2, which can be used, or shared, by some other best-effort pod. Okay, so.
C
So if a best-effort pod, say pod 3, wants to be scheduled and it is okay to use these reclaimable resources, it gets scheduled onto this node, and state C shows that this particular block is now replaced by pod 3, which has some headroom for itself, and this was the original available resource that we had. So this is the kind of change in resource allocation.
C
That will happen when a best-effort pod is okay to accept reclaimable resources. The next piece I talked about is pod eviction. Say the usage of pod 2 increases, and we decide that whatever resources were taken from pod 2 have to be given back. In that case, this whole pod 3 will have to be evicted or taken out so that those resources are given back to pod 2.
A
In the kubelet we already have some support for evicting pods when memory, when available memory, runs scarce. Is there a reason why we couldn't make the kubelet itself just more intelligent about how often or when it makes that eviction decision, versus having this external QoS controller do it?
C
Our work was actually inspired by the Heracles paper from Christos Kozyrakis, and also by a project called Mesos Serenity. In Mesos they have a project called Serenity where they developed this controller-pipeline kind of thing, which I will describe later, where you have pluggable controllers for different types of resources, and each controller will monitor that particular resource and then try to take some action.
F
Okay, I can also speak to that. We actually developed Serenity and got the APIs into Mesos, which is where the concepts derive from. The controllers are very similar to the eviction manager, except that there's more than one: there are controllers for each resource vector that you want to protect, for example for cache, for power, for networking. Those are all things we could probably eventually bring into the kubelet.
F
You're measuring headroom and egress for the different classes of containers, you know, guaranteed versus best effort. The resource estimator basically was a soft measure for how many best-effort pods could be allowed onto the machine. So I guess that number could also be estimated by the eviction manager, right? You can reduce it to zero and say no more best effort; that's fine, through the node conditions, yeah.
G
But the thing is, you never really know how much they're going to use; that's the challenge. And so one of the things we've talked about in the past, that we unfortunately haven't implemented yet, is that the scheduler would be aware of usage when doing best-effort scheduling for best-effort pods.
A
The other comment I would have here: it also reminds me of something we discussed, either earlier or somewhere up the stack, where you could reserve resources across QoS classes, and I'm just wondering if this is essentially solving a similar need. But yeah, I kind of like this, where it appears to be: how do I move a best-effort pod to the node that is least utilized first, by having that utilization-aware knowledge when doing the scheduling, instead of just a less careful selector.
B
Just a quick comment: this POC effort we did was around six or seven months ago, and after that there's been a lot of work done. So one of the things we wanted to find out as well, reaching out to you folks and the scheduling SIG folks: where is the overlap, and what exactly can we do?
B
I mean, as far as the work which we have done: how do we work side by side with the eviction manager, the estimation manager, and the components which you folks have developed? So just to give you a summary as well, this work was done about six to seven months ago, and I know a lot of work has gone on since. That's why we wanted to sync up and adjust.
G
Let me also say, in terms of terminology: we use "preemption" to refer to actions where we evict a pod for reasons other than resource starvation across the node. So for the actions here, "eviction" is probably the better term, maybe to avoid confusion, since I can hear you using "preemption".
D
Another note about the original paper and also the Serenity work: eviction looks a lot like the eviction manager, but really that was kind of the most drastic action; the term we used before was "corrections". The idea was, you know, if you realize, through the high-priority application's performance metrics, that the application you care about is suffering, then you correct it, and the worst thing that you can do is kick the work off the box.
H
And I will say that in the CPU pinning stuff we're looking at adding, we're hitting the same kind of issue, right? Basically, we can allocate all the CPUs on the box to guaranteed pods, and then we actually can't satisfy best-effort pods, and we need some way to signal to the scheduler to say: hey, don't schedule best-effort pods to us. So I think there's kind of a generic need for some signal, you know, a CPU pressure or a best-effort pressure or something, some node condition you can apply.
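A hedged sketch of the kind of signal the speaker is suggesting, shaped like a Kubernetes node condition. "BestEffortPressure" is a hypothetical condition type invented for this example; the real API has conditions like MemoryPressure and DiskPressure, not this one.

```go
package main

import (
	"fmt"
	"time"
)

type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

// NodeCondition mirrors the shape of k8s.io/api/core/v1.NodeCondition.
type NodeCondition struct {
	Type               string
	Status             ConditionStatus
	LastTransitionTime time.Time
	Reason             string
	Message            string
}

func main() {
	cond := NodeCondition{
		Type:               "BestEffortPressure", // hypothetical condition type
		Status:             ConditionTrue,
		LastTransitionTime: time.Now(),
		Reason:             "AllCPUsPinned",
		Message:            "all allocatable CPUs are pinned to guaranteed pods",
	}
	// The scheduler would skip nodes reporting this condition when placing
	// best-effort pods, analogous to how MemoryPressure is handled today.
	fmt.Printf("%+v\n", cond)
}
```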
B
So, very much as was mentioned, you know, we used the whole pipeline design. The pipeline controllers were the network controller and the memory controller; we didn't really do the CPU controller, but that's one other thing we could do as well. Yeah, you can keep going.
C
Okay, so the API object changes that would be required in order to achieve this. The first is: say an application doesn't want to share its reclaimable resources with any other pod; it would have the capability to say so. By introducing a new flag, say "offer reclaimable resources: false", it would indicate that that particular application is not willing to share its reclaimable resources, so the scheduler would basically not try to schedule any best-effort pods by taking away resources from that particular deployment or application. I think when we presented this to the scheduling SIG, one point that was brought up was that we could not simply take away the reclaimable resources from a primary pod just like that; it should be based on some priority or some logic. So we thought we could introduce this additional field, where the user could provide some option to say whether sharing is okay or not. So this is the first API change.
C
The point that was brought up was that some applications have a very, very strict latency requirement, maybe two seconds or four seconds, so it would be very difficult to monitor the resources, predict, estimate, schedule, and all that within that time frame. Some applications would really not be willing to share resources for even a fraction of a second; in those scenarios, those applications would not be happy to share resources. So that was a discussion that went along in that sense.
C
The second API change would be: if a best-effort pod wants to accept reclaimable resources, it would set a flag accordingly in its pod spec. Ideally, as we discussed, it would not specify any requests or limits, but it would simply say that it is okay to accept reclaimable resources. So the scheduler...
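A hedged sketch of the two proposed pod-spec fields, written as Kubernetes-style API types. The field names (OfferReclaimableResources, AcceptReclaimableResources) are illustrative guesses at the POC's flags, not real Kubernetes API fields.

```go
package main

import "fmt"

type PodSpec struct {
	// First API change: a primary pod can refuse to lend out the slack
	// between its requests and its actual usage.
	OfferReclaimableResources *bool `json:"offerReclaimableResources,omitempty"`

	// Second API change: a best-effort pod (no requests/limits) can opt in
	// to being scheduled onto reclaimable capacity, knowing it may be
	// evicted or frozen when the primary pod needs the resources back.
	AcceptReclaimableResources *bool `json:"acceptReclaimableResources,omitempty"`
}

func main() {
	f := false
	// A latency-sensitive primary pod opts out of sharing entirely.
	latencySensitive := PodSpec{OfferReclaimableResources: &f}
	fmt.Printf("offer: %v\n", *latencySensitive.OfferReclaimableResources)
}
```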
C
So this is the resource estimator; this is just a flowchart of how it operates. It initializes with some configuration, then does the metrics acquisition, acquiring metrics using cAdvisor. It has some smoothing and forecasting methods, and it tries to calculate basically what reclaimable resources are available on this node. Ideally, it would be reporting, for that particular node, the amount of resources like CPU and memory that can be reclaimed.
C
This can also be further improved by introducing some kind of confidence level, or for how much time we are really sure that these resources can be reclaimed, but we did not include that in our first pass of the implementation. We just offered a static, point-in-time calculation of the amount of reclaimable resources on a particular node. Ideally, though, this estimation could be further improved with additional parameters.
C
You take the total allocatable, subtract what is actually used, including by the best-effort apps, consider some buffer for the headroom, and then subtract the resources requested by the best-effort apps. This slide is just a pictorial representation of that.
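The calculation just described, as a minimal sketch: reclaimable = allocatable - used - headroom - best-effort requested. The function name and the clamp-to-zero policy are assumptions.

```go
package main

import "fmt"

// reclaimable returns the CPU (millicores) a node could still lend to
// additional best-effort pods, per the formula in the talk.
func reclaimable(allocatable, used, headroom, bestEffortRequested int64) int64 {
	r := allocatable - used - headroom - bestEffortRequested
	if r < 0 {
		return 0 // never report negative reclaimable capacity
	}
	return r
}

func main() {
	// 4 CPUs allocatable, 1.5 in use, 0.5 kept as headroom,
	// 0.25 already promised to best-effort pods on this node.
	fmt.Printf("reclaimable: %dm\n", reclaimable(4000, 1500, 500, 250)) // 1750m
}
```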
C
The oversubscription controller, which sits on the master: its job is basically to taint a particular node based on the amount of excess, or reclaimable, resources. Say, if the amount of reclaimable resources is greater than some threshold, which can be configured as a startup parameter or something, then we enable the feature on the node.
C
The scheduler will then be able to schedule best-effort pods on the node based on this taint, if it says the feature is enabled; and if the reclaimable amount is less than the threshold, it will do the opposite: it will not allow any more best-effort pods to be fitted onto that node. Enabling and disabling the feature can be done by applying taints, and the threshold can be passed as a parameter. Does that help clarify?
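A hedged sketch of the oversubscription controller's threshold logic. The taint key is hypothetical, and whether the POC tainted nodes when the feature was enabled or disabled is an assumption; this version follows standard Kubernetes taint semantics, where a NoSchedule taint repels pods that lack a matching toleration (here, the opt-in best-effort pods).

```go
package main

import "fmt"

type Taint struct {
	Key, Value, Effect string
}

type Node struct {
	Name           string
	ReclaimableCPU int64 // from the node's resource estimator
	Taints         []Taint
}

// reconcile disables oversubscription on nodes whose reclaimable capacity
// has dropped below the configured threshold, and enables it otherwise.
func reconcile(n *Node, thresholdMillis int64) {
	const key = "example.com/reclaimable-capacity-low" // hypothetical key
	n.Taints = nil
	if n.ReclaimableCPU < thresholdMillis {
		// Below threshold: repel further best-effort pods from this node.
		n.Taints = append(n.Taints, Taint{Key: key, Value: "true", Effect: "NoSchedule"})
	}
}

func main() {
	n := Node{Name: "node-x", ReclaimableCPU: 750}
	reconcile(&n, 1000) // threshold passed as a startup parameter
	fmt.Printf("%+v\n", n.Taints)
}
```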
F
Sometimes the utilization does not have a correlation with how much a workload is hurting, so it's a really complicated topic, and the computation we had before in terms of reclaimable resources was only in terms of CPU time. One of the conclusions from that work is really that it's hard to gauge: you have to, you know, look at the performance, and it's almost like you open the door a bit, close it a bit, to allow work to come in, yeah.
A
That's the thing I was mostly addressing; I wanted to finish the point here, which is that the realistic difference between a burstable pod with a very low request and a best-effort pod is like a mirage: they're very similar. And so I've been just sitting here thinking about whether trying to do this with the QoS or pod class alone is really people's problem. When I think about the types of workloads that I know our users are scheduling onto, say, the OpenShift product, more often than not people are actually setting requests and limits; it's just that the gap between those two vectors is very wide, like there's a very huge possible constraint. It could be argued that their request is just so unrealistically low that it's almost by nature best effort, but we don't actually allow it to be treated as best effort.
C
When we did this work, it was part of a technology project actually, and we were also trying to see how we can schedule short-running jobs, which require a very small amount of execution time, and we were experimenting to see whether we could get good resource utilization by sharing the resources. So we primarily targeted memory, and we also did some work on network bandwidth.
B
The scope of this effort was more academic, in the sense that, you know, we wanted to use the Heracles ideas and understand how we could implement a similar thing with Kubernetes, like the Mesos folks did when they tapped into the Serenity project. So from that perspective, yeah, you can see that we never even built the CPU controller; the focus was more on memory and network at that time.
B
So essentially we built the whole design of the resource reclamation thing, and obviously it's a POC effort; it's not something in production at this stage. So we wanted to reach out to the community and see how we go forward: can we discuss this POC as a baseline and build upon it, or is there some overlap already in the community so that we can work together?
C
Okay, yeah. So the next slide talks about the continuous controller pipeline where, as I mentioned, we have a bunch of controllers. What each controller essentially does is build an action list. An action would typically say: I want to kill a pod, freeze a pod, or unfreeze a pod. Every controller monitors the usage of the primary pods and then builds the list of actions, that is, which secondary or best-effort pods are to be killed or frozen. A networking controller, for example, would typically inspect the network bandwidth usage of a primary pod and then try to freeze a best-effort pod if required. We did not do the shared-resource (CPU cache) controller. An SLA controller was again implemented with a beta custom probe agent, which would basically monitor the latency of the application and then take some actions based on that. The action executor was the place where all the actions get executed, like actually killing a pod or signaling and freezing a pod. And a corrective action was: if we later observe that the primary pods are no longer at their peak, we can go ahead and undo whatever we did; for a pod we had frozen in the case of network bandwidth, we unfreeze it so that it can start functioning and consuming network bandwidth as it was doing previously, before getting frozen. So this is the overall controller pipeline, and the comments from the scheduler SIG I have just noted down here.
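A hedged sketch of the Serenity-style pipeline just described: pluggable per-resource controllers each emit a list of actions, and a single executor applies them. The interface and type names are assumptions, not the POC's code.

```go
package main

import "fmt"

type ActionKind int

const (
	KillPod ActionKind = iota
	FreezePod
	UnfreezePod
)

type Action struct {
	Kind ActionKind
	Pod  string // best-effort pod the action targets
}

// Controller is one stage of the pipeline (memory, network, SLA, ...).
// It watches its resource vector on the primary pods and proposes actions.
type Controller interface {
	Name() string
	Plan() []Action
}

// networkController freezes best-effort pods when a primary pod's
// bandwidth usage crosses a threshold (monitoring stubbed out here).
type networkController struct{}

func (networkController) Name() string { return "network" }
func (networkController) Plan() []Action {
	return []Action{{Kind: FreezePod, Pod: "be-pod-1"}}
}

func main() {
	pipeline := []Controller{networkController{}}
	for _, c := range pipeline {
		for _, a := range c.Plan() {
			// The action executor would actually kill/freeze/unfreeze here.
			fmt.Printf("[%s] action %d on %s\n", c.Name(), a.Kind, a.Pod)
		}
	}
}
```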
C
One question was that the kubelet should not contain the QoS controller code; it should be deployed as a DaemonSet pod so that we keep changes to the kubelet minimal. This was one recommendation we got. The second, which we already talked about, was that we cannot simply take away resources from primary jobs. The third one was why a CPU controller was not included: because CPU is managed by the external component, cgroups, and by setting CPU shares,
Kubernetes is able to arrange for the CPU resources, so we did not handle a CPU controller that much. But we had to do some work on core pinning, which was not completed at that time. We were also asked for some statistics from before and after applying the QoS controller. If you see on the right-hand side, we have a graph with respect to CPU and with respect to memory.
C
This chart compares average CPU and memory utilization between what we call our internal orchestration and scheduling system and open-source Kubernetes, when they run with full load. We did this with four different combinations of workloads, mixing batch data jobs and enterprise apps, and the observed average CPU utilization with open-source Kubernetes was 25%; with our resource oversubscription enabled, it increased to over 76%. Similarly, the memory utilization shot up from 31% to 84%.
C
As future work, we wanted to enable the CPU controller; make the network bandwidth controller more mature and improve its functioning; improve the prediction algorithms used for resource usage (we try to predict what the resource usage of a primary pod will be in the next ten or twenty seconds, for performing an eviction or, you know, freezing a secondary pod); and also improve upon the resource estimation techniques that we saw earlier. So yeah, that's it.
B
Rohit, to your question: what exactly is the intent? The intent is, first of all, to find out if there is overlap, if something has already been done; that's one thing we wanted your feedback on. And the second thing is: does it really make sense? I think David mentioned that we could start off with incubation, and then, you know, people start experimenting with it and validate it; maybe the incubation fails, and that's fine.
A
The thing I wonder, when you talk about getting an incubator, is that incubators come with prereqs, right, and I think you've encountered this before with some of the other things that we looked at incubating. So to actually incubate this, what are the implied, or actually unspoken, prereqs? For one, it seems like we have no way of incubating API changes, right?
A
Maybe you could elaborate on how you would deal with that. If we said yes, let's incubate this, what is it that the core project is actually agreeing to support to enable your incubation? I worry that it may be more than we can take right now. But then, on this individual roadmap, there are things that do interest me, and probably interest others as well; like, I think having some prediction algorithm for resource usage, or general resource estimation techniques, would be helpful for improving some existing code.
B
No, no, I mean, that's what this discussion is for. So is that an issue, though? Let's say we just want to pass the whole thing, the whole code base, into incubation; that's what I thought would be the very simplest thing. But let's park it, and then we can keep slicing and dicing and see what we can do, yeah.
G
The way the incubation process works today, incubation is more about saying there is this long-term project which is very valid, this is the right direction, and there is general buy-in from the community on going in that direction. At that point we start an incubation process, make sure the project achieves the features that were written down, and then move it into the core ecosystem.
G
Here it feels like this is basically touching the whole system, and there may be many different intersections with existing components of the system, and we need to talk through all of that in detail. I feel like, instead of approaching it from "hey, here's the code base, this is awesome, let's all make some use of it", let's first talk about the goals and try to prioritize them, and then...
B
Exactly, I think that's exactly what we should do, and that's the reason we sent out the detailed document as well. Maybe we can use that as the base document, and then we can discuss all those kinds of things in the document itself. Or what do you propose: should we have a follow-on meeting and go over it, or what exactly should we do?
G
I'm actually going to say what I feel we are really working towards, and please correct me, everybody else on the call, if I'm misinterpreting our goals. I thought we've prioritized resource management in this order. The first one is having very simple scheduling with basic priority mechanisms and simple oversubscription, the way we have burstable and best effort, and having a really working quota model, so we can have a deterministic system.
G
Then the next level would be providing performance guarantees, or SLOs, on top of that deterministic system. And the third level, or the third priority, would be improving utilization. That would be smarter resource estimation techniques, or basically including overcommit, right: instead of just doing overcommit by request for memory, opportunistically take some more. And nowhere along the way have we prioritized performance guarantees for best-effort pods; we don't even have a viable use case for best effort.
G
Eventually maybe batch workloads might start embracing best effort, but even that is not here yet. And on top of that we are missing some critical features, like doing best-effort scheduling in the scheduler; it just blindly places best-effort pods right now. We are also missing vertical pod autoscaling. Once vertical pod autoscaling is available...
G
As a community we talked about this quite a bit, and it came out of the discussion that the kubelet would be the one managing CPU and memory, including overcommitting it, and even the resource estimation aspects of it could be dealt with by the kubelet itself, because that's necessary for vertical pod autoscaling anyway.
B
So from my perspective, the scope for this work is the resource reclamation thing. I understand what you are saying: you are trying to mitigate that, you know, by doing vertical scaling, vertical pod autoscaling and all that. So is there a place for an incubation for the overarching resource utilization gains using resource reclamation? Because this is exactly what we have done, and we showed the numbers as well.
A
I guess my struggle a little bit is that it's very focused on just running one class of jobs, best-effort jobs, and I can only speak for the users and customers I have insight into, where today best-effort jobs are not really commonplace, and I feel like there'd be a lot of other prerequisites wanted in place before they would become common. We've done some stuff in the last six months around enforcement of node allocatable for CPU and memory that can maybe make the appeal of running best-effort jobs more practical.
A
But then there are still other things that best-effort jobs can cause havoc on, such that the benefits of CPU or memory reclamation are outweighed by the other things they can do to destroy your nodes. So for myself at least, the challenge of making more CPU available for scavenger jobs to complete faster is not a high priority. For me at least; I don't know about others, but that's where I am at this point right now, I guess.
G
But the thing that that doesn't solve is trying to provide performance guarantees for best effort, because at schedule time you don't know how many resources best-effort pods can get; you're just trying to do best-effort scheduling, and then, if you want to guarantee performance for them, that's when the resource estimation would matter. But what Derek is saying, my understanding, is that there is really no clear use case for best-effort pods here.
A
Was it a very particular workload that was running as a best-effort job? Like, what did your best-effort job do: compute prime numbers, or do something more? You know, did it consume disk? What in particular did it do? What class of best-effort job was safe to use and actually saw this benefit?
A
Yeah, I'm sorry, I should have backed that up, but I understood what was being presented far better than what was being discussed, and I kind of echo that sentiment, in that I don't know if just saying yes and creating an incubator is the right answer, because the scope is much larger. I am interested in seeing if we can tease out parts of the solution and maybe grow those without having to take the entire thing, and so we should have a follow-up on that.
B
Yeah, I think these are very good comments and feedback. It would be great if you folks can provide that feedback in the document itself; then we have everything in one place, and we can definitely get back to you folks. And if you want to do a follow-on meeting, we can do that as well, to talk in more detail about certain things.
A
I think the feedback so far was very broad; I think we can get some more specific feedback like we discussed here, which is: what is the workload you're running that actually benefited from the reclamation? Kind of tease that out, because right now it wasn't clear. That said, I do think we should time-box this, and I'm sorry, Nicholas, I know that you had another agenda item and we're 20 minutes over. Is it worth discussing a preview of your topic in the next 10 minutes?
D
Yeah, sure, we could do just a quick taster and then maybe bring it up at next week's meeting. That sounds good. So yeah, I guess the goal was to try to get some consensus on whether people think that it's a problem. So, just backing up: right now we've got a few components that have been implemented, and some others that are planned, that all make policy decisions that relate to NUMA.
D
Specifically, we're looking at the container network interface plugins; the CPU manager, which is making decisions about core pinning; the device manager (I'm not sure if that's exactly what it's going to be called in the POC, but essentially the component inside the kubelet that makes the concrete device bindings); and also the hugepages cgroup controller settings. I guess the TL;DR for describing the problem is that they're all making independent policy decisions about these specific bindings, and there's no centralized way to unify that affinity, so we could end up
D
You
know
straddling
sockets
and
a
bunch
of
really
bad
ways.
That
kind
of
you
know
limits
the
usefulness
of
of
actually
all
those
components
that
are
that
are
trying
to
increase
performance
by
by
assuring
some
sort
of
Numa
affinity.
If
that
makes
sense,
and
so
specifically,
you
know
you
could
be
pinned
to
a
core
and
then
assign
huge
pages
on
another
socket
or
a
nick
on
another
socket
or
you
know,
if
you're
connecting
to
your
you
know
accelerator
Hardware
over
PCIe,
you
want
to
be
in
the
same
socket
that
that
PCI
switch
is
attached.
G
What you are describing is probably a future bug. I'm going to bring this up: one of the reasons we're trying to unify all of this logic underneath the container manager is to avoid that situation. So we should come up with a path, with a practical modular design inside the container manager, that lets us do CPU and memory assignment along with device assignment in a unified fashion.
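A hedged sketch of the unified-affinity idea being discussed: each allocator (CPU manager, device manager, hugepages, NIC) reports which NUMA sockets it could satisfy a container from, and a central policy intersects them so everything lands on the same socket when possible. This is an illustration of the concept, not the eventual kubelet API; all names are assumptions.

```go
package main

import "fmt"

// SocketMask is the set of NUMA socket IDs an allocator can use.
type SocketMask map[int]bool

// HintProvider would be implemented by the CPU manager, device manager, etc.
type HintProvider interface {
	PossibleSockets(container string) SocketMask
}

// intersect finds sockets acceptable to every provider, so a container's
// cores, hugepages, and devices can all land on one socket when possible.
func intersect(masks []SocketMask) SocketMask {
	out := SocketMask{}
	if len(masks) == 0 {
		return out
	}
	for s := range masks[0] {
		ok := true
		for _, m := range masks[1:] {
			if !m[s] {
				ok = false
				break
			}
		}
		if ok {
			out[s] = true
		}
	}
	return out
}

func main() {
	cpu := SocketMask{0: true, 1: true} // CPU manager could pin on either socket
	nic := SocketMask{1: true}          // the fast NIC hangs off socket 1
	fmt.Println(intersect([]SocketMask{cpu, nic})) // map[1:true]
}
```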
F
We could do a small spike to visit it, or take a look at maybe three or four different approaches, but the biggest thing for us, I think, was just to get a feel for people's sense of urgency, which is like starting maybe a feature request, so that, you know, in Kubernetes 1.11 or something, you have NUMA being a line item.
F
I think one aspect of this is also to have this in mind when we design the current components, so we don't shoot ourselves in the foot later on. I think that is maybe why it would be worthwhile just doing a thought experiment on it now, yeah.
G
Right. So I've been thinking for a while: we don't have a kubelet architecture doc describing the different components, which might be good to have. You might want to have one for the container manager that describes what the different modules are, what each is clearly responsible for, and how they interact with each other.
A
Yeah, one thing I was thinking, and I know similar SIGs do this: maybe we can dedicate time just to doing design reviews for the proposals that might be out for 1.8, if not next week then beyond that, and just ensure that maybe we alternate meetings to discuss the designs before talking about these feature issues. I know Federation was doing that at one time, and other SIGs are doing something similar. We could try.