From YouTube: Kubernetes SIG Scheduling meeting - 2018-08-02
A: So what I had found before was that we were seeing not much improvement when enabling the equivalence cache, but it turns out that our scripts that run the benchmarks had been changed recently, and the changes caused some of our environment variables to not get passed correctly to the benchmarks; as a result, the benchmark results were not accurate. I believe they reverted that change, and now those results should be different. I haven't tried them myself, but I believe it is now fixed and we should see major improvement.
A: We were not seeing any improvement with the equivalence cache; actually, we were seeing performance degradation when we enabled it, which didn't make sense at all, and it turned out to be a problem with the logging. But anyway, if there are more areas where we see that logging is causing performance degradation, it would be awesome if you guys could address those, yes.
A: Sounds great, thank you so much. So, gang scheduling. Klaus is not here, yeah. We had a meeting last week with Klaus and Connor Doyle from Intel to discuss further how the API should work. I think we are close to reaching an agreement on the API. We are thinking about having an API object called PodGroup. A pod group essentially tells the API server how many members should exist in a gang, and, in general, what should be done when the number of members of a gang drops below the provided number. For example, if the gang was supposed to have ten members and the number of pods drops below ten, should the whole gang get killed? Should we tolerate the lower number?
A: Should we wait for that value to get back to ten? For example, if there are controllers that create pods, should we wait for those controllers to bring the number back to ten for a while, and if that doesn't happen, then maybe kill the gang? There were some discussions about that, so we are going to have a policy in the PodGroup object that specifies what should be done in these cases.
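The kind of PodGroup policy being discussed can be sketched roughly as follows. This is purely illustrative: the API was still under design at the time of this meeting, so every name here (the `min_member` field, the three degraded-gang policies, the wait period) is a hypothetical stand-in, not the agreed API.

```python
from dataclasses import dataclass

# Hypothetical policy names; the real API was still being designed.
KILL_GANG = "KillGang"           # kill the whole gang when it shrinks
TOLERATE = "Tolerate"            # accept running below min_member
WAIT_THEN_KILL = "WaitThenKill"  # give controllers time to recreate pods first

@dataclass
class PodGroup:
    name: str
    min_member: int       # how many members must exist in the gang
    degraded_policy: str  # what to do when membership drops below min_member
    wait_seconds: int = 0 # grace period used by WAIT_THEN_KILL

def action_for(group: PodGroup, running: int, waited: int = 0) -> str:
    """Decide what a gang controller should do for the current member count."""
    if running >= group.min_member:
        return "run"
    if group.degraded_policy == KILL_GANG:
        return "kill"
    if group.degraded_policy == TOLERATE:
        return "run"
    # WAIT_THEN_KILL: wait for controllers to bring the count back up,
    # and kill the gang only after the grace period has elapsed.
    return "wait" if waited < group.wait_seconds else "kill"

pg = PodGroup("training-job", min_member=10,
              degraded_policy=WAIT_THEN_KILL, wait_seconds=300)
```

With this sketch, a gang that drops from ten to nine members first enters a waiting state, and is killed only if the controllers don't restore the count within the grace period.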
A
Another
topic
that
we
discussed
was
to
have
another
controller,
that
that
controls
the
life
cycle
of
a
gang
or
having
one
of
the
existing
control
and
so
forth.
The
idea-
and
our
conclusion
is
that
we
need
another
controller
and
another
Kotori
that
is
specific
for
controlling
the
life
cycles.
Again
anyway,
you
know
that
the
PR
is
out
it's
linked
to
our
document,
which
is
in
the
invitation,
link
an
invitation
and
email.
A: There is this image locality work, which is on track, and then, oh yes, another important item that we have been working on, which is about designing a new feature in Kubernetes: pod scheduling policies, similar in spirit to pod security policies. There has been a PR out there for the design. It has gone through several cycles of changes so far, and there are still many open comments on the PR; we would like those to be addressed as soon as possible.
A: [Inaudible] has been working mostly on that, as far as I can tell, and I don't know if there are any further updates. I actually managed to review the PR, and I'm going to meet with Tim later today to further discuss the details. We are hoping that we can finalize the design before 1.12, but that's just going to be the design; we're not going to have the feature in 1.12. [Inaudible] is not here today, so I guess...
C
The
updates
are
basically
what
you
said:
they've
been
a
recent
iteration
I've
only
started
to
go
through
that
I
saw
you
left
some
comments.
Tim
left
some
comments.
I'll
probably
try
to
respond
to
a
bunch
of
those
comments
today
or
tomorrow.
All
right
thanks,
otherwise
yeah
same
kind
of
going
through
and
trying
to
figure
it
out.
Okay,.
A: Sounds good, thanks. All right, back to Wei again: how are things going with respect to daemonsets being scheduled by the default scheduler?
D: I'm working on adding some integration tests, especially adding a case to test the new behavior, since it actually is a behavior change, right. And I also submitted an issue to kubeadm, because they use daemonsets to provision the cluster in the very beginning phase, and this feature, scheduling daemonset pods by the default scheduler, kind of changes their behavior a little bit, so I wanted to notify them as early as possible so they can test.
D: They said it's a change, and they will add some new test cases, or somebody will review it on their side. So, simply put, I think the major point is that all the internal behavior is consistent: how we do predicates and priorities is the same. But we need to compare against the behavior of the daemonset controller when it schedules daemonset pods itself. One major difference is that one of them respects taints and the other doesn't, so that will cause some behavior changes, for example for a kubeadm cluster.
D
The
master
node
has
specific
change
so
that
some
like
keep
proxy
or
katakana
we're
not
deploying
on
the
master
node
all
right
in
the.
If
we
enable
the
schedule
demonstrated
by
the
default
schedule,
there
will
be
a
king
being
part
on
to
the
master
nel
before
he
don't
change.
The
you
know
said
manifest
so.
A: That's what we do, actually. I mean, Klaus made some changes to the daemonset controller as part of this feature. Basically, if daemonsets are scheduled by the default scheduler, the daemonset controller only creates the pods, and as part of the creation it also sets node affinity on them so that these pods are scheduled onto specific nodes. Yes, and if the daemonset controller does not create any daemonset pod for the master node, we shouldn't see any pending pods for the master, but...
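The mechanism described above, the daemonset controller creating pods and pinning each one to its target node with node affinity, can be sketched like this. The affinity keys follow the Kubernetes pod-spec shape, but the helper itself is an illustration, not the actual controller code:

```python
def daemonset_pod_for_node(pod_template: dict, node_name: str) -> dict:
    """Return a pod spec pinned to node_name, mimicking what the daemonset
    controller does when daemonset pods are scheduled by the default
    scheduler: it creates the pod, and required node affinity on the node's
    name leaves the scheduler no choice but that node."""
    pod = dict(pod_template)  # shallow copy; don't mutate the template
    pod["affinity"] = {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchFields": [{
                        "key": "metadata.name",
                        "operator": "In",
                        "values": [node_name],
                    }]
                }]
            }
        }
    }
    return pod
```

The binding decision (which node) stays with the daemonset controller; only the act of scheduling moves to the default scheduler, which is what keeps predicates and priorities consistent with every other pod.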
A: And also, you mentioned kubeadm; you're right, that's actually very important. Bootstrapping the cluster is important. We must make sure that our current scripts that bring up a cluster, like kube-up, all consider the scheduler one of the critical components and bring it up as part of cluster bootstrapping. We should make sure that kubeadm does the same thing and brings up the scheduler as one of the critical or essential components of the cluster during bootstrapping.
A
Otherwise,
demons
are
paths
that
are
created
and
a
bootstrapping
process
won't
be
scheduled.
That
shouldn't
be
a
fundamental
change.
In
my
opinion,
unless
cubed
cubed
iam
is
making
certain
assumptions,
I
hope
they
don't
but
you're
right.
That's
an
important,
that's
an
important
point
to
consider
and
we
must
make
sure
that
cube
idiom
works
and
I.
B: I do not know how kubeadm internally works, but the way it probably works is: it creates daemonsets for some of the pods, and when a pod is actually getting created on the node, since the taint is not respected by the daemonset controller that created it, the pod goes into the pending state.
A: Anyway, moving on: there was this other feature, or lightweight performance-improvement technique, that I added to the scheduler, to check fewer nodes of the cluster for feasibility. Basically, once we reach a point where a certain number of nodes have been found feasible, we stop and just score the ones that were found, instead of going on and checking all the nodes in the cluster. The PR is, I guess, almost ready; thanks to Klaus and Ravi for reviewing.
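The technique can be sketched as follows; the percentage cutoff is a hypothetical placeholder for whatever threshold the PR settles on:

```python
def find_feasible_nodes(nodes, pod, fits, percentage_to_find=50):
    """Check nodes for feasibility, but stop early once enough have been
    found, instead of running predicates against every node in the cluster.
    Only the nodes returned here go on to the scoring phase."""
    needed = max(1, len(nodes) * percentage_to_find // 100)
    feasible = []
    for node in nodes:
        if fits(pod, node):
            feasible.append(node)
            if len(feasible) >= needed:
                break  # enough candidates; skip the rest of the cluster
    return feasible
```

On a large cluster this roughly halves (with a 50% cutoff) both the predicate pass and the scoring pass, at the cost of possibly missing the globally best-scoring node, which is why it is a tunable performance trade-off rather than a default behavior change.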
A
If
you
guys
have
any
more
comments,
please
go
ahead
and
leave
those
comments.
Otherwise,
I
guess
we
are
closer
to
finishing
that
that's
gonna
be
an
alpha
feature
in
112
and
probably
we
can
start
enabling
getting
away
that
basically
we're
they're,
gonna
score,
start
scoring
fewer
notes
in
113
as
a
way
to
improve
performance
of
the
scheduler
hi.
A: Image locality is a priority function; it tries to put pods on the nodes that already have the images for the pod. Basically, it's not like a predicate; it's a priority function, a best-effort thing. It considers other metrics as well, but if a node, let's say, has more or less equal parameters and similar distribution and everything, but also has all of the images that the pod requires, or some of the images, then that node is preferred, yeah.
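As a priority function, image locality only raises the score of nodes that already hold the pod's images; it never filters a node out. A rough sketch of the idea (the real scorer is more elaborate, for example weighing image sizes; this simplified version just counts matches):

```python
def image_locality_score(pod_images, node_images, max_score=10):
    """Score a node higher the more of the pod's images it already has
    pulled. Best effort: this is one priority among several, never a
    hard filter, so a node with no images can still win on other scores."""
    if not pod_images:
        return 0
    present = sum(1 for img in pod_images if img in node_images)
    return max_score * present // len(pod_images)
```

A node with every required image gets the maximum score, one with half of them gets half, and the result is then combined with the other priority functions before the final node is picked.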
E: The reason I brought that up was because we recently invited the Berkeley researchers to a workshop at Huawei, and one of the researchers said he's trying to incubate a project called dependency scheduling that does a similar kind of thing: optimizing scheduling based on Docker images. He was saying that he's working with you guys to do the incubation steps, or maybe it's something else; I'll send you some details.
E: And then the other comment was that, while we were doing very detailed testing, we ran into the taints issue. I didn't want to bring it up... I mean, for the person who was talking about the taints issue earlier: the taints, yeah. So we ran into a couple of issues, actually. Everything works fine, but when we do the end-to-end testing, there are two cases, taints and tolerations, that fail, and for some reason there's some kind of synchronization or timing issue.
E: No, no, this isn't specific to us; we're not really sure. We thought maybe it's the same issue with the default scheduler as well. We've spent a lot of cycles and we can't figure it out, and the cases keep failing in end-to-end; but otherwise, when we do testing in our local environment, they all pass.
E: Anyway, we'll go ahead and create an issue. So, just to give an update: remember last time, when we left off, you specifically mentioned the pod anti-affinity symmetry thing and all that. That took us some time, actually; it almost threw us off. So we put the logic in. The whole value proposition of Firmament is that you amortize work. Essentially, in Firmament everything is a job: a ReplicaSet is a job, a Deployment is a job, and so on.
E: So if two jobs have the same CPU and memory resource requirements, they get grouped together, and that's how you amortize work and see the better throughput. But in order to support symmetry, we couldn't combine the equivalence classes, because we need to check the labels to see whether an incoming pod has any conflict with the existing running pods. So we ran into an issue, and then we were able to isolate it. Essentially, our assumption is that there are not going to be a lot of these kinds of conflicts, so we isolate them upfront. If a pod has a conflict, we create a separate equivalence class for that particular job; but the normal pods, if they don't have any conflict, we are able to group together.
E: So, keeping that design in mind: for the normal pods, which do not have any conflict with the existing running pods under pod anti-affinity symmetry, we still see the same performance benefits, you know, 70x... sorry, 20x, 30x and all that. And then, in the testing we are doing, we sprinkle in 10 to 20 percent of pods that have a conflict, incoming pods which conflict under anti-affinity symmetry.
E: So obviously those pods will see some performance degradation, but the normal pods, which don't have any conflict, perform exactly the way we were seeing before. So essentially we have the whole functionality in there. And then node affinity, which I mentioned last time as well: we cannot group those pods either, because if a pod has node affinity we have to check it against the nodes and all that, so we cannot really see the amortization there.
E: The normal pods, which I am assuming are going to be 70 to 80 percent of the cases, still see the same benefit, because if the resource requirements for the jobs are the same, it's very straightforward, actually: instead of doing filtering and scoring for all 20 pods in a job, we do it just once. That's how you see the performance, basically.
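The amortization being described, running filtering and scoring once per group of pods with identical resource requirements instead of once per pod, can be sketched like this; the grouping key and the single-pass shortcut assume the pods in a group have no (anti-)affinity conflicts, exactly the simplification discussed above:

```python
from collections import defaultdict

def schedule_batch(pods, nodes, filter_and_score):
    """Group pods by (cpu, mem) request and run the filter/score pass once
    per group, amortizing the work across all pods with identical
    requirements. `calls` counts how many scheduling passes actually ran."""
    groups = defaultdict(list)
    for pod in pods:
        groups[(pod["cpu"], pod["mem"])].append(pod)
    calls = 0
    placements = {}
    for key, members in groups.items():
        best_node = filter_and_score(key, nodes)  # one pass for the group
        calls += 1
        for pod in members:
            placements[pod["name"]] = best_node
    return placements, calls
```

For a job of 20 identical pods this does one scheduling pass instead of 20, which is where the 20x to 30x throughput figures quoted above come from; a pod with a label conflict would need its own group, and only it loses the amortization.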
E: What I meant is this: let's say 20 jobs are coming in, and they all have unique labels, and one of the replica sets has a conflict with a running pod. Only that replica set's performance would be impacted; as for the remaining nineteen, if they have the same conflict, obviously they will be impacted as well, but if they don't, the performance would be exactly the same.
E: I'm saying the normal pods, which do not have any conflict, still see the same, you know, the 20x or 30x, depending on the resource requirements. We're trying to stress that as well: with the CPU and memory combination being the same, we can amortize, which is very good. And the other thing is, actually, there are a lot of optimization opportunities. The solver which we are using is open source, the cs2 solver, and then there's the solver which these guys, Malte and team, developed themselves.
E
That's
even
they're
saying
this
when
ten
times
better.
But
the
problem
is
that's
not
open
source
mm-hm.
So
if
we
can
somehow
get
that
solver
or
so
one
of
the
problem
with
this
approach
is
the
number
of
Arc's
the
increase.
The
throughput
goes
down
basically
C
and
the
current
solver
doesn't
handle
that
properly
the
one,
the
open
source
one
thought
they
was
developed
by
Microsoft
ICS
to
solve,
but
if
we
can
get
relaxation
based
solver,
the
performance
will
even
get
better
licensed.