From YouTube: Kubernetes SIG Scheduling - 2018-11-15
A: As you know, this meeting is recorded and will be uploaded to the public Internet, so watch what you're saying, okay guys. So we have a couple of items, given that tomorrow is the code freeze and also the fact that I will be on vacation for a while, basically starting tomorrow; I'm not gonna be around for about 10 days. I guess I have already mentioned that we're not gonna have any meetings next week. It's gonna be holidays here in the US and I'm pretty sure we're not gonna be around, so no meetings.
A: Next week... we're gonna finish up everything that we have, probably today or tomorrow, since tomorrow is gonna be the code freeze. I know that there are a couple of issues which are not super urgent, but there are some issues that we would still like to fix in 1.13. I guess both of the ones that I have in mind are being handled, by the way. One is a crash in the scheduler, which is actually a rare kind of crash, but it happens.
A: It happens occasionally. We don't know exactly why it's happening, but we still need to fix it. It's apparently caused by the fact that some of our pointers are nil and we dereference them, and it causes a crash in the NodeInfo Clone function. So we need to address that and merge it in 1.13. You know, we can merge bugs after code freeze, but it's still better to merge as soon as possible, so that the dot releases will have the patches.
A: I actually didn't carefully check, but it looks like we always create NodeInfo with the NewNodeInfo function, and the NewNodeInfo function always initializes those resources, so I cannot think of a scenario where those resources are nil. But apparently it happens somewhere, and we still need to check. I guess a better approach to the problem is probably to make sure that every time we initialize a NodeInfo, we do it with the NewNodeInfo function, not directly, yeah.
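A minimal Go sketch of the pattern being discussed, assuming a much-simplified NodeInfo; the real type in the scheduler has many more fields, and all names here are illustrative stand-ins, not the actual code:

```go
package cache

// Resource is a simplified stand-in for the scheduler's resource bookkeeping.
type Resource struct {
	MilliCPU int64
	Memory   int64
}

// NodeInfo is a much-reduced sketch of the scheduler's per-node cache entry.
type NodeInfo struct {
	requestedResource   *Resource
	allocatableResource *Resource
}

// NewNodeInfo initializes every pointer field, so a NodeInfo built this way
// can never hit the nil dereference seen in Clone.
func NewNodeInfo() *NodeInfo {
	return &NodeInfo{
		requestedResource:   &Resource{},
		allocatableResource: &Resource{},
	}
}

// Clone is where the reported crash would happen if a NodeInfo were built
// with a bare struct literal (ni := &NodeInfo{}) instead of NewNodeInfo.
// Guarding each dereference is the defensive alternative to auditing every
// construction site.
func (n *NodeInfo) Clone() *NodeInfo {
	c := NewNodeInfo()
	if n.requestedResource != nil {
		*c.requestedResource = *n.requestedResource
	}
	if n.allocatableResource != nil {
		*c.allocatableResource = *n.allocatableResource
	}
	return c
}
```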
A: There is a direct initialization somewhere in the code, I would assume, and it's probably a path that is not very commonly executed; otherwise we would have seen a lot more crashes. So we can actually look carefully, we can actually look a little bit more carefully, and maybe we can find it. But anyway, I guess it doesn't hurt.
A: The other one is sort of like a race condition in the preemption logic. For that one: sometimes, when a nominated pod is being rescheduled, the fact that it is nominated is removed from the scheduling queue. So the scheduling queue, while looking at the pod, or while trying to schedule a nominated pod, doesn't have the information that this pod is nominated. So when the pod goes back, let's say that the nominated pod does not get scheduled.
A: As a result, as a pod is being tried after this nominated pod, it may be considered schedulable: it gets scheduled, and then later on it gets preempted again by the high-priority pod, basically by the same nominated pod. So this race condition can be fixed by basically leaving the nominated pod information in the queue until the pod is actually assumed. So he has that fix.
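A hypothetical Go sketch of the idea behind the fix: keep the nomination record until the pod is assumed, rather than dropping it when the pod is popped for another scheduling attempt. The names and structure are illustrative, not the actual scheduling-queue code:

```go
package queue

import "sync"

// nominatedPods remembers which node each preempting pod was nominated to.
type nominatedPods struct {
	mu    sync.Mutex
	byPod map[string]string // pod UID -> nominated node name
}

func newNominatedPods() *nominatedPods {
	return &nominatedPods{byPod: map[string]string{}}
}

// add records the nomination when preemption picks victims on a node.
func (n *nominatedPods) add(podUID, nodeName string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.byPod[podUID] = nodeName
}

// onPodAssumed is the only place the nomination is cleared. Clearing it
// earlier, e.g. when the pod is merely retried, opens the race described
// above: the queue forgets the pod is nominated, another pod takes the
// freed space, and the high-priority pod has to preempt all over again.
func (n *nominatedPods) onPodAssumed(podUID string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	delete(n.byPod, podUID)
}
```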
A: He is working on a test. He has actually written a pretty sophisticated end-to-end test for this, which can reproduce the issue, but the test is brittle in my opinion: it counts the number of events happening for pods and stuff like that, which is gonna fail in the near future with changes of logic in many, many other places in the code. So I asked him to see if we can reproduce this with an integration test. Of course, since this is like a race condition and it's kind of rare...
A: It may not be reproducible all the time, but if we can reproduce it with some probability, I don't know, 5% or so, that is enough; and then, when we have the fix, we can check whether it is still reproducible or not. So I guess I asked this on the bug: have you tried stress-testing this, which is actually running the test under stress, maybe four thousand times or something, to see if it fails? Yeah.
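For the stress run, something along these lines is presumably what is meant; the test name and package path are placeholders, `-count` is a standard `go test` flag, and `stress` is the helper from golang.org/x/tools/cmd/stress:

```sh
# Repeat the integration test enough times that a ~5% reproduction rate
# would almost certainly fail at least once.
go test ./test/integration/scheduler -run TestPreemptionRace -count=4000

# Or compile once and hammer it in parallel with the stress helper.
go test -c -o preemption.test ./test/integration/scheduler
stress -p 8 ./preemption.test -test.run TestPreemptionRace
```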
B: I'll try it a little bit this afternoon, but I just haven't had luck reproducing it using the integration test. Basically, every time, the preemptor gets the chance to be rescheduled prior to the other pending pods, and the first time it gets that chance it actually gets scheduled and ends up running on the node, so there is no race after that. So I think the only way it happens in the real world is that the preempted pods are still being terminated, still in their grace period.
A: No, no, you don't need that, I guess. You know, the graceful-termination period equal to zero was for a different test; in this particular test, I think, if we set it, it should work. Maybe we didn't want to unnecessarily wait or make the test slow in some other cases, but feel free to increase it. Probably like one or two seconds would be good enough for your case, right? Yeah.
A: So when a pod is tried and goes to the unschedulable pod queue, once a condition changes in the cluster, the pod is moved from the unschedulable queue to the active queue. In larger clusters with thousands of nodes, let's say, the events that we receive for nodes are very frequent, because nodes send updates every 10 seconds, yeah. So basically, let's say that you have a thousand nodes in your cluster: chances are that you receive on average 100 updates per second, meaning a pod in the unschedulable queue goes frequently back to the active queue, yeah, right.
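A rough Go sketch of the mechanism, with the arithmetic in comments; the real PriorityQueue is more involved, and the names here are simplified assumptions:

```go
package queue

// PriorityQueue is a toy model of the scheduler's two queues.
type PriorityQueue struct {
	activeQ        []string        // pod keys waiting to be tried
	unschedulableQ map[string]bool // pods parked after failing to schedule
}

// onNodeUpdate runs for every node status report. Kubelets report roughly
// every 10 seconds, so a 1000-node cluster triggers this about
// 1000/10 = 100 times per second, and parked pods barely stay parked.
func (p *PriorityQueue) onNodeUpdate() {
	p.moveAllToActiveQueue()
}

func (p *PriorityQueue) moveAllToActiveQueue() {
	for key := range p.unschedulableQ {
		p.activeQ = append(p.activeQ, key)
		delete(p.unschedulableQ, key)
	}
}
```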
A: That's the blocking thing that I'm worried about in larger clusters. I saw your comment there; sorry, I just saw it right before this meeting, so I didn't have the chance to reply. But I guess in your case, since the clusters are small, probably you don't see this issue, right? Because a pod that is in the unschedulable queue is not gonna go very frequently to the active queue, yeah.
D: But without that condition, don't you think it's all random anyway? Some of the pods are gonna go into a pending state anyway, so at least, what we were thinking is, at least we can determine how the pods are getting there. I understand the issue, but what if people need these pods?
D: The scenario we were trying to replicate is like this: we had a lot of jobs, which are long-running jobs, and in most of the cases the job which was next was not picked up. If there were, like, four jobs running, you expect the fifth job to get picked up when the resources are present for the fifth one, and in nine out of ten cases the seventh or the eighth one got picked instead. So we were wondering, like, even if...
D: Okay, if you say that's an issue, maybe we could put that behind a flag; the logic is just a few lines. So if you put in a flag, then where people really need in-order execution they could use that flag, and your priority queue could handle in-order arrival of pods in that case. That would be really helpful, yeah.
A: Yeah. You know, another thing that has actually been in the back of my mind for a long time now is the fact that we shouldn't be retrying all these unschedulable pods very frequently; we should have a back-off mechanism. And in the past, before the priority queue, with the queue that we used to have, we had a back-off mechanism.
A: So if we tried the pod and it was unschedulable, we would add a little bit of back-off, and the back-off would get incremented exponentially, up to a certain limit. For example, before, it was incremented up to a minute; a minute is too long, but maybe if we increment it up to 10 seconds or something, it would be good enough for now.
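A minimal sketch of that back-off calculation in Go; the starting value is an assumption, and the 10-second cap is the number floated above:

```go
package queue

import "time"

const (
	initialBackoff = 100 * time.Millisecond // assumed starting back-off
	maxBackoff     = 10 * time.Second       // cap suggested in the meeting
)

// backoffFor returns how long a pod should wait before its next scheduling
// attempt: doubled after every failure, capped so the pod is never starved.
func backoffFor(failedAttempts int) time.Duration {
	d := initialBackoff
	for i := 1; i < failedAttempts; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	return d
}
```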
B: Yes, I can talk with him about picking up his stuff and continuing on from there.
A: Yeah, if he can still work on it. Actually, the last time that I reviewed that PR, I felt like it was almost there; I had some comments about changing some smaller things, if I'm not mistaken. It was long ago, and I don't have a very good recollection of it, but I feel like it was almost there, and it probably needs a little bit more work.
A: The scheduler would have tried several times, and then this pod would get a back-off and wouldn't be put into the active queue after, like, several tries, so it would have waited a little longer, allowing other pods behind it to get scheduled. And if we had this back-off mechanism, probably we wouldn't even need to have any flags or anything; we could just sort everything by priority and then by creation time.
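That ordering, priority first and then age, is a simple two-key comparison; a minimal sketch in Go (podEntry is illustrative, not the queue's real element type):

```go
package queue

import "time"

type podEntry struct {
	priority  int32
	createdAt time.Time
}

// less orders the active queue: higher priority first, and among equal
// priorities, older pods first. Combined with per-pod back-off, this could
// give batch users roughly in-order execution without a dedicated flag.
func less(a, b podEntry) bool {
	if a.priority != b.priority {
		return a.priority > b.priority
	}
	return a.createdAt.Before(b.createdAt)
}
```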
A: So I'm aware of this. I'm aware of the fact that this is actually a feature that a lot of batch jobs require, yeah, and we want to try to address this in the scheduler, as well as in our pod creation logic. This is actually something that we sometimes refer to as queue jobs: there are jobs that are queued and should be scheduled based on their order in the queue, all right. Another thing with these jobs: you know, probably you don't have a lot of concerns about quota, yeah?
A: Some users, at least ones who run multi-tenant clusters, have concerns about quota as well. So you have, like, queue jobs, and you don't want these queue jobs to consume your quota if they shouldn't be scheduled. For example, if you have a job that should get scheduled and finish, and then the next one should be scheduled, you don't want all of those to use all your quota from the very beginning.
D: Yeah, that is one of the concerns. We have spoken to Klaus about that as well, and we were trying to figure out how he's able to manage that in the kube-batch scheduler, because, like, certain teams running machine learning jobs... yeah, like there are users who never get a chance to submit their jobs, and that's a big headache for us as well. Right.
A: Right. Actually, today I spoke with one big company. I cannot reveal the name, but you all know it: probably the largest company, at least in terms of market cap, in the world. Those guys are switching from Mesos to Kubernetes, which is actually good news for us. I met with their scheduling team today, and they are probably gonna start contributing a lot of code to us, to Kubernetes, and particularly to its scheduling.
A: Those guys are also interested in some of these features, in particular some features around batch. You know, Kubernetes is already a pretty good platform for running services, but when it comes to batch, we still need quite a few more features for batch scheduling and batch admission, yeah. So this queue jobs feature is certainly one of those. We're going to start adding more features, particularly implemented...
A: But, you know, I mean, according to the conversations that we have had with Klaus, the plan going forward is to build this scheduling framework, and then, after prototyping gang scheduling in an incubator project, which is kube-batch, we bring the logic and everything in as plug-ins for the scheduling framework. Whether it's going to be one scheduler or two schedulers is something that we need to discuss; it's not finalized yet, but at least today our plan is to build everything in the same codebase, yeah.
D: Because some people have started using kube-batch and the default scheduler, and one thing which they have noticed is that when two schedulers are competing for the resources, some of the pods don't behave appropriately: like, both get assigned to a node, and then one finds memory unavailable or something, and they just back off. So that's another concern which we're going to look at after this one gets solved. When two schedulers are competing for the same resources, I don't know how they're gonna manage that, but that...
A: When you have multiple schedulers, and both of them are very aggressively looking for resources, for example if both of them have a lot of pending pods, then chances are you will see this problem. This is probably less common in clusters where there are not a lot of pending pods, or where you have a lot of free resources available.
A: This situation probably happens with lower probability there, but in well-utilized clusters this is a common issue. So for that, the reason, actually, that I said we haven't decided yet is exactly because we haven't experimented with these two schedulers enough to tell you whether we should go with a single one or whether two works reasonably well, okay.
C: Yeah, yeah, I have some quick updates. As I mentioned last week, I encountered some issues with the POC code, and I spent a whole week on it. I think the issue has been fixed, so next week I can create an issue with a task list and contribute some PRs based on it. I think the first PR of it will be the cleanup PR, okay, and yeah, the feature PRs will follow behind it. Nice.
A: Yeah, the issue is there; we've mentioned you, I guess, and that is really how you can find it. Okay, what else? I guess these are all the things that I have in mind. I would like to emphasize again that next week we are not gonna have any meeting, by the way. Are you folks going to KubeCon Seattle? I'm going, so, okay, I will see you there.