From YouTube: Kubernetes SIG Scheduling Meeting - 2018-12-20
A
KubeCon was useful for getting some feedback from customers. There have been a lot of questions regarding scalability of the scheduler. Some people are asking about larger clusters and saying they cannot do it, partly because of the scheduler. We know that there are other bottlenecks in Kubernetes, but definitely we would like to address some of the issues with scalability.
A
The scheduler in particular, and hopefully we can do that in the next quarter, and maybe further work in the following ones. There have been some questions regarding the scheduling framework. A lot of people are waiting for that; they would like to customize the scheduler for their own workloads. We have a tentative plan to have some of these extension points in 1.14, and hopefully we can go from there.
A
Basically, we will change that and make it a little bit more generic, usable, and sort of future-proof. We'll see how that goes. But anyhow, our plan is to have something usable in 1.14, and then probably in 1.15 we want to have a fully fledged scheduling framework with a lot of extension points, something that we can use for building our own scheduler: basically moving our own priority and predicate functions, and hopefully the preemption logic, to plug-ins.
A
Sure, yeah, I can actually write something down. So on scalability, the questions were basically around two topics. One is throughput of the scheduler: some people want higher throughput from the scheduler. An example was Uber. Uber was comparing Kubernetes to the system that they operate in house, and they were saying that their system is a lot more performant compared to Kubernetes.
A
Although I feel like some of their numbers were a little off; we have some scalability numbers that do not fully agree with the numbers they presented. Maybe they had a slightly different setup, or maybe they measured something else. But anyhow, one request is basically having higher throughput. The other question is supporting larger clusters. The scheduler itself does not have much of a problem with larger clusters. Of course you would require more memory, but still the memory usage and everything is not completely unreasonable.
A
So I don't see any huge issues with raising the cluster size as far as the scheduler is concerned. However, it depends on what the expectations for throughput are. If you raise the cluster size to, say, 10,000 nodes, the Kubernetes scheduler's throughput drops to around 20 or 30 pods per second at maximum, which may not be acceptable for everyone.
A
So these are basically the two main things related to us as SIG Scheduling, but we know that Kubernetes has some other issues around scalability as well: the number of pods in the system, the number of namespaces, the number of volumes that you can attach to nodes, and, you know, DNS records, iptables records and rules. All of those can become an issue if you go beyond a certain cluster size.
A
Yeah, there are some folks who have done this. So, for example, I was at a talk by Atlassian, and those guys have done something like this: instead of having a small number of large clusters, they have gone with a large number of relatively small clusters; not very small, but relatively small.
A
There definitely is an issue about that, but even if you build tools to manage a large number of clusters, you still cannot achieve the same resource utilization in a large number of small clusters as in one large cluster with a lot of nodes. Partly because, you know, it's similar to sharing: if you basically give a cluster to one or two teams, let's say those guys always submit their jobs during the daytime, and then once the work hours are over, the cluster remains empty.
A
But if the cluster is larger and you share it among, I don't know, 50 teams, chances are there are at least a few teams, at all times of the 24 hours, that have some workloads to run on your cluster. So when you share it, usually the resource utilization is higher. You can overcommit your resources, and many of these options become available that don't exist in small clusters; by that I mean overcommitting usually doesn't happen in smaller clusters.
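As a rough illustration of this time-sharing argument, here is a small Go sketch; every number in it (50 teams, 100 cores per team, 8 busy hours, a 1.5x combined peak) is a made-up assumption for illustration, not a figure from the meeting.

```go
package main

import "fmt"

// utilization returns average demand as a percentage of provisioned capacity.
func utilization(capacity, avgDemand float64) float64 {
	return 100 * avgDemand / capacity
}

func main() {
	const teams = 50.0
	const peakPerTeam = 100.0         // peak cores per team (hypothetical)
	const activeFraction = 8.0 / 24.0 // each team busy roughly 8h a day

	// Dedicated clusters: each must be sized for its own team's peak.
	dedicated := teams * peakPerTeam
	avg := dedicated * activeFraction

	// Shared cluster: with staggered usage, assume the combined peak is
	// only 1.5x the average demand, so it can be provisioned much smaller.
	shared := avg * 1.5

	fmt.Printf("dedicated clusters: %.0f%% average utilization\n", utilization(dedicated, avg))
	fmt.Printf("one shared cluster: %.0f%% average utilization\n", utilization(shared, avg))
}
```

Under these assumptions the dedicated clusters sit around a third utilized while the shared cluster is roughly twice as efficient, which is the effect being described.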
B
What I want to add is that, for the scalability problem, as anyone notices, the scheduler is not the only thing in the critical path; I mean, it's not the only part of making it scalable. Actually, the etcd and API server path matters a lot. So if you check the etcd repo, you can see that we are trying to solve this problem from the etcd side as well. You can check the patches, and also we are refactoring the backend database of etcd.
A
Thanks for the update. I wasn't aware of all the efforts in that area, but I totally agree with you; etcd in particular is one of the bottlenecks here. We have always been aware of it and have been trying to address it, but I'm glad that there have been some important improvements over there. So is this mostly about etcd, or both etcd and the API server?
B
Currently, most work is done on the etcd side. You can track it upstream; we are doing this totally in the open. Almost everything is upstream, and you can check the PRs. After that we will also do something like indexing and pagination on the API server side, which will also help a lot, and also split events out of etcd, so we can have a much larger cluster, yeah.
A
All right, that's good. So yeah, maybe the scheduler is not the main bottleneck here, but the scheduler is certainly a problem in large clusters: if you go to something like five-thousand-node clusters, the scheduler becomes a problem. etcd, of course, becomes a problem if you create a lot of objects as well; etcd can handle a certain number of objects, but if you want a higher pod throughput on top of that, you cannot get it in a five-thousand-node cluster, at least today.
A
So that's why we're trying to address those as well. Scalability of Kubernetes in general is a problem that cannot be solved by fixing one component. This is a team effort, and many teams need to contribute to make it more scalable. As I said, there are many areas that need improvement, and the scheduler is one of those.

A
All right, so one more update about the scheduler. We identified a relatively serious issue in the preemption logic. We see there is a race condition between setting a nominated node name for a pod and the next scheduling cycle. So when we set the nominated node name for a pod within the first scheduling cycle, for, let's say, pod number one, then the next scheduling cycle, for pod number two, may start before that nominated-node event has arrived.
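A minimal sketch of the race being described, with invented names (nominatedPods, preempt, nextCycle) rather than the scheduler's real types: because the nomination is recorded through a deferred event, the next scheduling cycle can read a stale view and hand out the resources that preemption just freed.

```go
package main

import "fmt"

// nominatedPods maps node name -> pods nominated onto that node after a
// preemption; in the real scheduler this state lives in the scheduling queue.
var nominatedPods = map[string][]string{}

// preempt decides that pod should eventually run on node, but the shared
// state is only updated when the emitted event is processed later.
func preempt(pod, node string, events chan<- func()) {
	events <- func() {
		nominatedPods[node] = append(nominatedPods[node], pod)
	}
}

// nextCycle models the following scheduling cycle: it consults nominatedPods
// to avoid handing the freed-up resources to a different pod.
func nextCycle(node string) bool {
	return len(nominatedPods[node]) > 0 // false => stale view, i.e. the race
}

func main() {
	events := make(chan func(), 1)
	preempt("pod-1", "node-a", events)

	// The cycle for pod-2 starts before the nomination event is processed:
	fmt.Println("cycle 2 sees nomination:", nextCycle("node-a"))

	(<-events)() // the event arrives only now, too late for cycle 2
	fmt.Println("after event, nominated on node-a:", nominatedPods["node-a"])
}
```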
A
We are trying to address that; I will hopefully send a PR today or tomorrow, although today is unlikely. So that's about that. I also found a couple of other issues, again related to scalability, with setting the percentage of nodes that we score in each scheduling cycle dynamically. Basically, if it is not specified as an argument, then the scheduler has a logic that determines the percentage of nodes to score dynamically.
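As a sketch of what such adaptive logic can look like: small clusters are scored fully, and the percentage shrinks as the cluster grows, with a floor. The shape matches the idea described here, but the constants and function names below are illustrative assumptions, not the exact upstream code.

```go
package main

import "fmt"

const (
	minFeasibleNodesToFind = 100 // below this size, just score every node
	minPercentage          = 5   // never drop below 5% of the cluster
)

// numNodesToScore returns how many nodes a scheduling cycle should score
// for a cluster of the given size.
func numNodesToScore(numNodes int) int {
	if numNodes <= minFeasibleNodesToFind {
		return numNodes
	}
	// Linearly decrease from 50% toward the floor as the cluster grows.
	pct := 50 - numNodes/125
	if pct < minPercentage {
		pct = minPercentage
	}
	n := numNodes * pct / 100
	if n < minFeasibleNodesToFind {
		n = minFeasibleNodesToFind
	}
	return n
}

func main() {
	for _, n := range []int{50, 1000, 5000} {
		fmt.Printf("%d nodes -> score %d of them\n", n, numNodesToScore(n))
	}
}
```

The trade-off is the one discussed above: scoring fewer nodes keeps per-cycle latency bounded in very large clusters, at the cost of possibly missing the globally best node.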
E
Some thoughts from the GitHub discussion: I think the end result of the last discussion was basically that we were trying to see if we can separate the event handlers in two, so that wherever we add to the scheduling queue, we put it in the queue package, in scheduler/internal/queue, and wherever we add to the cache, we add it in the other package, right. But the issue that we were discussing was that that has an ordering problem.
E
It might happen that an event goes into one of the two and not the other. And the way we have the code structured in the scheduler, we have that dependency; I don't think we can easily restructure it. So, without doing too much re-architecture, it feels like adding to both the scheduler cache and the pod queue should happen together, as an atomic operation.
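A minimal sketch of that atomic option, with made-up types rather than the scheduler's real ones: a single owner holds one lock, and every event applies its cache update and its queue update inside the same critical section, so no reader can observe one without the other.

```go
package main

import (
	"fmt"
	"sync"
)

type cache struct{ pods map[string]bool }
type queue struct{ pods []string }

// binder owns both structures and serializes every event through one mutex.
type binder struct {
	mu    sync.Mutex
	cache cache
	queue queue
}

// onPodAdd applies both updates under the same critical section, so the
// cache and the queue can never disagree about whether the pod exists.
func (b *binder) onPodAdd(name string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.cache.pods[name] = true
	b.queue.pods = append(b.queue.pods, name)
}

func main() {
	b := &binder{cache: cache{pods: map[string]bool{}}}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			b.onPodAdd(fmt.Sprintf("pod-%d", i))
		}(i)
	}
	wg.Wait()
	// The two views always agree, even under concurrent events.
	fmt.Println(len(b.cache.pods) == len(b.queue.pods))
}
```

The cost of this design is exactly the coupling discussed next: one place has to know about both the cache and the queue.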
E
So with that in mind: right now those add-event handlers are being called from factory.go, and my current PR moves them to scheduler.go, in the outer scheduler package. So yeah, I guess I have laid out a few other options. For example, should we move those to a common package parallel to queue and cache? We have internal/queue and internal/cache; so do something like internal/common and move everything there, or keep it the way it is in the current PR, which is in package scheduler.
A
Yeah, for sure. I actually thought about what Jonathan said as well. We definitely have a logical dependency on the ordering of updating the queue and the cache, and we cannot easily get rid of it. So one option is to basically have the event handlers in one place, but make that one place aware of the other. For example, you could move the event handlers to the cache, but pass a pointer to the queue into the cache, so that the cache updates the queue as well. But this is not, I don't know.
A
This is not necessarily a very clean design either; we are solving one problem and creating another, right. So I kind of like the idea of having maybe a separate common package, but it doesn't have to be internal/common. It can be, like, internal/eventhandlers or something like that, sure, yeah. Something like that sounds to me like a better option, but I have to actually look at the code eventually to see if it all fits.
E
I guess the other question is: I think what Jonathan was suggesting was to pass the informers into NewCache and NewQueue so that we can add them inside, right. So if we create a new package and put all the definitions of those event handlers there, then where do we call that from? I guess that's the question.
E
I think the main issue here is that we didn't want the event handlers to be in factory.go, which is like the internal part of the scheduler, right; they don't really belong to the factory part. They belong to the outer part. So should we keep them in the scheduler package and not in something internal?
E
It could be something in internal/common or internal/eventhandler as well, but then we will not be able to call it: if we put it inside internal, we will not be able to call it from something outside internal, right. And if we put it inside internal, then we cannot put it in the queue or the cache, because that will have the ordering issue. Or, I mean, the other way to look at it...
F
Yeah, right here. So regarding the feature to support pod affinity jointly on multiple pods: basically the API is settled, and also the performance test. The benchmark test, when there isn't any affinity, shows a decent performance difference; comparing the same codebase with and without the PR, I do see a slight increase of 6 or 7 percent.
F
And I mean I tested it against the current code base, right, and I also tested against the 1.13 code base, both without my PR. There seems to be a slight performance increase between these two versions. I guess maybe it's because we have done a lot of refactoring; we also had some issues, like what we are doing with the timestamp of pods of the same priority, right. So I'm not exactly sure whether it's because of that that I see some performance increase.
F
Okay, so I think it's actually correlated with our benchmark testing. What can cut performance is when we have a lot of affinity terms, which is not the typical case, right; we usually don't have too many pods with affinity. The other factor that may impact performance is when, for example, a lot of pods can satisfy the affinity and they are in different topology domains. In our benchmark case we just put them all in the same zone, so there is no sorting, and the intersection-calculation logic wasn't hit.
A
Thank you very much for the update. Is there anything else? No, this is it? Okay. And thank you very much, Harry, for the removal of the equivalence cache. I know that was a huge effort, because we kept running into issues with rebasing that PR, but finally it's finished. Thank you very much for your help, and we are looking forward to seeing the next phase of the project. Yeah, thank you. All right, guys, this is the end of our meeting.