Description
This was an unconference session and therefore has no proper description.
A
All right, hi everyone, thank you for coming. I am Andrew Chen, and that is Dominik Tornow. Just to give you some context: this kind of came out of a project from SIG Docs, where we're using systems modeling to try to explain better how Kubernetes works. So this is sort of a case study with the scheduler. Then let me hand it over.
B
All right, just a quick question, because I'm curious. For myself, I had the hardest time when I got started with Kubernetes to wrap my head around it: I had the hardest time understanding what it actually is, and I had the hardest time understanding how it works. Does anybody else share this feeling? All right, cool, yeah, a few of us. So I would argue that Kubernetes does have a problem with complexity.
B
However, I would actually argue that Kubernetes has a problem with its perceived complexity. The perceived complexity of Kubernetes is very high; the conceptual complexity of Kubernetes, however, is actually fairly low. It has a few basic patterns that it applies over and over, and the understanding of Kubernetes relies on the understanding of these patterns. So I believe that the problem is a problem with communication, and not a problem with engineering.
B
So, as Andrew already said, this presentation is part of a larger collaborative effort between the CNCF, Google, and SAP to advance the understanding of Kubernetes and its underlying concepts using a systems modeling approach. So at the end of this presentation, I would be pleased if you shared your feedback about the model and about the modeling approach that we use in this presentation. So, we talked about modeling, so let's model the scheduler. This diagram depicts the high-level architecture of Kubernetes. Like every other component of Kubernetes, the scheduler monitors and modifies objects in the Kubernetes object store.
B
The sequence of events and actions around the scheduler can be summarized as follows: after a user or controller creates a pod, the scheduler, monitoring the object store for unassigned pods, will assign this pod to a node. Subsequently, the kubelet, monitoring the object store for assigned pods, will execute this pod.
B
Please note in this diagram: there is a node name in the pod spec, which we will come to later, that assigns, or actually pre-assigns, a node to a pod before it is actually being scheduled. And please also note the scheduler name in the pod spec, used to assign a custom scheduler to a pod. We will entirely ignore custom schedulers in this presentation.
B
So, equipped with this knowledge of Kubernetes objects, we can now define that a pod p is assigned, or bound, to a node n via a binding b if and only if the binding's name equals the pod's name, the binding's namespace equals the pod's namespace, and the binding's target name equals the node's name. The existence of this binding signals to the kubelet on this node to execute this pod.
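The binding condition above can be sketched as a predicate. This is a hypothetical illustration, not the actual Kubernetes source: the dataclasses below are simplified stand-ins whose fields mirror the Binding, Pod, and Node objects (metadata.name, metadata.namespace, target.name).

```python
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    namespace: str

@dataclass
class Node:
    name: str

@dataclass
class Binding:
    name: str
    namespace: str
    target_name: str

def is_bound(pod: Pod, node: Node, binding: Binding) -> bool:
    """Pod p is bound to node n via binding b iff all three fields match."""
    return (binding.name == pod.name
            and binding.namespace == pod.namespace
            and binding.target_name == node.name)
```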
B
So, equipped with this knowledge, we may now formally describe the task of the scheduler s: the scheduler, for a pod p, selects a node n and creates a binding b so that the binding's name equals the pod's name, the binding's namespace equals the pod's namespace, and the binding's target name equals the node's name.
B
So this is a specification of the control loop of the scheduler in TLA+, the Temporal Logic of Actions. The control loop of the scheduler is actually fairly straightforward. Now, let me add this: the scheduler is a complex piece of software. It has many, many lines of code. It is not easy to understand. However, the specification of the scheduler is actually fairly straightforward, fairly simple: it is one loop, it has one if statement, and it selects a node for a pod.
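The control loop just described can be sketched in a few lines: one loop, one if statement, one selection step. This is a hypothetical, in-memory illustration; the dict-based "object store" and the placeholder select_node() are stand-ins, not the Kubernetes API.

```python
def select_node(nodes, pod):
    # Placeholder selection: pick any node. The real two-step
    # filter/rate process is discussed later in the talk.
    return nodes[0] if nodes else None

def scheduler_pass(store):
    """One pass of the scheduler's control loop over the object store."""
    for pod in (p for p in store["pods"] if p.get("node_name") is None):
        node = select_node(store["nodes"], pod)
        if node is not None:  # the single "if": a node was found
            # Creating the binding is what assigns the pod to the node.
            store["bindings"].append({"name": pod["name"],
                                      "namespace": pod["namespace"],
                                      "target_name": node})
            pod["node_name"] = node
```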
B
So the complexity of the scheduler is apparently in the selection process of the node for a pod. So, let's dive into that. Selecting a node is a two-step process: first, the scheduler selects a subset of nodes that are qualified to host this pod, and then, second, the scheduler selects a node from this subset with the highest rating for this pod.
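The two-step selection can be sketched as follows. This is a hypothetical simplification: the filters and raters arguments are assumed lists of predicate and scoring functions, not the scheduler's real plugin interfaces.

```python
def select_node(nodes, pod, filters, raters):
    # Step 1: keep only the nodes that pass every filter for this pod.
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # nothing qualified: the pod stays pending
    # Step 2: sum up the individual ratings and take a highest-scoring node.
    return max(feasible, key=lambda n: sum(r(pod, n) for r in raters))
```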
B
So, let's start with a simple one: let's start with a few sanity checks. This diagram depicts the relevant attributes of a node for the sanity checks. Most importantly, we have unschedulable, we have running, and we have ready. So here is the filter function under which the scheduler may assign a pod p to a node n.
B
A quick highlight: the tolerations are defined on the pod, and the taints are defined on the node. And just like before, we can define what taints and tolerations do with an actually simple expression: the scheduler may assign a pod p to a node n if and only if, for each taint that is an element of the node's spec taints, there is a toleration that is an element of the pod's spec tolerations so that the toleration matches the taint. Once again:
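The taint/toleration rule just stated can be sketched as a predicate: the scheduler may assign pod p to node n if and only if every taint on the node is matched by some toleration on the pod. The matches() helper here is a deliberate simplification; real Kubernetes tolerations also carry operators and effects.

```python
def matches(toleration, taint):
    # Simplified matching: exact key and value equality.
    return (toleration["key"] == taint["key"]
            and toleration["value"] == taint["value"])

def tolerates(pod, node):
    # Every taint on the node must have a matching toleration on the pod.
    return all(any(matches(tol, taint) for tol in pod["tolerations"])
               for taint in node["taints"])
```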
B
If this is not the case, the node is not considered qualified for hosting this pod. It is similar, always the same pattern, when it comes to affinity. Personally, I have to say, I had a harder time wrapping my head around taints and tolerations and an easier time understanding affinity. I was actually surprised to see that the data structures for taints are actually very little and innocent, while the data structures around affinity, and the formulas surrounding them, are actually much harder.
B
So once again, we have a formula: the scheduler may assign a pod p to a node n if and only if node affinity, pod affinity, and pod anti-affinity hold true. The node affinity holds true if and only if there is at least one node selector term in the pod's spec affinity node affinity so that the node selector matches the node. Same with the pod affinity: the pod affinity holds true if and only if all pod affinity terms in the pod's spec affinity pod affinity match the node. And last, for the pod anti-affinity: the pod anti-affinity holds true if no pod affinity term in the pod's spec affinity pod anti-affinity matches the node.
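The three affinity rules reduce to three quantifiers, which a short sketch makes visible. This is a hypothetical simplification: the term_matches argument stands in for the real nodeSelectorTerm / podAffinityTerm matching logic.

```python
# node affinity     -> at least one term matches (any)
# pod affinity      -> all terms match (all)
# pod anti-affinity -> no term matches (not any)
def affinity_allows(pod, node, term_matches):
    aff = pod["affinity"]
    node_ok = (not aff["node_affinity"]
               or any(term_matches(t, node) for t in aff["node_affinity"]))
    pod_ok = all(term_matches(t, node) for t in aff["pod_affinity"])
    anti_ok = not any(term_matches(t, node) for t in aff["pod_anti_affinity"])
    return node_ok and pod_ok and anti_ok
```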
You see the repeating pattern: as soon as a node passes this filter, it may be considered for scheduling; if it does not pass this filter, it is taken out of the list of candidates. So the filter functions select nodes, right, and then the rating functions assign a score to the node, very similar to the filter functions.
B
The rating function applies individual ratings to pods and nodes, right, sums up these ratings, and then, from the set of nodes, takes the highest-rated ones. So, rating functions: anybody want to see more formulas? Anybody want to see more graphs, anything like that? All right, enough graphs, enough math. For more details, please do visit our blog post. We will release a blog post after KubeCon that talks about the scheduler and will include the nitty-gritty details, but for now we can do a simple case study, right?
B
So we have a cluster, and on this cluster we have nodes with a GPU and nodes without a GPU. I'm pretty sure half of you are already sick of this example, but we're going to stick with it. And we have a set of pods that do not require the GPU, and we have a set of pods that require the GPU. Now, what is our objective here? Well:
A
B
So the first thing we're going to do is add a taint. As soon as we add the taint to the nodes with the GPU, none of these pods is eligible to run on any of the GPU nodes; they are only eligible to run on the nodes without the GPUs, since they do not have any toleration specified yet. Now, this is the first trip-up: as soon as you add a toleration to the pods that require GPU, you are not done.
B
The pods that require GPU may be scheduled on the nodes with the GPU, but nothing in the formula, if we remember the formula, tells the scheduler that they have to be scheduled on the nodes with the GPU. You need a different method, basically a different filter, for that. So the second thing you do is label these nodes in preparation for node affinity. Again, this didn't change any of the possible assignments just yet; you have to add affinity to the pods that require GPU.
B
So as soon as you actually do add this affinity, we have reached our mission statement, right? The pods that do not require GPU are only eligible to ever be scheduled on nodes without the GPU, and the pods that require the GPU are only eligible to be scheduled on the nodes with the GPU. Now, we did not venture into the rating functions. So, for example, we did not venture into preferred affinity and preferred anti-affinity, if you want to say something like: spread my workload of GPU pods evenly, and not just on one machine.
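The whole case study can be sketched end to end: the taint keeps ordinary pods off the GPU nodes, and node affinity, via a node label, keeps GPU pods off the ordinary nodes. The object shapes below are simplified, hypothetical stand-ins, not real Kubernetes manifests.

```python
def eligible(pod, node):
    # Taint/toleration filter: every taint on the node must be tolerated.
    tolerated = all(t in pod["tolerations"] for t in node["taints"])
    # Node affinity filter: every required label must be present on the node.
    affine = all(node["labels"].get(k) == v
                 for k, v in pod["required_labels"].items())
    return tolerated and affine

gpu_node = {"taints": ["gpu"], "labels": {"accelerator": "gpu"}}
cpu_node = {"taints": [], "labels": {}}
gpu_pod = {"tolerations": ["gpu"], "required_labels": {"accelerator": "gpu"}}
web_pod = {"tolerations": [], "required_labels": {}}
```

With both mechanisms in place, the web pod fails on the GPU node because of the taint, the GPU pod fails on the plain node because of the affinity, and each pod is eligible only on its own kind of node.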
C
B
OK, I actually do not have the ranking functions included in the presentation, so let me try to basically paint a picture without a slide. For the ranking functions, a common user-facing example is pod affinity and pod anti-affinity. You can require that a pod does not run on the same node as another pod, right? You can require that, and that is a filter function. So if the node is already inhabited by such a pod, the scheduler will not take this node into account.
B
However, when you use the preferred qualifier, it has the quality of a rating function. So if the scheduler finds a node that is uninhabited by the pod in question, it will automatically rank the node higher, and you also have the possibility to add a custom weight to that. However, if the node is pre-populated with the pod in question, the node will be ranked lower; but that will not stop the scheduler from scheduling it on this node if it doesn't find any other node that actually ranks higher.
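The difference can be sketched as follows: preferred (soft) anti-affinity acts as a rating function, so an inhabited node scores lower, optionally by a custom weight, but is never filtered out; the scheduler can still fall back to it. The object shapes are hypothetical stand-ins.

```python
def soft_anti_affinity_score(pod, node, weight=1):
    # Uninhabited nodes rank higher; inhabited ones score zero but stay eligible.
    return weight if pod["avoid"] not in node["pods"] else 0

def pick_node(pod, nodes, weight=1):
    # Highest score wins; an inhabited node can still win if it is the
    # only candidate.
    return max(nodes, key=lambda n: soft_anti_affinity_score(pod, n, weight))
```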
D
B
In this case, I could add labels to these pods, I'm sorry, to these nodes, and then specify a pod affinity for both of the sets of nodes. So you do have this possibility. There are some reasons why you would not want to do that. One, for example, in this case, would be:
B
You would actually have to label each and every node, and you would have to label each and every pod, and that means you have to alter the pod template of any deployment you have, any replica set you have, any cron job you have, any job you have. And yes, you could actually come to the same result. I'm not sure this is the case in all cases and under all circumstances, but in this case you could come to the same result; it just gets unwieldy very quickly.
E
B
Yes, this is absolutely true. So, from the point of view of the scheduler, GPU means nothing to it: the taint is an opaque value, but this taint matches that toleration, and in that case, in the label, there would probably be a mention of GPU. But for the scheduler, you are right: this means absolutely nothing. So this is entirely in your domain and in your responsibility. Yes:
B
This is actually a tricky question. So, number one, I believe yes; number two, I am not entirely sure which ones there are; and most importantly, I am not entirely sure about the rating behavior: what part of the rating behavior is actually part of the contract that the scheduler gives you, and what part is basically inside the scheduler and may change without notice from release to release. I have not uncovered that yet, but I do believe: yes, the scheduler does have a tendency to spread out.
G
B
Now, in the second step, the scheduler loops over the individual nodes and establishes the ranking for the node and the pod, and then will select a node from the set of nodes that scored the highest. And, if I'm not entirely mistaken, the contracted behavior is: if you have a set of nodes with the same ranking, like, for example, four or five nodes with the same ranking, the contracted behavior is that one is randomly chosen. I do not think you can...
B
So when we come to priorities, we actually also venture into a topic that we did not include in this presentation, and that is that the scheduler actually has the chance of preempting pods. Sometimes there is somewhat of a confusion, and I'm not entirely sure I cleared that all up, but there is some kind of confusion between eviction and preemption. So eviction is defined as the termination of a pod by the kubelet; it would do so under pressure. Take all of this with a grain of salt.
B
So if a pod has a higher priority but it cannot be scheduled, the pod will, I'm sorry, the scheduler will examine the workload across the cluster, may find pods with a lower priority, and if the scheduler determines that, if it terminates these pods, then the pod with the higher priority can actually be scheduled, it will preempt the pods. Now, preemption in this case, I believe, means that the scheduler sets the deletion timestamp of the pod.
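The preemption decision just described can be sketched as follows. Take this with the same grain of salt: it is a hypothetical, heavily simplified illustration, and the fits() argument stands in for real resource checks. If a pending pod fits nowhere, look for a node where terminating strictly lower-priority pods would make it fit.

```python
def find_preemption(pending, nodes, fits):
    for node in nodes:
        victims = [p for p in node["pods"] if p["priority"] < pending["priority"]]
        survivors = [p for p in node["pods"] if p["priority"] >= pending["priority"]]
        if victims and fits(pending, node, survivors):
            # Preempting means marking the victims for deletion (setting
            # their deletion timestamps), freeing room for the pending pod.
            return node, victims
    return None, []
```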
A
D
B
If you don't mind me asking you a question: what do you think about this style of modeling, this style of communicating about Kubernetes and about its individual components or functionality? If you have any thoughts, if you have any feedback on this style of modeling, I am super happy to hear it. It doesn't have to be now; we're going to stick around for a few minutes, so if you want to come around back, that would be cool. Thank you very much.
A
Yeah, related to that: currently the documentation is a bunch of descriptions, and you kind of have to read it all and piece it together yourself. So the idea is to have a much more rigorous way to describe the behavior, and this is why we're using this modeling approach, which will include these formulas as well as diagrams. So please, yeah, give us some feedback on whether this is a much better approach in terms of explaining how things work. Thank you.
B
If you're curious, it is a formal specification language that is called PlusCal, which translates to TLA+. That is a language designed by Leslie Lamport; it's a formal specification language, and it comes with a model checker, so you can actually check if your statements and invariants hold.
B
So, when it comes to that, it is part of what is called invariant-based design. You have an invariant about the state and the state transitions in your mind when you design an algorithm, and then you specify the algorithm and you check if your invariants actually hold true. So for this one, I previously had an invariant.
B
That said: if I start with a set of pods from the very beginning, then after the Kubernetes scheduler reaches a steady state, that is, there is nothing else to do, if there is a set of nodes that can host the pods, eventually every pod will be assigned to a node. I am not entirely sure if the way I modeled the Kubernetes scheduler is incorrect, or if this is actually part of the Kubernetes scheduler, but this is not a guarantee the Kubernetes scheduler gives you.
B
The model checker showed clearly that it can run into situations where, even though with a holistic view of the system every pod would fit on a node, it basically starves itself, because it made a few bad decisions in the beginning. And it was clear after that that I had to relax the invariant. I am fairly certain that this is how Kubernetes works, but I need to have it peer-reviewed. But I had to relax this invariant, and so Kubernetes does not give me any guarantee.
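A small, self-contained example illustrates the kind of starvation the model checker surfaced. This is a hypothetical toy, not the actual TLA+ model: a greedy one-pod-at-a-time scheduler can strand a pod even though a holistic assignment of all pods exists.

```python
def greedy_schedule(pods, capacity):
    """First-fit: place each pod on the first node with enough free space."""
    free = dict(capacity)
    placed = {}
    for name, size in pods:
        for node in free:
            if size <= free[node]:
                free[node] -= size
                placed[name] = node
                break
    return placed

# Two nodes with capacity 3 each. A holistic packing exists: {a, c} on one
# node (1 + 2) and {b, d} on the other (1 + 2). Greedy first-fit instead
# puts both small pods on n1 and strands d.
pods = [("a", 1), ("b", 1), ("c", 2), ("d", 2)]
placed = greedy_schedule(pods, {"n1": 3, "n2": 3})
```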
A
G
A
B
Correct. So, overall, no question about it: these processes, or these activities, are asynchronous. So you would either have to put a review process, like you just mentioned, into place, or, and this is also the reason why I usually strongly emphasize what is part of the contract: because as soon as it's part of the contract, I'm going to include it in the model and I'm going to include it in the invariants.
B
If it's not part of the contract, I will formulate it so that the model checker can actually choose randomly. Like, for example, when I said the highest-ranking pod: it will choose randomly. Of course, there is no random in Kubernetes, at least I didn't find one; I think it always takes the first one in the list or something like that, but for the sense of the model, this is now random.
B
So, as long as the contract stays in place, the model adheres to it, and if the contract changes, you have to change the model. Yes. If you are actually interested in this detail, I'll give a little bit of detail: there is something that is called model refinement. So you model behavior on a very, very high level, and then, step by step, you add refinements, and also prove that each individual step on a lower level is actually a behavior that is allowed on a higher level.
B
So
the
higher
level
is
the
most
abstract
one
right
where
you
could
go
so
far
and
just
say
the
invariant,
a
pod
shall
find
a
node
right
and
you
go
further
and
further
down
and
add
more
information
in
that
case
also
the
higher
your
abstraction.
If
is
the
more
its
longevity
right,
the
further
you
go
down
and
come
closer
and
closer
to
the
implementation.
Then
it
is
really
tied
to
the
lifecycle
of
this
component.
C
So I, for one, have found your presentation very helpful, so thank you, and I think it should be extended to other components too, for example, like the kube-proxy or the kubelet. But I totally agree with the gentleman about keeping these in sync and making sure that, you know, the formulas are there, put in the documentation.
H
The current scheduler architecture schedules pods one by one, right? But in other scenarios, like big data or machine learning, it's better to schedule all the pods in a group. So what's the recommendation for this use case? I'm not sure: will the next scheduler version cover those use cases, or is it better to extend the scheduler ourselves?
B
I am actually, unfortunately, not qualified to speak to that, so I strictly limit myself to modeling how the scheduler works right now. There is a very interesting detail to your question that, since I don't want to kill everybody's time, we can talk about after this presentation, but yeah.