From YouTube: SIG Cluster Lifecycle - Cluster API Deep Dive into Cluster Autoscaler Node Group Balancing 20220912
Description
In this video we take a deep dive into how the Kubernetes Cluster Autoscaler's balance-similar-node-groups feature works with the Cluster API cloud provider implementation.
A: So, to begin with, I'm going to share my screen and we'll take a look at the FAQ for the autoscaler. This is generally where I direct people first when they have questions about the autoscaler. If anybody has questions during the meeting today, feel free to raise your hand or interrupt me or whatever; I'm not going to stand on too much ceremony.
A: Anyway, one of the questions in the FAQ is about running nodes in multiple zones for HA purposes, and it talks about a flag called `--balance-similar-node-groups` that was introduced in the autoscaler quite a while ago. This feature is something we get asked about with pretty regular frequency at Red Hat; we have many users who seem to like using it.
A: I wouldn't say that everybody uses it, but it's probably one of those things where 40 to 50 percent of users ask about it. What this feature allows you to do is tell the cluster autoscaler that, when it is attempting to scale up (it only does this during scale-ups), it should try to create nodes across node groups that are similar to each other.
A: When the cluster autoscaler is attempting to grow the cluster, one of the first things it does is run a sort of bin-packing algorithm to figure out what the best node type for expansion would be. It does that based on the pods that are pending, and then it looks at the capacities of the different nodes that are available. It also looks at things like node selector labels, in case the pending pods have special requirements, and then the autoscaler makes a choice based on that.
A: If you have this flag enabled, the autoscaler will attempt to find similar node groups when it wants to expand by more than one node. So, what are node groups, and what does similarity between them mean? Internal to the cluster autoscaler is a concept it calls a node group, and this is not exposed in a CRD.
A: All the nodes within a node group will be the same when you create them, meaning they'll all have the same topology and are all expected to have the same labels and taints on them. So when the autoscaler looks at all the node groups it has, it can make decisions about them, and every provider has a different notion of how to define node groups.
A: So necessarily our node groups have to be more abstract, and this is where I find working with Cluster API and the autoscaler to be kind of easy, because in general Cluster API MachineDeployments or MachineSets, if you're using them correctly, map one-to-one with node groups in the cluster autoscaler. If I have two MachineDeployments and both of them are doing autoscaling, then I can assume that each one of those is a node group.
A
So
there's
a
couple
primary
tests
that
the
autoscaler
looks
at
when
it
tries
to
determine
what
are
similar
node
groups.
The
first
is
are
the
capacities,
so
this
is
like
the
cpu,
the
memory
capacity.
In
the
case
of
special
resources,
there
might
be
a
gpus
or
you
know,
networking
special
networking
cards
that
have
been
added
and
it
will
use
those
to
try
and
compare
the
node
group
yeah
fabrizio.
You
had
a
question:
go
ahead.
C: The story is, we are deploying the autoscaler, which in the end is a new controller with its own stuff, and the flag you were talking about is a flag that goes in the autoscaler deployment.
C: Okay, maybe I'm asking a really silly question now. You're talking about what a node group is in Cluster API; the missing bit for me is how I would declare that a MachineDeployment is an autoscaler node group.
A: Yeah, that's a good question, and you're right at some level. I'm going to go back to sharing because I'll show you where this is in the code. The autoscaler is kind of like a controller, and in the case of Cluster API it's just reconciling records.
A: So I'm in the autoscaler repo, looking at cluster-autoscaler/cloudprovider/clusterapi, which is where all of our documentation is. When you add the scaling annotations to your MachineDeployments or MachineSets, that is how the cluster autoscaler identifies those node groups. It is reconciling all MachineDeployments and all MachineSets, and when it sees these annotations present, it knows to include that node group in autoscaling.
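As a reference for the annotations being described, opting a MachineDeployment into autoscaling looks roughly like this (annotation keys as documented in the clusterapi provider README; the resource name and size bounds are illustrative):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: workers-zone-a   # illustrative name
  annotations:
    # the presence of these two annotations marks this object as a node
    # group that the cluster autoscaler may scale between the given bounds
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "1"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "10"
```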
D: One question: the capacity information that you just mentioned, which it needs in order to get the attributes for a node group, does that also come from the capacity annotations, like the ones directly below?
A: They can also come from the status capacity block of the infrastructure reference. If the autoscaler sees an infrastructure reference, it will attempt to get that record and then look to see if there's a capacity block. If there is a capacity block, it will use that; if there are annotations, it will use the annotations as an override.
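A rough sketch of the capacity override annotations being described (keys per the clusterapi provider's scale-from-zero documentation; the values here are illustrative and, as discussed, take precedence over the infrastructure reference's status capacity block):

```yaml
metadata:
  annotations:
    # describe node capacity when no status.capacity block is available,
    # or override it when one is
    capacity.cluster-autoscaler.kubernetes.io/cpu: "4"
    capacity.cluster-autoscaler.kubernetes.io/memory: "16G"
```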
B: Yeah, one query: if I have a cluster deployment with two node groups in two different zones, would I be able to identify that a pod is pending in a particular zone so that the cluster autoscaler only scales out that zone? The MachineDeployment definition could be the same across the two zones, so how is that identification carried forward to the autoscaler and then applied to the right node group?
A: Yeah, that is a great question, Abigail, and I am just getting to labels, because labels are the other part of how the cluster autoscaler attempts to match workloads, or attempts to grow the cluster. On the pending pods there may be labels that are needed, and those labels can be satisfied by a node.
A
The
cluster,
auto
scaler
will
attempt
to
make
those,
but
in
most
cases
the
zone
labels
are
not
necessarily
required
by
the
pods
right
like
it's
it's
it's
usually
rare
that
a
user
would
create
a
pot
and
say
I
need
this
pod
to
be
running
in
this
zone.
It's
certainly
possible
that
they
could
do
that,
but
they
don't
necessarily
need
to.
So
there
are
what
are
called
the
you
know:
the
well-known
zone
label
in
kubernetes
and
when
the
cluster
auto
scaler
starts
getting
into
looking
at
balancing
node
groups.
A
There
is
a
piece
of
code
that
is
called
the
node
group
set
processor,
and
this
is
I'm
getting
deep
into
how
the
auto
scaler
looks
at
these
things
now,
but
there's
a
group
of
what
are
called
processors
and
these
processors
customize.
The
behavior
by
which
the
cluster
auto
scaler
can
look
at
things
like
node
groups
and
node
info,
and
these
this
node
group
set
processors
are
what
the
auto
scaler
uses
to
do.
A
Some
of
these
deep
comparisons
right
so
in
the
case
of
zone
labels
right
oftentimes,
when
you're
attempting
to
balance
the
nodes
between
them,
you
don't
actually
care
about
the
zone
labels
right,
but
the
auto
scaler
will
consider
two
nodes
different
if
their
labels
don't
match
right,
and
so,
if,
if
you
have
one
node,
that's
deployed
in
zone
a
and
another
node,
that's
deployed
in
zone
b,
the
auto
scaler
considers
those
to
be
disparate
node
group.
A: This is a function used by the autoscaler when it attempts to do balancing operations; it uses it to figure out which labels it should ignore. This gets at the crux of the discussion we were having a few weeks ago at the Cluster API meeting, because one of the things I've been doing is going in here and adding more zone labels, or more labels that it should be aware of. For example, this one here, topology.ebs.csi.aws.com.
A: But we need those nodes to be considered the same; we need the autoscaler to ignore this label for the purposes of deciding whether they're the same. The scheduler will handle the persistent volume requirements, but these labels would cause the nodes to look different. Similarly, there's a group of what are called the basic ignored labels, and you see these used over and over again.
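The comparison being described can be sketched as follows. This is a simplified illustration of the idea, not the autoscaler's actual NodeGroupSet processor code, and the ignored-label set shown is a hypothetical subset:

```go
package main

import "fmt"

// ignoredLabels is a hypothetical subset of the labels stripped before
// comparing node groups; the real list lives in the cluster-autoscaler
// code base (plus any labels added via --balancing-ignore-label).
var ignoredLabels = map[string]bool{
	"kubernetes.io/hostname":        true,
	"topology.kubernetes.io/zone":   true,
	"topology.ebs.csi.aws.com/zone": true,
}

// labelsMatch reports whether two nodes' label sets are equal once the
// ignored labels are removed, which is the essence of how node groups
// are judged "similar" for balancing purposes.
func labelsMatch(a, b map[string]string) bool {
	filter := func(in map[string]string) map[string]string {
		out := map[string]string{}
		for k, v := range in {
			if !ignoredLabels[k] {
				out[k] = v
			}
		}
		return out
	}
	fa, fb := filter(a), filter(b)
	if len(fa) != len(fb) {
		return false
	}
	for k, v := range fa {
		if fb[k] != v {
			return false
		}
	}
	return true
}

func main() {
	zoneA := map[string]string{"node-role": "worker", "topology.kubernetes.io/zone": "us-east-1a"}
	zoneB := map[string]string{"node-role": "worker", "topology.kubernetes.io/zone": "us-east-1b"}
	// the zone labels differ, but they are ignored, so the groups match
	fmt.Println(labelsMatch(zoneA, zoneB)) // prints "true"
}
```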
D: That means if someone adds some labels to a node in some way, you can't know about them, because there could be a totally different reason why that label is there.
A: Right, you're jumping ahead a little bit, and I appreciate it because it lets me talk about this next. I'll do a little diversion here and then we'll go back. You're absolutely right, and there is another flag here called `--balancing-ignore-label`.
A
So
if
you
have,
if
you
have
a
situation
where
you
have
users
who
are
customizing
the
labels
on
their
node
groups,
maybe
they
have
some
special
information
that
they
want
to
keep
or
you
know,
there's
some
reason
to
demarcate
two
node
two
node
groups
is
different
from
each
other,
but
you
want
to
make
them
the
same
in
terms
of
balancing.
You
can
use
this
flag
multiple
times
when
starting
the
cluster
auto
scaler,
to
give
it
different
labels
that
it
should
specifically
ignore.
A
So,
even
if
the
labels
aren't
being
automatically
ignored
by
the
by
the
node
group
set
processors,
you
could
still
inject
labels
that
you
would
like
them
to
ignore,
and
similarly
you
can
also
define
a
balancing
label
that
you
should
use.
This
is
a
feature
that
was
added
more
recently,
so
you
can
say
I
want
any
node
groups
that
have
this
label
to
be
considered
the
same,
regardless
of
what
else,
what
other
labels
are
there?
A
And
so
this
was
a
feature
that
was
added
recently
that
when
you're,
using
balancing
similar
node
group
sets,
you
could
set
this
balancing
label
and
then
the
the
cluster
auto
scaler
will
ignore
every
other
label
and
only
look
for
those
labels
to
make
similarities
so
that
that's
kind
of
a
little
bit
of
a
side
look
here
into
how
you
could
you
could
supplement
this
on
your
own.
If
you
need
to.
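Putting the flags from this discussion together, an autoscaler deployment's container args might look something like this (flag names as discussed above; the label keys are hypothetical examples):

```yaml
# container args for a cluster-autoscaler deployment (illustrative)
command:
  - /cluster-autoscaler
  - --cloud-provider=clusterapi
  - --balance-similar-node-groups=true
  # repeatable: additional labels to ignore when comparing node groups
  - --balancing-ignore-label=example.com/team
  - --balancing-ignore-label=example.com/cost-center
  # or instead: compare node groups using ONLY this label
  # - --balancing-label=example.com/balance-group
```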
D: One follow-up question based on that: one MachineDeployment is always one node group in the autoscaler?
A: Yeah, that could happen. You could do that, and likewise you might have different reasons to use these balancing labels. Imagine you had four MachineDeployments in your cluster, and two of the MachineDeployments had label A and two had label B; you might set this balancing label twice to designate that you have two different topological groups, and you want work balanced within whichever group the work is going to.
C: If I got it right: so, let me say, in the standard behavior the autoscaler balances between node groups, but if I enable this balance-similar-node-groups flag I can then play with these other flags.
A: Yeah, you absolutely could. Think about topologies where people have one group of nodes for machine-learning-related stuff, maybe with GPUs on it, and then another group for their database stuff that has high-speed storage or something. If they wanted to make both of those groups highly available, they could make two MachineDeployments for each group, set the MachineDeployments in different zones, and then have workloads flow to those MachineDeployments.
B: Just one question: you just showed a list of the labels which are ignored, not taken into account. Do we still have to maintain that list, given we already have the other way of passing this through the API? Is that list still required to be maintained?
A
So
that
is
a
good
question
abhijit
and
it
kind
of
brings
us
to
the
where,
where,
where
I'd
like
to
end
up
here
so
yeah
like
let's,
if
I
go
back
to
the
code,
we're
looking
at
here,
do
we
need
to
maintain
this
list?
That's
a
great
question,
because
I
you
know
one
of
the
things
I
have
up
right
now
is
I
have
a
pr
open
to
make
to
try
to
make
this
better.
A
So
you
know
one
of
the
one
of
the
things
that
I
would
like
us
to
do
is.
I
would
like
us
to
maintain
some
of
these
labels,
especially
the
labels
that
other
providers
have
called
out
as
being
important
because
it
it
makes
the
experience
better
for
our
users
automatically
out
of
the
box.
So
if
they,
when
I'm
trying
to
here
so
here's
the
list
of
labels
that
I'm
trying
to
update
currently
into
our
into
our
you
know,
hey
jack.
A
These
are
the
labels
that
I
think
we
should
incorporate
and
I'm
trying
to
bring
them
in
from
all
the
different
clouds
that
we
cover.
These
are
the
automated
labels
that
they
use,
and
I'm
kind
of
describing
many
of
them
are
used
by
csi
drivers
or
they're
used
by
cloud
controller
managers
and
they're
just
used
to
tell
like
this.
Node
is
in
this
zone,
or
this
node
is
in
that
zone.
A
So,
in
general,
the
reaction
from
the
auto
scaler
community
has
been
to
ignore
these
labels
for
the
well-known
like
zone
labels
and
whatnot,
and
so
I've
been
bringing
them
in
to
try
and
create
a
list,
and
you
know
make
sure
that
when
users
use
the
cluster
api
provider
for
cluster
auto
scaler,
they
kind
of
have
this
awesome
experience
out
of
the
box,
but
abigail.
I
think
your
question
is
absolutely
pointing
it
like.
A
We
could
decide
not
to
maintain
these
labels
for
cluster
api
and
then
it
would
just
be
up
to
the
user
to
set
the
balancing
ignore
labels
whenever
they
were
trying
to
use
this.
So
you
know
really-
and
we've
only
got
about
seven
more
minutes
left
in
the
scheduled
time
slot.
This
is
like
a
good
place
to
be
in
terms
of
the
conversation
like
you
know,
should
we
be
maintaining
these
labels?
Is
the
is
this
something
that
we
want
to
do
as
a
group?
A
You
know
to
try
and
make
the
experience
more
seamless
for
cluster
api
users,
or
should
we
just
document
here's
how
you
ignore
these
labels?
Here's
how
you
configure
it,
because
certainly
it
will
be
much
easier
for
us
to
maintain
if
we
don't
have
to
look
at
these
labels
in
the
node
group
processor,
but
that's
at
the
cost
of
making
this
more
complicated
for
our
users.
D: I wonder if an allowlist approach plus documentation would be easier than the other way around. What I don't know is: which labels do we actually care about?
D: Even if it wasn't, it's kind of the same problem just the other way around, but maybe there are not that many labels that we have to care about. Maybe.
A: Yeah, that's kind of up to the users. Users will maybe label some MachineDeployment, "this is my super special deployment," so that when they create workloads, they all have to go to the super special deployment. In general, it seems like this is what the autoscaler community has tried to do, and it was easier for other clouds because, for example on AWS, they only have a few labels they want to maintain.
E: My vote: I don't think the downside of maintaining a set is huge; I think it's just really predictable that it will almost always be incomplete. I'm leaning toward the argument that that's sort of misleading to the user. By maintaining any set at all, you suggest that we've got this under control, but I think there's very little confidence that we will have this under control for any particular user.
E: I would have a slight preference towards asking the user to do that work, as a way of actually producing a confident contract. Otherwise it's just, yeah, that's just my view.
A
This
does
not
necessarily
need
to
be
in
the
code
for
the
auto
scaler,
either
right
like
if
we
develop
helm,
charts
or
you
know,
cluster
class
add-ons
or
whatever
that
deploy
the
auto
scaler
for
you
like
you,
we
could
certainly
encode
this
knowledge
in
those
places
as
well
like
when,
when
starting
the
auto
scaler
this
you
know
the
reason
this
came
up
for
us
is
that
at
red
hat
we
do
a
lot
of
automated
testing
of
the
cluster
auto
scaler,
and
we
do
have
a
series
of
tests
that
we
do
with
the
balance
or
node
group
stuff
and
so
like,
as
we're
now
deploying
kubernetes
with
ccms
we're
starting
to
see
another
explosion
of
these
like
labels
that
the
the
ccm's
are
using
to
designate
some
sort
of
zonal
differences
right,
they
could
be
ignored
and
that's
the
symptom.
C: May I ask a question: so basically the problem arises only if the user opts in to this cross-node-group balancing, okay. My comment is that I see a problem, or at least a risk, in the current autoscaler behavior of considering all the labels, because if I have a cluster perfectly configured and someone goes and applies a label on my node, that basically screws up my balancing. That is the biggest risk.
C
So
starting
from
this
is
the
bigger
risk,
because
I'm
let
me
say
what
I
see
and
I'm
the
administrator
I
want
to
set
up
on
the
scatter
and
be
confident
that
my
setup
works
from
that
moment
on
so
the
behavior
that
the
autoscarer
considered
all
I
see
it
risky.
I
think
that
the
recommendation
that
we
have
to
do
to
give
to
our
customer
pi
user
is
that
if
you
want
to
obtain
in
this
feature,
you
have
to
pick
up
also
the
label
that
drives
this
behavior
and
set
it
explicitly.
A: I think so. Let me just repeat it to make sure I'm following you. You're saying the preference would be to make the user explicitly aware of the labels they need to know about, so that they can be absolutely assured of what the autoscaler is doing and that there's no magic going on, and that we should fully document this and show people: if you're working on this cloud, use these; if you're working on that cloud, use those.
D: One question about the list that you currently have on that PR: how much controversial opinion is already in that list? Something like not caring about zones, I guess, is something that different people might see differently, whether a different zone is already a different node group. I don't really know.
D: Would it work for everyone, or is it more like it's not very likely that it will work for a lot of people, because they just have different opinions and everyone does it differently, etc.? That would mean it's hard to add something hardcoded that is the best way to do it for everyone.
A
Well
right
so
like
where
this
started
is
that
in
kubernetes
you
know,
there's
like
a
set
of
well-known
labels
and
annotations,
and
one
of
them
is
the
well-known
zone
label,
and
so
the
well-known
zone
label
is
one
of
the
main
things
that
the
auto
scaler
ignores
right,
and
so
it
ignores
that
by
default
right,
unless
you
turn
on
the
the
flag
that
allows
it
to
only
look
at
a
certain
label
or
specific
labels.
A
So,
right
now
there
are
some
of
these
that
the
auto
scaler
is
already
ignoring
ones
that
the
community
has
agreed
on.
Like
you
know,
topology.kuber
zone,
that's
like
that.
That's
well
known!
The
community
has
agreed
to
ignore
that
some
of
the
other
ones
like,
for
example,
the
csi
drivers
and
I'll
just
note
we're
over
time
here.
So
if
anybody
has
to
drop,
you
know,
please
don't
feel
bashful.
A
Some
of
the
other
things
like
the
csi
drivers
right
when
csi
was
first
being
defined
as
a
spec.
The
various
driver
implement
implement
implementers
had
not
yet
agreed
on
the
well-known
zone
labels
right.
A: We would always have to be maintaining this list, so I think the wisdom there, from what I'm hearing, is that it's probably better to teach our users how this works, how they can enable it, and when they should enable it, and then we don't have to worry so much about having the autoscaler do this stuff for us automatically.
A: Okay, since we're over time, does anybody else have any comments before we wrap things up here? Oh, I see, well, Stefan and Fabio, you had your hands up. Did you still have something you wanted to say? Fabrizio? Okay, cool. Thanks, everybody, for your time.