From YouTube: SIG - Performance and scale 2021-09-30
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.qs7aweajr18k
A
Okay, welcome to SIG scale, September 30th. I linked the doc in the chat. Please add yourself as an attendee.
A
Okay, let's start with the first item: add all the scripts and configurations in the upstream CI/CD system to configure the perf cluster. So let's look at this.
B
Yeah, so this is to configure the performance cluster in the CI/CD system. Well, I did many tests just to verify the performance. Now I want to include the system that I was testing in the CI/CD system, and then we will have all the performance jobs that we discussed before running there. It's merged now.
B
But it's still missing some, you know, some jobs to create the cluster and do that, so I'm doing that with Federico, who's responsible for the CI/CD system. So hopefully we'll get this cluster, you know, ready soon.
A
What does this mean, like, the SIG scale cluster? Is this a dedicated job or cluster, or something that's run for things that we want to test? Can you describe this some more?
B
However, it's shared with many jobs, and I would say it's kind of impossible to have any performance job there. So we have another cluster that will run the performance tests, and the tests should be isolated; they will not be collocated with any job. We're running bare metal for the functional tests. Actually, the way that it works: it creates a VM, installs Kubernetes inside, creates a cluster, so it uses nested virtualization. So the functional tests are not regarding performance, okay, they're just, you know, functional tests.
B
So we have the dedicated cluster on bare-metal nodes that will run the performance jobs that we were discussing, that we are planning. So yeah.
A
Cool, okay, pretty cool. So eventually, once we have this cluster, we can start doing a little bit of, you know, generating some baselines, some thresholds. This is basically where we'll put a lot of those things, to test all our performance things. Okay.
C
Yep, okay, cool. All right, is that all you had there, Marcelo?
B
Yeah, just something that I want to mention: I will be on PTO from October 1st to the 30th, so the whole of October. Okay! So just to mention that to you guys.
A
Okay, we'll miss you. All right, thanks, Marcelo. Okay, let's go to the next one: memory usage of cluster-profiler on large clusters. Let's take a look.
D
Yeah, so that's me. Hi guys. So I've experimented with cluster-profiler, which was recently merged, and I had some problems with running it on large clusters. I noticed that there's nothing wrong with the start and stop request logic; they're simply broadcast. But there's a dump request, which basically works like this: virt-api gathers in memory all of the profiles from each of the KubeVirt pods, right? And I noticed that a single pod produces around nine megabytes of profiler results.
D
So I have some thoughts on how to fix this, but I was just thinking, if you guys have any input on this, how would you like to approach it? Especially David, who was the author of the original solution.
E
Yeah, this is really interesting. Sorry, there's a ton of background noise; I don't know how distracting that is to you all. Someone is chainsawing a tree behind me. I don't have any immediate thoughts other than to minimize the number of nodes that we profile at a time. So you can have some sort of selector, just to make sure we don't get all the virt-handlers.
E
For example, that's probably the one that's causing the most problems. And maybe there are some optimizations to how we gather and dump these; it doesn't require it all to be stored in memory, yeah. What were your thoughts? Because it sounds like you actually have an environment where you can do this sort of thing.
D
Yeah, so yes, I was thinking that we don't actually need virt-api to gather all of the results in memory. We can just change a bit the API by which we fetch the profiler results, right? So from the client side, we can query the API one pod at a time, for instance. Or we could change the dump request so that the KubeVirt pods actually dump the results into their volumes, and then have a client traverse all of the volumes and, you know, copy the results into the client's own file system. So these are like my two initial ideas for how we could solve it.
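To make the first idea concrete, here is a minimal sketch of a client that pulls one pod's profile at a time and streams it straight to disk, so no aggregate ever sits in memory. The per-pod `/dump?pod=` endpoint, the base URL, and the pod names are assumptions for illustration, not the real cluster-profiler API.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// fetchProfile streams a single pod's pprof dump to a local file.
// The URL scheme here is a placeholder, not the actual cluster-profiler API.
func fetchProfile(baseURL, pod, outDir string) error {
	resp, err := http.Get(fmt.Sprintf("%s/dump?pod=%s", baseURL, pod))
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	out, err := os.Create(filepath.Join(outDir, pod+".pprof"))
	if err != nil {
		return err
	}
	defer out.Close()

	// io.Copy streams the response body, so only a small buffer is ever
	// held in memory regardless of how large the profile is.
	_, err = io.Copy(out, resp.Body)
	return err
}

func main() {
	// In practice this list would come from the API server.
	pods := []string{"virt-api-0", "virt-handler-0"}
	for _, pod := range pods {
		if err := fetchProfile("http://localhost:8080", pod, "."); err != nil {
			fmt.Fprintf(os.Stderr, "skipping %s: %v\n", pod, err)
		}
	}
}
```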
D
Basically,
just
just
remove
this
cluster
provider
results
extract
which,
which
is
present,
because,
as
I
mentioned
there,
I
think
it
like
won't
fit
into
memory
for
a
large
large
scale
yeah.
So
if
you
guys
don't
have
any
like
strong
opinions
on
this,
I
guess
I
just
try
to
propose
my
implementation
like.
I
think
it
won't
be
a
big
change,
and
I
just
just
see
for
any
good
diamond.
E
I
think
the
biggest
thing
I'd
like
to
preserve
is
the
ability
to
retrieve
this
information
without
having
to
be
inside
of
the
cluster,
so
just
using
standard
cube
control
over
ctl
or
qctl
or
ctl
tooling,
outside
of
the
cluster,
because
if
we,
if
we
dump
it
to
a
volume
or
something
like
that,
then
somehow
we
have
to
get
that
information
out
onto
our
local,
like
laptop,
for
example,
or
wherever
we're
going
to
be
analyzing
this,
and
if
we're
not
doing
it
through
the
api
server,
then
we
have
to
deal
with
ingress.
Somehow.
D
Yeah, yeah, okay, that makes sense. I remember that.
E
What I mean is, how useful is, for example, all the virt-handler information? Is it just as simple as adding a flag that somehow allows or denies lists of which nodes you want to collect from, or whether you want control plane only? I guess, is there a way to narrow down the amount of information that we retrieve, in a way that prevents having to send so much information? Are you actually looking at every handler's results, or do you just want one, for example?
D
Yes. So if you're talking about a scale of hundreds of nodes, it probably doesn't matter to virt-handler how many nodes are in the cluster, because it just looks at its own environment, which is a node. It only makes a difference for, let's say, virt-api or virt-controller how many nodes there are in a cluster. So I guess it makes sense as well to add a selector for only control-plane nodes.
D
So,
if
you
have,
if
you
have
such
a
large
cluster,
then
then
profiling
build
api
and
weird
controller.
That's
useful,
but
maybe
looking
into
every
weird
handler
profiler
results.
It's
it
doesn't.
It
doesn't
do
much
difference
from
the
from
the
cluster
of
size
like
few
nodes
right
that.
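Along those lines, the selector could be as simple as listing only the pods whose behavior actually depends on cluster size before sending the dump request. A minimal client-go sketch follows; the namespace and label values are assumptions based on how KubeVirt components are commonly labeled.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Profile only virt-api and virt-controller instead of broadcasting
	// the dump request to every virt-handler on every node.
	pods, err := client.CoreV1().Pods("kubevirt").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "kubevirt.io in (virt-api, virt-controller)",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Println("would profile:", p.Name)
	}
}
```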
E
That would be my expectation, yeah; it would be smaller. I don't know. But these are all really great thoughts. I'm glad you noticed this, because I did not notice that it was that much information. Yeah, that could be pretty bad. So I guess virt-api is just going to swell in the amount of memory it consumes, and then, memory is weird like that, where once it grows it might not really shrink how much it's using. It just forever looks like it consumes a ton of memory.
D
Yeah, yeah, that's the case. It just tries to gather everything in memory, and only once it has gathered everything does it return the results to clients. So.
E
Maybe there's a way to do a zero-copy transfer of that information as well; I don't know how, because the way I structured it, I have that aggregate structure, the cluster profiler results, that stores everything. So it might not be very practical either. Also, what's the biggest result? There are lots of different results that are returned; is there one? Is it the memory profile that's the biggest, if it's just the memory, like the heap?
D
Yeah, yeah, that's true, but at some point someone might want to use heap and allocs and whatnot. So I guess, having a solution which has all of the profiles and then doesn't use that much memory, it's like feasible. So let's try it, and maybe have the selectors on the type of the pod and the type of the profile, somewhat optional, let's say.
D
So you see, you have a map keyed by each of the pods, and as a value just the results of the profiling. So if you iterate over 500 pods or even more, then, you know, when you're reaching the second half of the pods, you get out-of-memory errors.
B
You know, like, you get the first 10 nodes, then you get from the other nodes, something like that, instead of getting everything at the same time. Of course, it will be a different timestamp that you are going to measure, but it's a way that maybe you can just, you know, have less information in one dump. So.
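A sketch of that batching idea, with an assumed dumpBatch callback standing in for whatever actually issues the dump request: profiles are gathered a few pods at a time, so any single request stays small, at the cost of batches carrying slightly different timestamps.

```go
package main

import "fmt"

const batchSize = 10

// dumpInBatches walks the pod list in fixed-size windows so that no single
// dump request has to aggregate results for the whole cluster at once.
func dumpInBatches(pods []string, dumpBatch func([]string) error) error {
	for start := 0; start < len(pods); start += batchSize {
		end := start + batchSize
		if end > len(pods) {
			end = len(pods)
		}
		if err := dumpBatch(pods[start:end]); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	pods := make([]string, 25)
	for i := range pods {
		pods[i] = fmt.Sprintf("pod-%d", i)
	}
	_ = dumpInBatches(pods, func(batch []string) error {
		fmt.Println("dumping", len(batch), "pods:", batch)
		return nil
	})
}
```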
D
Yeah
yeah,
that
makes
sense.
I
think
you
would
just
have
to
agree
on
a
subset
of
filters
which,
which
we
should
implement,
because
you
know
we
have.
We
can't
filter
from
the
by
the
node
name,
but
the
number
of
nodes
type
of
like
weird
spot,
but
that's
just
maybe
too
much
work
too
much
effort,
which
eventually
could
be
reachable
by
some
simpler,
simpler
filter.
E
My
advice
here
is
to
pick
the
filter
that
makes
most
sense
for
your
use
case,
so
a
selection
of
nodes
that
only
profile
cuber
components
that
live
on
these
set
of
nodes
works,
for
you
then
implement
that
if
you
just
want
to
profile
specific
instances
like
only
cluster
controllers
or
cluster
controllers
plus,
I
don't
know
this
vert
handler
one
bird
handler,
I
don't
know
or
if
you
want
to
limit
the
amount
of
information
you
give
back
to
say,
only
get
back,
cpu
results
and
not
like
the
heap
and
alec.
E
You have a lot of say in what's best.
D
Yeah, okay, so I'll have to think about it, and I'll just propose something.
E
Sounds great, yeah, and thanks for bringing this up. I'm glad that you are starting to use this. Did you get any results that were actually useful? I'm curious if you were able to act on any of this, or is it still more experimental: trying to get back some cluster profiles and see what's useful in them?
D
Not yet. So I gathered a few profiles, but I have actually spent some time recently trying to use go tool pprof, this Go tool which helps you visualize the profiles, and Go is complaining about the binary missing, so it can't actually correctly interpret the flow of the graph, you know, which function calls which, how deep. For instance, where the memory heap grows, where the allocs are, and so on. So I'll have to work on this and see if that's actually a problem with, I don't know, maybe my setup, maybe my cluster, or maybe that's something which needs to be done on the KubeVirt side, on the profiler.
E
Does it execute at all for you, or does it just say you can't read the results?
D
It
says
it
says:
main
binary
is
missing
and
the
graph
is
just
a
one
note
graph
or
like
something
like
this
so
yeah.
But
I
just
try
a
few
different
things
and
because,
if
you,
if
you're
saying
that
yeah
it
used
to
work
for
you,
but
maybe
that's
something
wrong
with
my
setup.
So.
A
Cool, thanks so much. It would be really cool, yeah, to see some of those graphs; that would put together a bunch of cool images, and I'm sure we'll learn a lot from that. Okay, cool, thanks, Tomas. I wrote it as the item here, with the action item that we can go on: it's "add a filter to limit the number of nodes/pods that we can gather info from", okay. Next, let's go to the VM pool discussion from David.
E
Yeah,
so
I
don't
want
to
spend
the
whole
time
on
this.
Like
I
did
last
time,
I
wanted
to
make
you
all
aware
of
a
couple
of
changes
and
a
couple
of
things
that
I'm
thinking
about
as
kind
of
we're
getting
close
to
finalizing
this
design,
but
first
off,
can
you
guys
hear
me:
is
this
like
background
noise?
It's
so
distracting
that
it's,
I
need
to
read.
Yeah.
E
It's
driving
me
nuts,
there's
somebody
with
a
chainsaw
right
outside
my
window
and
I
don't
know
if
I'm
gonna
be
able
to
get
through
the
day
but
anyway,
all
right.
So
the
biggest
change
that
I
made
to
the
design
was
that,
after
talking
to
roman
and
really
talking
to
others,
they've
kind
of
brought
up
a
few
times,
I've
converged
the
virtual
machine
config
into
the
vm
pool
and
made
it
like
a
template
section
similar
to
how
deployment
sustainable
sets
work.
Where
you
define
you
want
deployment
with
so
many
replicas.
E
I have a VM pool where you define the replicas, everything you want, and you actually have the template of the VM within the VM pool. And I was hesitant about this until I began thinking about the use cases that I wanted to keep them separate for, and I don't think they make sense. So, the reasons I wanted to keep them separate: I was thinking you'd have a one-to-many relationship, where you'd have one VM config matching multiple VM pools, and the reason you'd want to do this...
E
This
is
perhaps
graduating
a
vm
config
to
production
and
turns
out.
I
just
don't
think
that
makes
any
sense
at
all,
because
vm
configs
are
going
to
be
namespace
scoped
you're
not
going
to
have
your
prod
and
staging
and
dev
environments
in
the
same
name,
space
and
it's
it
doesn't
make
any
sense
practically
in
the
kubernetes
environment.
To
me
anymore.
E
The
other
reasoning
I
came
up
with
was
you
could
have
versioning
of
your
beam
configs
and
you
could
have
a
vm
config
that
you
sign
one
version
to
the
vm
pool
and
then
create
a
new
config
where
you
have
assigned
the
next
version,
and
you
could
roll
back,
but
deployments
already
have
this
kind
of
behavior,
where
we
save
a
history
of
like
the
transaction
history
of
all
the
different
changes
that
have
occurred
to
the
deployment
and
there's
a
revision
history
associated
with
each
one
of
those
changes,
and
you
can
roll
back.
E
And
if
we're
going
to
do
that
kind
of
behavior,
I
should
probably
align
with
how
other
kubernetes
primitives
work
today.
So
that's
my
thoughts.
I
think
it
makes
the
bm
pool
spec
way
more
complex.
Looking
because
we
have
both
the
tunables
related
to
how
to
manage
all
these
virtual
machines
with
the
virtual
machine,
spec
itself
and
it's
kind
of
verbose,
but
maybe
that's
just
the
nature
of
what
we're
dealing
with
any
thoughts
about
combining
this
virtual
machine
config
with
the
vm
pool.
Does
that
sound
okay
to
anyone?
Everyone.
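For readers following along, here is a rough sketch of the converged shape being described, with all type names assumed rather than taken from the final KubeVirt API: the pool embeds a VM template the same way a Deployment embeds a pod template, instead of referencing a separate config object.

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// VirtualMachineSpec stands in for KubeVirt's real VM spec (which itself
// nests a VirtualMachineInstance template, hence the layered templating
// mentioned later in the discussion).
type VirtualMachineSpec struct{}

// VirtualMachineTemplateSpec mirrors corev1.PodTemplateSpec: metadata plus spec.
type VirtualMachineTemplateSpec struct {
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              VirtualMachineSpec `json:"spec,omitempty"`
}

// VirtualMachinePoolSpec plays the role appsv1.DeploymentSpec plays for pods:
// the replica count and management tunables live next to the embedded template.
type VirtualMachinePoolSpec struct {
	Replicas *int32                     `json:"replicas,omitempty"`
	Selector *metav1.LabelSelector      `json:"selector,omitempty"`
	Template VirtualMachineTemplateSpec `json:"template"`
}
```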
A
It's interesting, yeah. I mean, I think I've gone back and forth on this idea. So what I liked about the thing was, maybe the last thing you mentioned: that it does simplify.
A
It does simplify the VM pool object quite a bit. I know when I was originally thinking about this idea, I think I used either the VM template, or like a running VMI, and used the object reference as a way to take that thing and just kind of multiply it. Yeah, I mean, I think where I'm going with this is: yeah, having it in there does make it more complicated, and only more complicated. It just makes it more complex, and yeah. I liked the idea of having some sort of reference before, but the technical reasons for it, really, I don't know; other than just simplifying, that was really the only reason that I had. Maybe easier to read, but that was kind of it.
E
It felt better to me to have them in separate resources, for the similar reason that you're talking about, where I felt like it was easier to kind of grok what was happening. But I don't know if that's the case or not. I think that these are probably just expected usage patterns in Kubernetes now. So I'm not sure that anyone would find it helpful; it's more confusing that these two objects exist rather than one object, when they're used to standard Kubernetes primitives, where one object would exist and kind of embed the thing that you're going to replicate.
G
Yeah, it's always... what's Fabian calling it? Roman, your mic's doing that thing. Yeah, yeah, Fabian always...
H
...called it the cookie cutter pattern, which everyone expects, yeah. So, like, here is the thing that will be used, and here is how often you will get it, and you have it on Deployment, StatefulSet, DaemonSet. It's always the same thing, and also on other controllers which bring their own CRDs, yeah. Yes, yeah, I can see. I mean, the object is big; that's really not so nice about it.
E
I
think
the
thing
I
I
dislike
most
about
it
being
embedded
is
we
have
like
layered
templating,
so
you
have
a
virtual
machine
template
and
you
have
the
virtual
machine,
instant
template
inside
of
that
and
it's
just
kind
of
maybe
you
call
it
stanza
or
something
I
kind
of
hate.
I
hate
the
nested
part
of
this,
but
I
can't
do
it.
H
And
it
potentially
opens
very
understandable
support
for
commands
like
cube,
cuddle,
roll
back
and
so
on,
yeah,
because
yeah
they
are
nowadays
mostly
generic,
so
that
everyone
can
make
use
of
them.
E
No strong feelings? Nobody's offended by this? Okay, all right. So the last point I had here: Roman and I were talking about this ordered selection for scale-in, and the different selectors and things like that, and I want to make sure that we document and have kind of strong use cases for why we need these custom selectors and things like this to exist, for selecting the virtual machines that are going to be torn down during the scale-in process. Ryan...
E
I
know
this
is
one
of
the
features
you're
most
interested
in.
Do
you
have
like
a
real
world
example
of
like
how,
in
practice
you
might
use
this.
A
Yeah, well, we can... let me see, I'm gonna go to your document; you have them all captured in here, I think.
E
Look at that, I had an example, I think.
A
Is this it? Here? Yep.
E
That's the one; we can work with that. There's also one in the... let's see... yeah, I have one, automatic, okay. So let's look at the example that you have. The exact same example is in the document today as well, though. Okay.
A
Yeah, okay. So, the label selector. Okay, so, for everyone, for background: as we're scaling in VMs as part of the pool, we have taken the number of replicas down from, say, 100 here to, like, 90. So which are the VMs that we are going to choose to terminate? The order policies are selected in order; first is the label selector here. So, okay.
A
So the idea behind this is that, as an admin, I know I have VMs running, but they're not serving a lot of traffic, or they're not in use by someone, a customer or something. So I know that those are safe to terminate, but they're still running, because, you know, I want to have them around, running already, in case someone shows up and I can just provide them with a virtual machine. So they're not important, so I can remove them first during the scale-in process, before I remove one that is being used by someone.
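As a sketch of the selection behavior being described (the label key and the oldest-first fallback are assumptions for illustration, not the proposed API): victims matching the ordered label selector are taken first, and a base policy covers whatever remains of the scale-in delta.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type VMI struct {
	Name    string
	Labels  map[string]string
	Created time.Time
}

// pickVictims returns the VMIs to delete when scaling the pool in by count.
func pickVictims(vmis []VMI, count int) []VMI {
	victims := []VMI{}
	rest := []VMI{}
	for _, v := range vmis {
		// Ordered policy 1: VMs the admin has marked as safe to terminate.
		if v.Labels["pool.kubevirt.io/priority"] == "expendable" && len(victims) < count {
			victims = append(victims, v)
		} else {
			rest = append(rest, v)
		}
	}
	// Base policy (catch-all): oldest-first, one of several possible choices.
	sort.Slice(rest, func(i, j int) bool { return rest[i].Created.Before(rest[j].Created) })
	for _, v := range rest {
		if len(victims) == count {
			break
		}
		victims = append(victims, v)
	}
	return victims
}

func main() {
	now := time.Now()
	vmis := []VMI{
		{Name: "vm-a", Labels: map[string]string{"pool.kubevirt.io/priority": "expendable"}, Created: now},
		{Name: "vm-b", Labels: map[string]string{}, Created: now.Add(-time.Hour)},
		{Name: "vm-c", Labels: map[string]string{}, Created: now.Add(-time.Minute)},
	}
	fmt.Println(pickVictims(vmis, 2)) // vm-a (label match), then vm-b (oldest)
}
```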
E
Okay, got it; that one makes sense to me. Do you think you would ever need this kind of ordered...?
A
It's a good question. I think it's hard to say. Like, I think right now it's sort of like an on-or-off switch, but I could very well see the case where it needs to be more than that: if, you know, I had to make a choice between two bad options, and I knew one was worse than the other.
E
Okay, that makes sense. I think I can get behind that. So that's the label selector; I can write a use case for that, and I think we're good. So, the node selector. Yeah, go ahead.
A
Anyone could, like an admin. So what I was thinking, when I was talking about these cases, was that we'd write some operator to do the labeling, or there would be some controller that would do it. But an admin can do it, really anyone that has access to this API can do it; I just figured it'd be done automatically.
A
Well, since we're doing deletion, if we did it that way, that would sort of be the reverse of the order, I think. Because, I guess, to kind of start from the bottom here: the whole idea is that we need to have... well, I mean, I guess we could do it that way, but the way we have it here is that we have one policy that's always going to be true. I guess, if you just do your suggestion, we would just look at the list in the reverse order.
E
That's difficult, though, because you have new virtual machines coming online; you have to immediately mark them as "don't delete them", or... I don't know.
B
Because I'm just thinking, who is going to mark it as "it's not important", you know? So it will be all the nodes marked not important, and then you remove the nodes that are not important. I'm just thinking about the workflow; it's not important because it's not running important workloads on it. So I'm just thinking about the administrator, you know, the logic that someone is going to use there.
H
I
guess
one
example
would
be
like:
okay,
there
are
less
than
five
people
logged
in
on
these
machines.
They
should
be,
should
go
first
or
there's
no
one
locked
in
right
now,
so
they
are
preferred
if
you
scale
down
and
that
can
be
done
automatically
just,
for
instance,
the
guest
agent
can
report.
There
is
no
one
logged
in
you,
see
that
and
you
mark
it
yeah.
A
Right, yeah. I mean, I can kind of live with that idea, but mostly, yeah, like, being able to... I don't know. To me, logically, I'd mark the ones that don't matter, but yeah, we could go... I don't know, I mean, we could go either way, I mean, I could...
A
I
think
we
can
also
solve
for
that
use
case
marcelo,
just
by
simply
putting
the
the
most
important
at
the
bottom
right
I
mean
I
can
I
can.
I
can
deal
with
it
both
ways.
H
What
I
also
wanted
to
ask
here
on
this
section
is:
there's
a
base
policy,
all
this,
for
instance,
what
I'm
used
to
from
other
kubernetes
objects
which
try
to
the
the
best
candidate
is
that
they
have
quite
a
lot
of
quite
a
lot
of
criterias
and
how
to
test
it
like
the
oldest
first,
but
also
the
oldest,
not
ready
ones,
first,
for
instance,
so
there's
kind
of,
and
that's
not
the
only
one
there
also.
Then
it
also
considers
how
long
of
how
long.
E
Roman, I'll just speak to it, because I can't hear you. It's meant to be a catch-all, a base policy, where you go through the order policies and then, if nothing hits and you set the base policy, it would land on that, and we can set a lot of different options for that. It could be oldest, it could be newest, it could be oldest-and-not-ready, I don't know, like, anything.
H
That's the question for me. Like, on a Deployment and ReplicaSet, it normally takes the newest ones first, because they have the least amount of ready time, and then it takes the ready ones: the ones that passed the readiness probe it takes last, and so on, right. There is quite a lot of that.
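For reference, a simplified sketch of the ReplicaSet-style deletion ranking being alluded to (not the actual controller code, which weighs several more criteria): not-ready pods are deleted before ready ones, and among equals the newer pod goes first.

```go
package main

import (
	"sort"
	"time"
)

type candidate struct {
	Name    string
	Ready   bool
	Started time.Time
}

// rankForDeletion orders pods so that the best deletion victims come first.
func rankForDeletion(pods []candidate) {
	sort.Slice(pods, func(i, j int) bool {
		// Not-ready pods are preferred victims over ready ones.
		if pods[i].Ready != pods[j].Ready {
			return !pods[i].Ready
		}
		// Among equals, delete the newer pod first (least runtime lost).
		return pods[i].Started.After(pods[j].Started)
	})
}

func main() {
	pods := []candidate{
		{"old-ready", true, time.Now().Add(-time.Hour)},
		{"new-unready", false, time.Now()},
	}
	rankForDeletion(pods)
	// pods[0] is now "new-unready".
}
```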
A
Okay, yeah. That was another thing we talked about, David. That was like... that could be an optimization, or, I don't know if you have that in the doc or something, or if it was something that could be turned on here, or if it's something that's just assumed; what did we end up with?
A
Okay, let's see. Sure, so my thought behind this is that, if I am monitoring my node health, there are going to be times when I have a node that's unhealthy, and my intent is going to be that I kind of want this node to be drained, because it's unhealthy and I don't know what's happening there. I'd really like my workloads to scale down, so let's target those at a higher priority than the ones on nodes that I know are healthy.
E
That's interesting. So the example is really a node that's being... Will we say that the node's being drained, so we're trying to shut down a node or something like that, or are we saying that you've detected that this node is acting strange, so you want to select it? What would be the difference between a node selector and some sort of automation that labels every VMI or VM?
A
Yeah,
so
with
the
label
selector,
I
would
expect
I
expect
to
like
the
sort
of
the
level
of
granularity
to
be
or
sort
of
the
level
of
protection
to
be
based
on
like
these.
Are
you
know,
vms?
That
are
just
that.
I
don't
care
that
much
about
so
I'm,
first
and
foremost
like
that's
fine,
let's
just
get
rid
of
them.
I
don't
really
care.
If
I
need
to
then
make
another
choice,
you
know
I've.
I've
went
through
those.
If
I
have
a
node,
you
know,
maybe,
for
whatever
reason
it
could
be
unhealthy.
A
It could also be that the hardware is not as good. It could be a number of things that I'd want to use to distinguish it, to be killed next. Anything that I run on that node... maybe there are specific types of VMs I run on that node, and those, you know, are the ones that I'd want to kill next. So I'll use...
A
I'd use those nodes as my way to distinguish, instead of a set of labels. I mean, I could use labels here, but I think the idea is that the addition to this would be that it's like the node health: something is going on with the node, or something is different about this node, such that I would rather go to that node next to kill. Okay.
E
Maybe there's older hardware you want to phase out over time, or things like that, and you have the opportunity to begin draining things off of that node in a natural way. I guess, what would be the difference between... well, I'm trying to think if that's the accurate way of doing that, though, or if you'd want to mark that node as unschedulable and begin shutting down the workloads on that node, if it is gonna be taken out of rotation or something like that.
A
Yeah, like, kind of the scenario I see is that, at this point, I may have marked it as unschedulable, and, you know, I may even be attempting to evict at this point, but I'm in a crunch now, because, you know, for whatever reason, I need to bring down my number of VMIs.
A
It
needs
to
be
scaled
down,
and
so
at
this
point
I
I'm
just
deciding
that,
like
okay,
I've
had
enough
like
these
workloads
just
have
to
go,
you
know
they're
be
blocking
because
of
eviction.
You
know
pod
whatever
disruption,
but
something,
but
now
we're
we're.
We're
deciding
like
it's
time
to
it's
time
to
remove
these
these
vmis,
and
this
would
be
like
you
know,
it's
sort
of
the
easier
way
to
do
it.
So.
H
I
think
this
may
collide
with
a
few
mechanisms
which
you
can
already
use
like.
One
is
the
unscheduled
label.
Marketing
is
unskippable,
another
one
is
having
having
on
the
vm
pool,
template
affinity,
affinities
or
undefinities
to
specific
labels
like
you
would
like
you,
you
just
decide
once
on
the
pool
or
on
your
on
yeah,
for
this
pool
that
whenever
a
specific
label
appears
on
a
node,
you
prefer
that
new
vms
are
not
scheduled
there.
H
Then
you
get
that
kind
of
automatically
there
it's,
because
what
I'm
a
little
bit
what's
a
little
bit
unclear
with
me
on
for
me
on
the
selection
policy
is
what
is
done
next,
so
you
say:
set
the
node
select
there,
but
what?
But?
How
would
you
then,
for
instance,
what's
the
intention?
The
next
intended
step
like
do
you
expect
that
new
vms
are
also
then
not
created
there,
so
that
there
is
an
empty
affinity
automatically
added
to
or
is
it
independent,
that's
kind
of
hard
to
get
from
justice
yeah?
A
Yeah,
I
think
I
think
in
yeah
I
think
in
general,
so
in
general
like
if
I'm,
if
I
like,
just
assume
this
is
the
only
pool
in
my
cluster.
If
I'm
scaling
it
down,
I
wouldn't
I
wouldn't
expect
any
more
vms
to
land
there,
because
I'm
not
creating
any
more.
So
it
could
be
a
factor
like
that.
A
So
that
could
be
true
that,
like
we
like
just
because
of
this,
as
being
my
only
pool,
I
wouldn't
expect
anyone's
to
land
in
there
anyway,
and
but
it
also
could
be
the
case
that
I
I've
marked
it
in
a
schedule.
It
could
be
either
one.
So
I
would
be
forceful
and
then
you
know
you're
not
no,
no
one's
going
to
land
there.
But
if,
if
I
have
a
sort
of
a
fixed
count,
then
then
I
know
anyone's
going
to
land
there.
A
So, okay, I guess, like: if I'm scaling in, I could possibly remove from node two, like, yeah, I could remove from node two, and let's say we scale up again: a VM could land on node two, and, like, I'd be okay with that. It would be up to sort of me to decide, like, okay...
A
If
it's
the
node's
marked
on
the
schedule
or
not,
I
think
I
think
maybe
the
way
to
like,
like
the
way
I'm
kind
of
looked
at
this
is
that
we
have
label
selector.
This
could
technically
classify.
We
could
like.
We
could
label
a
vmi
anything
it
could.
This
could
capture
every
use
case.
What
this
does
is
it
it's
sort
of
it's
a
subset.
It
sort
of
allows
me
to
not
have
to
have
to
label
everything
it
can.
I
can
also
distinguish
this
way
by
note.
A
It's
sort
of
like
a
way
I
could
without
having
to
deal
with.
You
know
the
cases
where,
like
basically
writing,
a
controller
that
labels
based
on
nodes
and
then
you
know
having
it
in
this
field.
I
could
just
use
this.
It's
as
a
way
of
doing
it
and
for
all
those
reasons
before
like
you
know,
because
maybe
the
hardware
is
different,
maybe
because
I
know
it's
not
in
a
good
state
on
you
know
the
nodes
that
needs
to
be
remediated,
whatever
any
reason
like
that.
B
Hardware
shouldn't
it
have
like
also
label
like
if
the
hardware
is
different
or
if
you.
B
A
What I'm saying, yeah, like, what I'm saying is that these could fall under the label selector. But what I'm saying is that it's more convenient to have a field that explicitly states, like, okay: we can use VMs on this node as the next ones to be killed, instead of having to create a label and mark the ones per node and then effectively doing that, you know? Like, effectively, like, I could...
A
Well, so, the idea that it's, like, you know, different hardware, different node states, sort of... does that make sense as a reason why? Like, to me it does: when I'm scaling down, I want to take those at a higher priority on my kill list, because, you know, the node is just not healthy.
A
Yeah, I agree. I mean, you should remove the node, I mean, but, like, what I'm saying is, like: if I'm managing this using a pool, and I notice that there are VMs running on this unhealthy node and I need to scale down, I would rather target those, as opposed to a healthy one. Like, what's my next option? It's the oldest one; well, I could be targeting a healthy VM there. I would rather target these.
H
With healthy and unhealthy nodes, what you normally will see there, and I'm not sure if the scale-in helps you there, is that nothing happens automatically. So you could have the node selector there, and your pool controller would potentially delete the VMs, but they would be stuck, because the node is not behaving properly, so the objects don't get cleaned up by the kubelet. You get no confirmation; you have to do forced deletes. So I'm not sure if that helps you, or if you need, like, a fencing agent or something.
A
Yeah, yeah, so in that case, like, I'm okay. I was gonna bring that up when you were saying that, because, yeah, like, I'm okay with that, because now my pool has, like, a correct understanding of, like, the cluster state. I can handle that node's remediation, like, separately. I can be like, okay, that node's a problem, we're going to deal with it separately. I don't want those VMs in my pool here; I'd rather just start a new one.
A
You know, I'd rather, like, if I'm scaling down, I'd rather get rid of them, and then, when I scale back up, you know, I'm not going to be like... those are gone, like, I don't care about them. What I would rather do is get rid of one that I know is on a misbehaving node, then, yeah.
H
Yeah, all I mean is, with scale-in you would probably have to somehow... then you're changing the meaning of the node selector; it has a different meaning. So one meaning is, okay: when I scale down, delete these VMs first; once they're done, you create new ones, whatever. But what I just wanted to say is, they do not go away; you still have them in the inventory if the nodes have issues, right. What would probably help you...
A
You know... I don't quite know what you mean. Like, my understanding on this is: if I know that this VM... we'll make the assumption here that it's on a node that's not responding. If I know this VM is bad... When I say inventory, I mean the VM pool's inventory: if I know that the VMI that I'm holding in this VM pool inventory is on a node that is not responding...
H
So, when you scale in, take them first. But you also mentioned something about unhealthy nodes, and there you have the issue that you would have a different meaning for the node selector, because it would mean that these nodes are unhealthy, and potentially you cannot even delete the VMs there, because you don't know their states, but you want them ignored now.
E
I'm gonna make a suggestion here, especially for the sake of time and getting this worked on at some point. Virtual machine pool: why don't we follow up with the node selector, or whatever this is, after the initial implementation takes place? And maybe that gives Ryan...
E
...for example, you, a chance to adopt VM pools and kind of discover the cases where you would want to select things in different orderings, like, in the real world, and then we can work through those exact scenarios. Because I'm not sure the node selector is clear to me.
E
So
what
I'm
proposing
here
is
to
take
note
selector
out
of
the
design
document
for
now
keep
these
ordered
policies,
they'll
just
be
label
selectors
and
keep
the
base
policy
with
the
understanding
that
we
can
expand
the
ordered
policies
to
include
things
like
those
selector
or
perhaps
something
more
accurate
for
exactly
what
you're
trying
to
target
in
the
future.
Does
that
sound,
reasonable.
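In API terms, the pared-down proposal might look roughly like this sketch; every name here is an assumption for illustration, not the final KubeVirt API. Ordered policies are plain label selectors walked in order, and the base policy is the catch-all discussed earlier.

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// BasePolicy is the catch-all applied when no ordered policy matches.
type BasePolicy string

const (
	BasePolicyOldest         BasePolicy = "Oldest"
	BasePolicyNewest         BasePolicy = "Newest"
	BasePolicyOldestNotReady BasePolicy = "OldestNotReady"
)

// ScaleInStrategy selects victims by walking OrderedPolicies in order and
// falling back to BasePolicy for whatever remains of the scale-in delta.
type ScaleInStrategy struct {
	OrderedPolicies []metav1.LabelSelector `json:"orderedPolicies,omitempty"`
	BasePolicy      BasePolicy             `json:"basePolicy,omitempty"`
}
```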
A
Yeah, I think, like I said before, with the label selector we can do everything that node selector currently does. All those cases that I mentioned can be covered by the label selector in one form or another, yeah. I mean, it's just... I think, well, the only thing here is, for all the different reasons I mentioned, it would be a convenience. But this is something that, like I said, we can expand on, I think.
H
I
think,
and
I
can
understand
all
the
cases
you
brought
up-
I'm
just
not
sure
if
it's
really
convenient
also
in
operation,
if
you
put
them
all
into
this
node
selector.
So
I
think
it's
great
if
we
have
the
chance
to
discuss
the
use
cases
for
this
separately
to
see
where
it
best
suits,
because
I
think
it
will
not
all
end
up
there
but
yeah
an
exciting
general
yeah.
So.
E
There
might
be,
for
example,
ryan.
If
we
we
see
the
use
cases
that
the
node
is
unhealthy
or
unresponsive.
We
can
start
saying
all
right
so
vms
that
are
running
on
nodes
that
are
not
reporting,
like
their
health
check
and
everything
target
those
first
and
then
that's
like
a
catch-all
where
you
don't
have
to
actually
list
your
nodes
and
this
node
selector
we're
just
gonna.
Do
the
right
thing
dynamically.
A
Yeah
just
to
make
it
more
easy:
okay,
yeah,
like
I
said,
label,
selector,
no
matter
what
is
serviceable
and
if
we
need,
if,
if
there's
when
we
talk
about
like
state
and
readiness
for
like
this
yeah
like
that,
could
be
a
whole
large
discussion.
I
mean
you've
already
talked
about
the
state
of
the
vmi
and
node
state.
There's
another
one
yeah.
We
could
include
that
in
there
that's
how
we
want
to
terminate
automatically
yeah,
that's
possible.
Yeah,
I
mean
yeah.
I
mean
the
only
other
thing
I
like
I
mentioned.
A
Like
I
said
hardware
yeah
I
mean
again,
we
could
do
labels
for
that.
So
I
think
at
least
for
the
for
the
time
being
like
this
would
be
the
thing
that
could
solve
the
use
cases
either
way,
and
then
you
know
if
it
becomes
something
that
whatever,
if
it's
just
a
pain
like
we
just
need
to
expand
it,
because
we
have
a
clear
use
case
on
whatever
labeling
or
something
you
know
based
on
nodes.
Then
then
that's
fine.
We
can
talk
about
this.
E
Okay,
so
I'll
capture
this
in
the
document
I'm
going
to
take
out
the
node
selector
and
the
examples
I'm
going
to
make
a
note
about
this
discussion
and
kind
of
the
future
thoughts
on
what
need
to
what
we're
going
to
look
at
after
it's
like
a
follow-up.
I
guess
this
is
a
I'm
documenting
that
this
discussion
has
taken
place
and
that
there'll
be
a
follow-up
on
how
to
handle
this
after
the
base.
Implementation
lands.
Okay,.
E
I
think
this
is
really
close.
I
think
this
could
probably
be
worked
on
like
in
the
next
week.
We
just
I'd
just
like
to
get
some.
I'm
gonna
finish
out
this
last,
hopefully
last
revisions,
and
then
I
think
I'd
like
to
get
one
final
final
round
of
feedback
where
people
give
hopefully
looks
good
to
me,
and
then
this
might
be
something
I
can
start
working
on
cool.
A
Nice, okay, all right, cool. We got some... just in here, good. And then, all right, I think we're at time. Any final thoughts here, in the last few seconds, from people before we conclude?