From YouTube: KubeVirt Community Meeting 2022-01-19
Description
Meeting Notes: https://docs.google.com/document/d/1kyhpWlEPzZtQJSjJlAqhPcn3t0Mt_o0amhpuNPGs1Ls/edit#heading=h.u74oyrl72es0
A: Okay, so I've started recording. Not quite going yet; we'll wait until the minute ticks over. But I just wanted to put out there that the KubeVirt Summit is still coming up. We've got a good number of sessions already, but I was advised that, you know, more wouldn't hurt. So even though it's after the deadline for the call for papers, we certainly would take more. You can reach out directly to any one of us, or just go to the kubevirt-dev mailing list and let somebody know.
E: I think at least the link to the meeting notes is also in the invitation, right?
F: Hi, my name is Abhishek. I joined this community recently, the CNCF community, where I found this particular link for KubeVirt, so I'm very new to KubeVirt. So probably, if someone can help me understand how KubeVirt will help. And I have requested access to this link; can you please approve?
A: Okay, I'll look into that; I'm not sure if I have the right access. Which link are you trying to get, the one to the meeting notes?

F: The docs.google.com one, whatever the thing you open; I have requested access. Oh.
A: So yeah, join our forum. I believe that gets you into the Google group, and that's open to anybody. Once you are on that, you'll also have the calendar invite and access to the docs.
E: And besides that, if you have any questions, we also have a Slack chat. That is on the same page, so feel free to join and ask questions, for example in any of the KubeVirt Slack channels that we have there, #virtualization for example. I guess, first of all, maybe #virtualization first.
A: Okay, if we don't have any other new introductions, then we can move on to the agenda. I think we have something about Coveralls.
E: I heard that my audio volume is very low; I'm not sure why that is. Can you still hear me okay, or is it low?
E: Okay, so yeah. I had a question to ask regarding the coverage checks that we are running on all KubeVirt PRs. First of all, we had a time, I think, where we were above 70% coverage, and this value is slowly starting to drop again. So I think we are now near the bottom line of 70%, and maybe falling below that.
E: So I was thinking about whether we want to enforce something like a lower boundary on the coverage value, or whether we just want that to happen naturally, by PR reviewers maybe telling people that they should add a test here or a test there. Because I think in general, coverage is just something that is a best practice.
E: I wouldn't say that coverage in itself is bad, of course not, but sometimes it's a little bit awkward. You have the famous test coverage things that just run code and don't test anything, or something like that. So what is the general opinion on that? Should we be enforcing lower boundaries, or should we just keep it like it is now?
G: So, one thing to add a little context to what Daniel's bringing up here: we had a PR this week that was exclusively removing lines, and in general that's absolutely good, because less code; if the code's just cruft and not needed, absolutely take it out. But unfortunately, because it was removing lines from the total, that's one of the dynamics that dropped our coverage beneath that 70% threshold, and the coverage bot rejected it, based solely on the fact that it had fallen underneath the line. And this is a gating lane, so that just begs the question of, wait a minute.
E: Yeah, and I was also just thinking about what the community thinks in general about coverage, so I want to get a feeling of how we should be handling all this coverage stuff.
D: Like Stu pointed out, it can get a little hairy when we're doing changes that are actually good; the coverage report we enforce doesn't necessarily match reality. We also have lots of generated code, so we could have an enormous amount of generated code that might not technically be tested. I'm not sure if that happens or not, but that would be another thing that would offset the percentage tested versus what was actually new logic.
D: I feel uncomfortable with making a hard requirement that we should stay around a certain threshold, or not drop below a certain threshold, or something like that. I think it's kind of on the reviewer's part to use this as a tool; use the coverage as a tool to understand.
D: I'm hesitant to do that. I'd be interested in other thoughts. I'm hesitant because I think there are going to be situations where it drops and that's okay, and then we have to make a decision: what's our policy at that point? If we're saying that there's reviewer discretion with this coverage test and the threshold, then does it have value to begin with for us to enforce it? I'm unsure, yeah.
D: Yeah, here's another scenario: it doesn't always reflect even when more test coverage is necessary. Let's say somebody makes a 20-line change to one of our controllers or something like that. That might need quite a bit of test coverage, depending on where that is in the code and what critical code path is involved there. Now, if that doesn't get touched, we're talking about a fraction of a percentage difference in our test coverage report, but that doesn't reflect how critical that one area was.
E: I don't think so; I think our coverage is equal on that point. But I just wanted to get back to what David said about us also covering generated code. I think I worked on that topic, and I just removed all the generated files from the coverage report, so I think we should be at least good there.
E: Yeah, I think we are filtering by file extension. So if there is, for example... I don't remember exactly, but I think there is something like "generated.go" or similar in the file name most of the time, and at least those we filter out when we create the report. Maybe you know better whether there are other extensions that we should take a look for.
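As a rough sketch of the kind of filtering being described, assuming a standard Go coverprofile; the exact patterns KubeVirt filters on may differ, so the ones below are illustrative only:

```go
// filter_generated.go: drop generated files from a coverprofile before
// reporting. Patterns are illustrative, not KubeVirt's actual filter list.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	in, err := os.Open("coverage.out")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer in.Close()

	sc := bufio.NewScanner(in)
	for sc.Scan() {
		line := sc.Text()
		// Coverprofile lines start with "path/file.go:start,end ...";
		// keep the "mode:" header and skip anything that looks generated.
		generated := strings.Contains(line, "generated.go") ||
			strings.Contains(line, "zz_generated") ||
			strings.Contains(line, "/staging/")
		if strings.HasPrefix(line, "mode:") || !generated {
			fmt.Println(line)
		}
	}
}
```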
D: Yeah, I'm not sure. I just know that when we generate, say, a new API, the client code and everything associated with that in the staging folder is pretty extensive. So maybe that's not included in the coverage, maybe it is; I haven't looked at it. But yeah, if we're ignoring some of that, then maybe it's more accurate.
E: Okay, so in general I think we shouldn't enforce any boundaries and just leave it to the reviewer. I was just thinking about whether, once the PR got merged, it's too late if someone overlooked, for example, missing tests or something. There could be situations where the reviewer, because he's also just human, overlooks that and forgets to ask the author to add them, or something. But yeah.
I think that's something that we would then see, at least with the coverage values decreasing to a certain point. Yeah, okay; I think we should leave that to the reviewer.
D: Yeah, it is a risk. Just as an alternative to this: what would it look like if we made this coverage report required not to drop below a certain percentage or whatever, but allowed us as reviewers to override that? So treat it as a signal that requires a reviewer intervention to ignore.
E: You can just override it with the default override command, so at least that should be good then.
G: That's an interesting suggestion. So basically we would still have a general guideline, but in certain cases we could intentionally ignore it.
D: Well, I was just going to say, if we do that, we would need to write down a policy of how that is handled, like why it's okay to ignore it; something to point to that's kind of like a contract that we've all agreed on: that
this test is informational, a signal to the reviewer that something needs to be looked at, and it's reviewer's discretion whether something needs to be changed. Because I don't want to see it enforced where people have to do kind of silly things just to increase the code coverage for their PR when it makes no sense. So we need to be explicit about our contract there, so there's no confusion about what that failing test means and whether it's okay or not to ignore it.
H: I think that if you put the check in, it will be red, and then you ask a maintainer to override the check; that's nasty, I think. It's complicating everything, in my opinion. How about just putting in a rule that it cannot drop more than two percent per PR, or something like that? Does that make sense?
D: I think something like that was what we were talking about, I thought.
H: No, I think currently it was... Daniel, can you please correct me?
E: Yes, actually it was like that, and to be clear on that: we have two boundaries that we can enforce. First, we can, for example, impose a limit on the coverage decrease in an individual PR, by a percentage that we want; for example, if the coverage decreases by two percent in this PR, it should fail. And we also have the other boundary: that in general, if the coverage falls below a certain point, then it can fail.
E: So we have those two options to enforce those rules. But yeah, like you said, Edward, it fell below 70 percent, and what we did at the moment is that we just lowered the percentage, to give some room for discussion and to decide what exactly we want to do.
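For illustration, a minimal sketch of those two boundaries, an absolute floor plus a per-PR drop limit, assuming a plain Go coverprofile workflow; the file names and threshold values here are hypothetical, not KubeVirt's actual CI configuration:

```go
// coverage_gate.go: hypothetical sketch of the two boundaries discussed.
package main

import (
	"bufio"
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"
)

// totalCoverage runs `go tool cover -func` on a coverprofile and parses the
// trailing "total:" line, e.g. "total:  (statements)  71.3%".
func totalCoverage(profile string) (float64, error) {
	out, err := exec.Command("go", "tool", "cover", "-func="+profile).Output()
	if err != nil {
		return 0, err
	}
	sc := bufio.NewScanner(strings.NewReader(string(out)))
	var pct float64
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) > 0 && fields[0] == "total:" {
			pct, err = strconv.ParseFloat(strings.TrimSuffix(fields[len(fields)-1], "%"), 64)
		}
	}
	return pct, err
}

func main() {
	const floor = 70.0  // absolute lower boundary (illustrative)
	const maxDrop = 2.0 // maximum allowed decrease per PR (illustrative)

	base, err := totalCoverage("coverage-main.out") // profile from the target branch
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	head, err := totalCoverage("coverage-pr.out") // profile with the PR applied
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	switch {
	case head < floor:
		fmt.Printf("FAIL: coverage %.1f%% is below the %.1f%% floor\n", head, floor)
		os.Exit(1)
	case base-head > maxDrop:
		fmt.Printf("FAIL: coverage dropped %.1f%% in this PR (limit %.1f%%)\n", base-head, maxDrop)
		os.Exit(1)
	default:
		fmt.Printf("OK: coverage %.1f%% (was %.1f%%)\n", head, base)
	}
}
```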
H: To something like a number; I don't know what the number is. If we see that we are at 70, then maybe it should be 60, I don't know. And if we reach this number again, it means that we need to stop and check what's going on, because we are dropping coverage too much and it must be increased. So something like that; or maybe this needs to be monitored in the long run.
E: I just wanted to put this up for discussion, but I would be fine with something like: let the allowed decrease probably be minus two percent or so per PR, because I think more than that would be a sign that there is definitely something wrong, at least, and as the general rule that would be okay. So we drop the lower boundary altogether and just let the reviewers decide.
H: Yeah, and maybe just take an action to investigate why it dropped so much in the last month or something, maybe.
I: In my opinion, I have to say that if a reviewer sees that a PR drops coverage by more than two percent, then he would likely not approve this PR and see that something is wrong. I'm not sure what the value is of not enforcing that such a PR cannot merge in that circumstance.
H: It doesn't fail. If you... I think, no, it doesn't.
E: Okay, so if you don't mind, I would just go to the next topic. If you want, Chandler, please.
E: Okay, thank you. So another thing I wanted to mention was that we sometimes have problems with reviewers not being active anymore, because they have left the project for some reason or are inactive in another sense. What I saw that Kubernetes does, or at least it looks like that, is that every year, or at the start of the year, they run some tool to clean out the OWNERS files, with data from the Kubernetes devstats. And I was thinking about that.
E: We could probably do the same, so that there would be some automated, or at least even a manual, run of this tool to just clean out the OWNERS files.
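A minimal sketch of what such a run could do to a single Prow-style OWNERS file (reviewers/approvers lists); the last-review data here is a hand-written placeholder, whereas the Kubernetes tooling derives it from devstats, and the three-month cutoff is just a figure floated later in this discussion:

```go
// prune_reviewers.go: hypothetical OWNERS cleanup. Only the reviewers list
// is pruned; approvers are deliberately left alone, per the discussion.
package main

import (
	"fmt"
	"os"
	"time"

	"gopkg.in/yaml.v2"
)

type Owners struct {
	Reviewers []string `yaml:"reviewers"`
	Approvers []string `yaml:"approvers"`
}

func main() {
	// Placeholder activity data; real tooling would pull this from devstats.
	lastReview := map[string]time.Time{
		"alice": time.Now().AddDate(0, -1, 0), // reviewed last month
		"bob":   time.Now().AddDate(0, -7, 0), // inactive for seven months
	}
	cutoff := time.Now().AddDate(0, -3, 0) // three months of inactivity

	raw, err := os.ReadFile("OWNERS")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var o Owners
	if err := yaml.Unmarshal(raw, &o); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	kept := o.Reviewers[:0]
	for _, r := range o.Reviewers {
		if t, ok := lastReview[r]; ok && t.After(cutoff) {
			kept = append(kept, r)
		}
	}
	o.Reviewers = kept // approvers untouched

	out, err := yaml.Marshal(&o)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(string(out))
}
```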
G: One thing to keep in mind is that we do have a published governance doc, which does touch on this. So if we decide to change course, that will have ramifications that need to be actually documented.
D: I don't think we're going to have much pushback on removing people from the reviewers section of the OWNERS files. I think we have some policies in place where a few people have to help decide for the owners, sorry, the approvers; I'm not sure exactly what the policy was. So I guess what I'm trying to say is: I could see an automated policy around reviewers, which are the people that automatically get assigned to PRs. We could do something like, if people within the reviewers list
haven't done a review in, I don't know, three months, then probably they aren't doing reviews anymore, and I think it would make sense to remove them in an automated way; it's easy to re-add people as well. But the approvers, that's difficult to automate, I think, because that has some different implications.
E: Yeah, that makes sense, and that is exactly what I am trying to aim at: that inactive reviewers, after a certain period of inactivity, somehow get removed, so that we don't have inactive people assigned reviews. That's exactly the problem we're trying to solve here, yes.
D: Yeah, I don't even know; that threshold probably should be just a couple of months or so, and then we could add them back if needed. It's bad if we are auto-assigning people who are inactive to review things, because that just means PRs start to pile up; so being aggressive there isn't necessarily a bad thing.
E: I think what I was trying to solve here was that we have a lot of repos under the kubevirt org, and I think it would make sense to do this for all repos. At least I think so, because I've seen, for example, not only in kubevirt/kubevirt but also in Kubernetes, people getting assigned who are no longer active on the project. So I think this would make sense for the whole kubevirt org.
D: Yeah, I'm not sure; certainly the repos you just pointed out make a lot of sense. I probably need to be removed as a reviewer from kubevirtci and project-infra, because I don't do a whole lot of reviews there, but I think I still get assigned sometimes. So those definitely make sense.
D: There are some repos in the kubevirt org that maybe just aren't very active. I mean, there are a lot of projects in there now, so it's possible or foreseeable that there's just not a lot of reviewing going on at all in some repo. So if we had a policy where people got removed after not reviewing for so much time or whatever, it might just automatically remove everyone, because there were no PRs for a few months. So maybe it's a repo-by-repo sort of opt-in.
H: Maybe it makes sense just to send an email to the maintainers, warning about this and asking them to take action. Because to nominate an approver you need approval from the maintainers, so to remove one, I think you also need the same. It doesn't make sense that something will automatically remove someone.
E: No, I think we're not talking about approvers, we're just talking about reviewers at this moment, right? Approvers shouldn't get touched, I think. We're just talking about people that are inactive being actively assigned reviews because they are in the reviewers list, and, like David said, this lets PRs pile up, and I think that doesn't make much sense. Someone else should get those, so the people that created the PR get a fair chance of getting their PR reviewed within time.
H: So this mechanism doesn't really work; I mean, I need to pick reviewers myself, and that's it. It would probably work if we had OWNERS files under specific folders, because then it would be more focused.
E: No, I completely understand what you're saying, and I have the same problem most of the time, but I think this is a different problem from the one I was thinking about, at least. It's a different problem that also needs to get solved somehow; maybe this is also something that we should talk about.
J: Hi, this is something that I really want to bring up; I put it in the chat already. It's the fact that reviewers are actually tied to branches. So if you're in the OWNERS file of a branch, you can still be assigned as a reviewer for that branch, i.e., backports, even if you're removed from main. And I think that's wrong; I think the OWNERS file from main should be used for every branch.
D: That's interesting. I don't know what to think, like whether any action needs to take place because of that, but it's a good thing to understand, at least.
A: So, okay, sounds like the conversation's kind of slowed down on that. Do we want to move on to the next item?
E: Yeah, I think so. I think I have everything regarding opinions on that, and my conclusion is that I'm going to create the PR for a couple of repos and ping the maintainers of those repos, and just ask them whether we would be good to go with what I brought up.
A: Okay, so the next item is one that I put in; I kind of mentioned this at the top, but before I started recording, so I want to talk about it again. We want to make sure that everybody knows about the KubeVirt Summit coming up. It's in mid-February. It's a two-day virtual event, and though the call for session proposals has closed as of this previous weekend, we wouldn't mind seeing another session proposal or two. We have enough, we think, but it's kind of on the envelope of being enough. So if anybody else, especially somebody from maybe a newer organization that is joining the KubeVirt effort, wants to put in a session proposal, you can reach out to us directly, either through the dev mailing list or by sending it directly to me. So, next: Edward.
D: Yeah, in general we have a guide when it comes to our API, I believe. So backwards-incompatible changes would be removing a field from the API or modifying the name of a value within the API.
D: I think it makes sense to maybe at least begin documenting what we know causes backwards or even forwards compatibility issues.
H: The minimum here, in my opinion, is: what are the combinations that can be deployed during the rollout of an upgrade? For example, I think we had already discussed previously that handlers can be of a different version at a specific point in time, and the controller and the handler can be different versions, so it causes the launchers to be of different versions for different handlers. So it's that kind of thing, yes.
D: Yeah, that might be orthogonal to the backwards compatibility. Well, it is related, right, because it defines when we are allowing backwards-incompatible changes to occur. So it's the process leading up to a controlled breakage, that sort of thing.
H: Sorry, yeah; so we have a current feature at the moment that is affected by this. So what we... what...
D: Yeah, makes sense. What's the PR or feature that you're kind of using as the catalyst for needing this policy, just out of curiosity?
H: I think it's the one about the SR-IOV hotplug.
D: Got it, yeah, that's a really good example. So I imagine it's possible... I don't remember all the context of this PR, but I remember now that we have to carry some old logic with us for a certain amount of time. Is that the case here, where we then want to be able to remove that in the future?
H: I think it's that currently there is a mechanism that does the hotplug as part of the migration flow, and we are turning it into a reconcile loop, and there are changes in the communication between virt-handler and virt-launcher. At least there are consequences: for example, if there is a new virt-handler with an old virt-launcher, or a new virt-launcher with an old virt-handler, it matters here, because you added a new gRPC command.
D: Got it. So eventually what we would like is to deprecate the old behavior, where the hotplug was handled, probably in virt-launcher or something, but there's a certain amount of time that we have to wait for everyone to update, and for these APIs within virt-launcher to be available for the hotplug, and things like that. So yeah, this policy would define, I guess, that timeline for when we can remove that logic.
H: Yeah, well, I'm still optimistic that my metrics in the end will show everything is green, but yes, it may reach the point where we need to wait and support both, or something like that.
I: Yeah, hi, so this is an issue I opened in order to raise a discussion. Also, we saw that the CPU utilization for these two replicas was very, very high all through this period, so I thought to basically leverage the Kubernetes horizontal pod autoscaler in order to autoscale these components. We can start with virt-api, and we can base it on CPU usage, which would be the easiest. I just wanted to raise attention to that and maybe start the discussion.
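A minimal sketch of that proposal, attaching a CPU-based HorizontalPodAutoscaler to the virt-api deployment via client-go; the namespace, replica bounds, and target utilization below are illustrative assumptions, not an agreed design:

```go
// hpa_sketch.go: hypothetical HPA for virt-api. Namespace, bounds, and
// target utilization are illustrative, not an agreed-on design.
package main

import (
	"context"
	"fmt"
	"os"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	minReplicas := int32(2) // keep today's baseline as the floor
	targetCPU := int32(80)  // scale out past 80% average CPU

	hpa := &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "virt-api", Namespace: "kubevirt"},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1", Kind: "Deployment", Name: "virt-api",
			},
			MinReplicas: &minReplicas,
			MaxReplicas: 6,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ResourceMetricSourceType,
				Resource: &autoscalingv2.ResourceMetricSource{
					Name: corev1.ResourceCPU,
					Target: autoscalingv2.MetricTarget{
						Type:               autoscalingv2.UtilizationMetricType,
						AverageUtilization: &targetCPU,
					},
				},
			}},
		},
	}

	_, err = client.AutoscalingV2().HorizontalPodAutoscalers("kubevirt").
		Create(context.TODO(), hpa, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("HPA created for virt-api")
}
```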
D: So this is the horizontal pod autoscaler, so this means we would be creating more replicas of virt-api to handle the increased load from the validation and mutation webhooks.
D: My first thought on this is to understand where the scalability problems are actually occurring within our components. Before I would look at horizontal scaling, I would look at whether there are any really easy efficiency wins that we can have within our component there. Maybe we're doing something super inefficient, I don't know; I'd like to understand that before moving on to something more complex that might even potentially be hiding the issue. If we can create more instances and we are just scaling an inefficient component, that's not necessarily great. But if, in fact, we're as efficient as we can realistically or practically be, then yeah, looking into horizontal scaling would not hurt.
G: One thing that comes to mind along the lines of what you're just saying, David: when you look at an average controller and there's a lot of contention going on, we do log a lot of retries, due to the pattern we use of, you know, grab the object, fiddle with it, and issue an update, as opposed to patching. I know that we've got.
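A sketch of the contrast Stu is describing, using a ConfigMap so the example stays self-contained; the same idea would apply to the VM/VMI objects the controllers actually touch, and the object and field names here are illustrative:

```go
// patch_vs_update.go: get-modify-update with retries versus a single patch.
package main

import (
	"context"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/retry"
)

// updateWithRetry is the get-modify-update pattern: on a resourceVersion
// conflict the whole read-modify-write is retried, which is what shows up
// as the logged retries under contention.
func updateWithRetry(c kubernetes.Interface, ns, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		cm, err := c.CoreV1().ConfigMaps(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if cm.Data == nil {
			cm.Data = map[string]string{}
		}
		cm.Data["state"] = "ready"
		_, err = c.CoreV1().ConfigMaps(ns).Update(context.TODO(), cm, metav1.UpdateOptions{})
		return err
	})
}

// patchOnce sends only the delta, so it cannot hit a resourceVersion
// conflict for this kind of change and needs no retry loop.
func patchOnce(c kubernetes.Interface, ns, name string) error {
	patch := []byte(`{"data":{"state":"ready"}}`)
	_, err := c.CoreV1().ConfigMaps(ns).Patch(context.TODO(), name,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	c := kubernetes.NewForConfigOrDie(cfg)
	_ = updateWithRetry(c, "default", "demo") // contended path
	_ = patchOnce(c, "default", "demo")       // patch path
}
```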
D: I wonder... so specifically we're talking about the virt-api component here. When we're doing the retries in our virt-controller loop, I'm unsure, if there's a collision in that update, whether it actually makes it to our API or not, or if it's caught before our API server would see it.
D: I'm really surprised that we would need to go beyond two replicas for virt-api. Once we start introducing console and VNC access, or anything that has a persistent connection being forwarded from virt-api to the controller to the virt-launcher pod,
that's when things get kind of hairy with performance, I've seen. So definitely in that scenario we would want the ability to scale virt-api. For virt-api under normal operation, where that isn't a huge burden, if we have problems with two replicas I'm really curious where the problems are. If we could do some profiling, for example, that would be really interesting; we have a pprof profiler.
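For reference, the stock net/http/pprof wiring that such profiling relies on; KubeVirt wires its profiler through its own components, so this is only a generic sketch of the mechanism:

```go
// pprof_sketch.go: expose Go's built-in profiler over HTTP, the generic way.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	// With this listener up, a 30-second CPU profile during a load test is:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	// and `top` or `web` inside pprof then shows where the time is going.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```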
D: I would be interested to see, when we do one of these load tests or something like this, where we're spending the most time in virt-api when this is occurring, to give us an idea of whether this is something that we could improve or not. And maybe that's in addition to this discussion about horizontal scaling, because it makes sense for people to need to horizontally scale
G: Their environment, yeah. I was just about to touch on that. One of the concerns I have with doing a horizontal pod autoscaler based on CPU usage is that it might be too late: if you hit the API server with 400 concurrent start-VM requests, the damage is done by the time you're actually rolling out new replicas, I think. So it might be something where, if we had a way to be smarter ahead of time about anticipating that we could need that many API replicas, we'd have them ready.
D: You're right, you're right, and I'm looking at this Bugzilla that's attached to it.
D: Yeah, it's unclear exactly where it's breaking down here. So our webhook contacts... yeah, it is our webhook that's timing out.
I: And by the way, I totally agree that we need to further investigate that, and that we don't want to hide any inefficiencies, but I do think that this has some kind of a limit, right? If we're using, say, a thousand-node cluster, I guess the two replicas won't do it, even if we're very efficient. But I also agree that we can expose the knobs to HCO or something like that,
instead of doing this automatically. Although in the horizontal pod autoscaler we can have a maximum and minimum replica count, so yeah, either way.
D: Yeah, I'd definitely like to understand this issue, the Bugzilla, a little bit better, because there are potentially things here that aren't even related to CPU and memory. When our webhook is hit with an update to a VM or a VMI, that can cause a chain reaction with our API, like Kubernetes API access: we can see an update occurring on a VMI, and as a result of that we need to go off and gather some information about the larger system as a whole.
D: So it could be that we are actually timing out not because of CPU and memory usage, but because the Kubernetes API server itself is throttling us, because we're making too many requests, or things like that. So that's something to understand a little more fully before we look at an automated solution.
D: It's also interesting: I've seen a lot of performance tests done, and this performance test is different in that it uses persistent storage, where previous performance tests have used ephemeral storage, so no PVCs.
D: I wonder... I have a feeling it might have to do with the fact we're using PVCs, and that maybe our webhook is doing something inefficient once PVCs are involved, like introspecting PVCs. Maybe it's not using an informer when it should, or maybe it's difficult to use an informer there and we're having to do a live GET request; unsure.
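A sketch of the informer-versus-live-GET difference being speculated about, using PVCs; whether the webhook actually does a live GET today is, as said above, unverified:

```go
// informer_vs_get.go: a live apiserver GET versus a cached informer lookup.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Live GET: one round trip to the apiserver for every admission request,
	// which adds up under load.
	pvc, err := client.CoreV1().PersistentVolumeClaims("default").
		Get(context.TODO(), "my-pvc", metav1.GetOptions{})
	fmt.Println(pvc, err)

	// Informer: the lookup is served from a locally synced cache instead.
	factory := informers.NewSharedInformerFactory(client, time.Minute)
	lister := factory.Core().V1().PersistentVolumeClaims().Lister()
	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	cached, err := lister.PersistentVolumeClaims("default").Get("my-pvc")
	fmt.Println(cached, err)
}
```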
A: So I made an attempt to capture that in a pithy note: I heard "do performance profiling and check for inefficiencies in the components before jumping to autoscaling" as the recommendation.
C: Yes, let me explain (this is Andra from ddesk) how we approached the same issue you're having. We have a central API that controls all the load and then starts the controllers in every one of the hundred clusters we have, balancing the load not only in a specific cluster but across all clusters, because these clusters are also spread across data centers; so that you understand what we are trying to achieve.
C: It's high availability and also autoscaling that we are approaching. We are not counting only on the autoscaling of Kubernetes itself; we are also controlling the load that is coming to the cluster, and if something goes wrong with a specific cluster, we can spread it over several clusters.
C: We have an API that controls the load and can, let's say, load-balance across clusters, because a single cluster doesn't fit our needs: we have 100 clusters, with up to 1250 nodes in each cluster.
C: I can send you a link to exactly what we are talking about here, just one second.
C: Yeah, I just put the link here; okay, it's in the chat.
A: Okay, thank you. Yeah, thanks everybody for attending; we'll go ahead and wrap it up.