Kubernetes SIG Node, 11 Jul 2023

From YouTube: Kubernetes SIG Node 20230711

Description

SIG Node weekly meeting. Agenda and notes: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit#heading=h.adoto8roitwq

GMT20230711-170511_Recording_1542x1020.mp4

A

Hello, hello. It's the weekly SIG Node meeting; today is July 11th, 2023. Welcome, everybody. Today we have a few agenda items. We will start with Ace.

B

Yeah, so I opened this issue a week or two ago. Basically, overall node usage is reported very differently between cgroup v1 and v2. I mostly put it on the agenda today because I think I at least have some alignment on what I think the solution is, but I was hoping to get consensus here and try to move that forward.

B

I mean, it's a runc fix; it needs to be picked up in cAdvisor, potentially, and then picked up in Kubernetes, but I would like to get that train moving if people agree.

B

So the very short version of this is: on cgroup v2, the root cgroup has no file like memory.usage, and we used total minus free to approximate it, which doesn't really match the calculation that was done in v1, and it can be off by several hundred megabytes, almost up to a gigabyte in some cases. The fix in runc that I suggested basically just matches the calculation in v1, and I might need to... sorry, not this one. Actually, it's at the very end.

B

If you look at the PR that I linked. But it basically just makes it anon plus file.
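
For reference, the two calculations mentioned here can be compared in a short sketch. This is not runc's code (runc is written in Go); the function names and the sample numbers are made up for illustration, and the only assumptions are that /proc/meminfo reports MemTotal and MemFree in kB and that the cgroup v2 memory.stat file reports `anon` and `file` in bytes.

```python
# Sketch only: compares two ways of estimating root memory usage on cgroup v2.
# Function names and the sample numbers below are illustrative assumptions.

SAMPLE_MEMINFO = """\
MemTotal:       16384000 kB
MemFree:         4096000 kB
"""

SAMPLE_MEMORY_STAT = """\
anon 6000000000
file 6000000000
"""


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of byte counts."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not fields:
            continue
        amount = int(fields[0])
        # Most meminfo entries carry a "kB" unit (1024 bytes); a few are plain counts.
        if len(fields) > 1 and fields[1] == "kB":
            amount *= 1024
        values[key.strip()] = amount
    return values


def parse_memory_stat(text):
    """Parse memory.stat-style text: one "name value" pair per line, in bytes."""
    values = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            values[parts[0]] = int(parts[1])
    return values


def usage_total_minus_free(meminfo):
    """The approximation described as 'total minus free'."""
    return meminfo["MemTotal"] - meminfo["MemFree"]


def usage_anon_plus_file(stat):
    """The calculation described as 'anon plus file'."""
    return stat["anon"] + stat["file"]


if __name__ == "__main__":
    old = usage_total_minus_free(parse_meminfo(SAMPLE_MEMINFO))
    new = usage_anon_plus_file(parse_memory_stat(SAMPLE_MEMORY_STAT))
    print(f"total - free: {old} bytes")
    print(f"anon + file:  {new} bytes")
    print(f"difference:   {old - new} bytes")
```

With these made-up inputs the two estimates differ by roughly 580 MB, which is in the range of the discrepancy described above; real nodes will of course show different numbers.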

B

And yeah, I think the other question is then: how do we actually test this? For the testing that I've done, I basically take a node, and you can look at it on cgroup v1 or v2. You just reboot it and change it, and you see what gets reported, and you can check /proc/meminfo, and the actual usage is exactly the same. It's purely a reporting calculation issue. So yeah.

C

Yeah, we need to get the runc PR in. I'll poke some runc maintainers and take a look at it myself too.

B

Correct, yeah. I mean, if there are no objections, I think we can move forward. But definitely, if anyone is a runc maintainer and wants to review this, please do take a look. I'll just ping the folks there and see.

C

I'll take a look, and I'll also check with the kernel subsystem maintainer there just to get their thoughts.

A

Thank you for bringing it in. So if it's runc, and then cAdvisor, and then Kubernetes, please ping people more aggressively. Thank you for bringing it to the meeting.

A

Okay, Kevin's item was stricken; probably it's already merged. Next one: Karthik.

D

Yeah, hi. We discussed this in one of the previous meetings, and we had some feedback and concerns as well, so we are looking forward to hearing more opinions from the community and how to take this forward.

D

So, in short, what we are trying to achieve is this: currently, if you want to resize the compute of a node, you need to manually restart the kubelet. With this approach we want to dynamically propagate the changed values to the cluster level. That is the main intention behind this KEP. So we want to know more about it.

A

Yeah I think we shared a lot of feedback there.

A

Typically, the process is like this: we need to define the scope through iterations of understanding what the feedback is and what the minimal product is that we can get, with an understanding of what will happen going forward.

A

This KEP has a few feedback items already. Do you know the feedback? Do you want to reiterate it, or try to discuss it in this meeting?

D

Yeah, there is a concern regarding the stability of the kubelet, and that was the one raised. So we want to understand from which perspective we should tackle it so that we can address that issue.

A

Yeah, I think the bigger question was semantic: do we even want to support this kind of API, and what will it mean if we do? One of the suggestions was to concentrate on kube-reserved, making it dynamic, for instance.

A

I think you have an issue and you have a PR for the KEP, right? So I think it was raised in one of those. I remember seeing it.

A

Okay, but I hear this request quite often, and there are two types of requests. The first request is: let's make the node more dynamic and have it report its status proactively. That's probably what you suggest, right?

A

It monitors the usage to update the status correspondingly. Another approach people are asking about is API-based: can we make a node big, and then have the API adjust the node to say, only use that part of the node? That is another approach people are thinking about. Both are valid, for different scenarios. We just need to understand how soon we want to address it, and obviously kubelet stability will be a big issue here as well.

A

Also, for the in-place resource update, that KEP was merged with the understanding that we introduced a few race conditions. We hope to address them closer to beta, but it may require a lot of refactoring, so maybe after this refactoring it will be easier.

D

Okay, so you mean to say we have to keep this on hold for a while?

A

If you want to keep iterating on that, it would be great. The main goal is to understand the scope and agree on a scope with all parties. So if you can go through the PR and issue again and collect all the feedback that was given, it may help. Or maybe we can have a separate meeting for that, but I think it should be closer to the beginning of the next release. Right now, everybody will be busy with 1.28.

A

Are there any more thoughts on dynamic node resize?

A

Okay, then, let's move forward.

D

Yeah, one last thing: I would request everyone to just leave a comment on the issue or the enhancement, so that we can go through it and try to address those things. Thank you.

A

Thank you. The next one, I don't know whose it is.

E

Yeah, that's me, actually. Hello, everyone. I'm new to the meeting, and I was taking an interest in this particular issue, where I understand we need owner references to be exposed from the pod, either via environment variables or the downward API. Probably the downward API. So I just wanted to get a better understanding of it, and also of how to go ahead with it.

A

Can you comment on the use case?

E

So, yeah, this is one of the use cases that was under discussion, where, for metrics, they wanted to get the owner references of the pods, from the ReplicaSets up to the Deployments and all that. I'm still not clear on what the exact use case is that will require this. So that's one of the things I wanted clarity on.

A

Yeah, if it's about metrics, it's quite an interesting use case, and we met it when we were working on probes. Every pod has its name as an identifier, and for some applications this name starts with a base name, but then it will also contain a ReplicaSet ID and another generated ID as well, so you have multiple random IDs as a suffix.

A

So on the one hand you want to have metrics specifically for this pod, for this instance of the pod, but for some metrics you want some aggregate across multiple pods. And the easiest way to aggregate across multiple pods is, I mean, today we strip the suffix, but it may be better if you have a direct owner reference; maybe a cleaner solution.
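
To make the trade-off concrete, here is a small sketch contrasting the two ways of finding an aggregation key. The helper names and sample pods are hypothetical; the only facts relied on are that Deployment pods are named `<deployment>-<pod-template-hash>-<random suffix>` and that pod metadata carries an `ownerReferences` list whose controller entry names the direct owner (a ReplicaSet for Deployment pods).

```python
# Sketch only: hypothetical helpers, not code from any metrics agent.

def workload_from_pod_name(pod_name):
    """Heuristic: drop the last two dash-separated parts of the pod name."""
    parts = pod_name.rsplit("-", 2)
    if len(parts) == 3:
        return parts[0]
    return pod_name


def controller_owner(pod_metadata):
    """Return (kind, name) of the controlling owner reference, or None."""
    for ref in pod_metadata.get("ownerReferences", []):
        if ref.get("controller"):
            return ref["kind"], ref["name"]
    return None


deployment_pod = {
    "name": "web-7d4b9c6f5d-x2k8q",
    "ownerReferences": [
        {"kind": "ReplicaSet", "name": "web-7d4b9c6f5d", "controller": True}
    ],
}
statefulset_pod = {
    "name": "my-db-0",
    "ownerReferences": [
        {"kind": "StatefulSet", "name": "my-db", "controller": True}
    ],
}

if __name__ == "__main__":
    for pod in (deployment_pod, statefulset_pod):
        guess = workload_from_pod_name(pod["name"])
        print(pod["name"], "guess:", guess, "owner:", controller_owner(pod))
```

The suffix heuristic gives `web` for the Deployment pod but wrongly gives `my` for the StatefulSet pod, while the owner reference is unambiguous in both cases. Note that for a Deployment the owner reference only reaches the ReplicaSet, so getting to the Deployment would take one more lookup.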

A

It's about that, yeah. If you want to start the KEP, you can start writing it. I think we will get more attention to it closer to 1.29. At least right now, we are closing up 1.28, and code freeze is soon.

E

Great, yeah. Okay, I'll start drafting the KEP, and maybe in the next meeting we can put it up for review in the same issue thread, something like that.

A

Yeah, you can do that. As I said, you may not get enough eyes right now, because everybody will be working on wrapping up the 1.28 release. We will have code freeze very soon.

A

Okay, any more comments?

E

Nothing from my end. Thank you.

A

Okay, next item is mine. I wanted to let everybody know that the sidecar PR got merged. Yeah, I see people reacting with emojis. Sidecars are a long-standing pattern that people have been using for a while, but it wasn't used everywhere.

A

That was because of all the limitations with Jobs and with sidecars not being able to restart in the case of pods running to completion. This PR addresses these concerns, and sidecars are becoming almost first-class citizens in Kubernetes.

A

Please watch out for any failing tests that may be a result of this KEP's merge. We also have more follow-up PRs: since it's a very big piece of functionality, we split it into multiple PRs. The main logic is already merged, and the only thing left is cleanup here and there, and maybe a little bit of enablement of more things like readiness probes and such. But yeah, please be on the lookout for any regressions we could have caused.
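
As a rough illustration of what the merged functionality looks like to a user, here is a sketch of a pod manifest built as a Python dictionary. The shape is taken from the upstream sidecar containers documentation rather than from this meeting: a sidecar is declared as an init container with `restartPolicy: Always`, behind the `SidecarContainers` feature gate. The pod name, container names, and images are made up.

```python
import json

# Sketch only: names and images are illustrative assumptions.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "batch-with-sidecar"},
    "spec": {
        # The pod runs to completion, as a Job's pod would.
        "restartPolicy": "Never",
        "initContainers": [
            # Sidecar: an init container that keeps running and may restart.
            {
                "name": "log-shipper",
                "image": "example.com/log-shipper:latest",
                "restartPolicy": "Always",
            },
            # Ordinary init container: runs once before the main container.
            {"name": "setup", "image": "example.com/setup:latest"},
        ],
        "containers": [
            {"name": "main", "image": "example.com/batch-job:latest"}
        ],
    },
}


def sidecar_names(pod_manifest):
    """List init containers that are declared as restartable sidecars."""
    init_containers = pod_manifest["spec"].get("initContainers", [])
    return [
        c["name"] for c in init_containers if c.get("restartPolicy") == "Always"
    ]


if __name__ == "__main__":
    print(json.dumps(pod, indent=2))
    print("sidecars:", sidecar_names(pod))
```

The helper only inspects the manifest; whether a cluster honors the field depends on the feature gate being enabled on that cluster.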

A

The next item is from Mark.

F

Yeah, I wanted to mention that we have a pull request open for the Windows implementation of the CRI-only stats, and I was hoping to get some reviewers from SIG Node to look at that, so that we can hopefully get that merged by next week. There's the implementation and some e2e tests. The e2e tests are currently failing because these e2e tests uncovered some issues where, on Windows, pods in transient states were causing some of the stats not to be reported correctly. There are some linked PRs in containerd

F

that I know folks have been reviewing, and hopefully those get merged soon too, and then we'll make sure that the e2e tests pass once the job that runs these tests consumes those updates. But I think that we could probably work on getting this merged in Kubernetes, since it seems like the functionality is now working.

A

Yeah. You're targeting GA for this, right? I believe it's...

F

Targeting beta. Is Peter on the call?

G

Yeah, I'm here. It's targeting beta. I'm suspicious as to whether it'll make it there, but it's currently targeted at beta.

F

Yeah, and is that beta and on by default? If it's on by default, I think we'd want to try and get the Windows side merged.

G

No, no. Beta and off by default, for sure.

A

Okay, yeah. For some reason I thought it was on by default and targeting GA, and that was a little concerning. But if it's beta, it's great; we progress. And I see... oh, no, who commented?

F

I'm here for that. Thanks, everyone.

H

Hey guys, hi, good morning. I am new to this project and meeting, so I'm going to ask maybe a very silly question. I have this PR, and it's about an issue where, with the static policy, pods in the shared pool are still allocated the reserved CPUs. And then there's a test, the pull-request CPU manager test; it's just failing.

H

That's the part I want to know about.

H

I tried to look at the test in the YAML, but it just said exit 1 with no specific error message. So, is Francisco there?

I

Yes, I am. Hey.

H

Yeah, I think you're very familiar with this. How should I debug and see whether something is really failing or it's just a false alarm?

I

So basically, yes, you would like to have the CPU manager tests run against a PR. About the specific test, unfortunately I'm quite busy this week, for the reasons already explained, as are others. I will take a look as soon as I can, but if someone else wants to help, feel free, because I have limited bandwidth. But yes, we need to...

I

It's related to your PR, I strongly believe that, so we need to make sure the lane is working correctly.

H

Yeah, but let's say there's an error like this in a test. How should I see the actual error message?

I

Yeah, this is exactly the problem: the lane is not supposed to behave this way. We need to understand why the lane is not even starting, and we can help with that; we will get to that. But long story short, it's not supposed to be like this.

A

So a few suggestions.

H

Ideally, you should explain.

A

Yes, a few suggestions. First, try to run it on an empty PR. It may be helpful, and if it doesn't work on an empty PR as well, then the lane is clearly broken.

A

You can also ping SIG Testing, because it may be something with infrastructure. Once these two are exhausted, please ping again on Slack. We also have other CPU and topology tests failing right now on periodics. We had them on CI on pull requests and they were reasonably stable before; I don't know what happened with this particular one. On periodics, they were added a few weeks back, but they are not working right now either.

A

So we're looking for people who have the bandwidth to look at these tests and try to get them back into a working state. Those tests are very important, because before these pull request jobs and other jobs were added, we didn't have tests running on multi-NUMA environments. And I think this one is called multi-NUMA as well, which...

A

But since it's multi-NUMA, it uses a bigger machine type, actually a very big machine type, so that may be one of the issues, and it may be test-infra related. That's why I'm suggesting to start with SIG Testing. Maybe they can help.

H

SIG Testing... sorry, can you explain what the SIG is?

A

Yeah, in Kubernetes we have different SIGs. In SIG Node we're looking at the kubelet and such, and SIG Testing is concentrated on how tests are run and what kind of test frameworks we have. There is also an infrastructure SIG that may be able to help with that. One of those may be better placed than SIG Node to look at this particular issue.

H

Okay, SIG Testing is another channel, right? Another group of people, I got it.

A

And there is also the Kubernetes infra channel as well.

A

They may be able to help.

H

Okay yeah. Thank you.

A

As I said, we have this one and we have more related tests that need to be fixed, and we're looking for people with the bandwidth to get them back into a green state.

A

If you can help with that, it will be amazing.

A

And the next one is from Paka.

J

Oh yeah, this one is about the eviction CI failure for the PID-related test. Here we can see it.

J

There are many failing runs, and one of them is PID-pressure related: it fails in the CRI-O job, but it runs well with containerd. It triggers PID pressure easily in a containerd cluster in my local test as well. I'm not sure if you have joined the SIG Node test failures email group, but you will get a failure email every day, I think.

A

It's been a known issue for a very long time, and I'm really glad you're looking into that.

J

Yeah, I wanted to raise it here because so many people are here. I'm not sure if anyone knows the background or context of this e2e test. According to recent test runs, it is always failing.

A

So the request is to look at the PR?

J

I want to know some context, maybe about the PR. I fixed it by not waiting for the PID pressure condition on CRI-O, but I'm not sure if that is right. Another way to fix this, and I'm not sure if it is right either: I tried to make the PID pressure threshold lower to make the kubelet react faster; I'm not sure if that works. It also flakes in the containerd environment for the eviction order, because the kubelet cannot know the PID process

J

stats of each pod. After I enabled the feature gate to get pod stats from the CRI, it works in my cluster.
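
For readers unfamiliar with the signal being discussed, here is a minimal sketch of a PID pressure check. The kubelet documentation defines the `pid.available` eviction signal as `node.stats.rlimit.maxpid - node.stats.rlimit.curproc`; everything else below (function names, the sample numbers, and expressing the threshold as a fraction of the maximum) is an assumption made for illustration, not the kubelet's implementation.

```python
# Sketch only: not kubelet code. Threshold handling is a simplifying assumption.

def pid_available(max_pid, current_procs):
    """pid.available as documented: maxpid minus the current process count."""
    return max_pid - current_procs


def under_pid_pressure(max_pid, current_procs, threshold_fraction):
    """True when available PIDs drop below the given fraction of max_pid."""
    return pid_available(max_pid, current_procs) < max_pid * threshold_fraction


if __name__ == "__main__":
    max_pid = 32768  # sample value, e.g. read from /proc/sys/kernel/pid_max
    for procs in (1000, 31000):
        print(procs, "processes -> pressure:",
              under_pid_pressure(max_pid, procs, 0.10))
```

This node-level check needs only two numbers. Ranking pods for eviction additionally needs per-pod process counts, which is the part described above as depending on where the stats come from.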

J

There are multiple problems here, I think.

A

Yeah, unfortunately, I don't know the logic of eviction ordering for PIDs. Does anybody on the call know the history or something about it?

J

By the way, about where the PID process stats come from: if they come from cAdvisor, there is a PID-related metric which has some performance issues. I think I have mentioned that before in an old Kubernetes issue.

A

Yeah, I think I looked at the PR, and I'm not comfortable fixing it the way it suggests, by just changing the condition. I really want to understand what's going on. So if anybody has context, please share.

A

Yeah, let's try to move the discussion to maybe the issue and Slack.

A

Okay, we can find somebody who remembers or who is willing to help bounce ideas.

J

Yeah, the current fix is not graceful.

A

Okay, we're at the end of the agenda. Anything else for today?

A

Okay, if nothing else, please, if you can spend the rest of the meeting time on reviews, please do so. There are many PRs that are waiting for some eyes and some attention. Thank you very much. Bye-bye.
