From YouTube: Kubernetes SIG Node CI 20230816
Description
SIG Node CI weekly meeting. Agenda and notes: https://docs.google.com/document/d/1fb-ugvgdSVIkkuJ388_nhp2pBTy_4HEVg5848Xy7n5U/edit#heading=h.2v8vzknys4nk
GMT20230816-170353_Recording_1920x1200.mp4
A
Okay, hello everyone, this is SIG Node CI. Today is August 16, 2023. And yeah, let's get started. Okay, so first, let's go over the agenda. We have a couple of PRs from before; apparently all of them are already merged. Anything else you want to add on this? But yeah.
C
Yeah, I see an interesting discussion on the busybox cherry-pick. What is the final decision there? Are we cherry-picking to previous versions?
B
Yeah, I think it should be okay to cherry-pick. I think it was someone from Red Hat that wanted it for one of the other architectures.
C
Is it what the timer fixed?
E
I think the ones that came up were for the serial tests, and they were not the release-blocking ones. They were the...
A
This one, I don't know why it's like that, but it's passing now. Okay, after counters...
A
Close it. I will update this image this week; the newer one is better but it's not released yet.
A
It looks, yeah, it looks good. Those variations, I don't know.
C
The previous drop was down in our runtime?
C
Yeah, yeah, this one went down too, cool, I think, right.
E
Also, recently there were the changes in secret memory accounting, so actually we'd expect a little bit more memory, because it wasn't accounting that content correctly before.
C
I think it's a good action item, maybe not, like, immediately. Mike, if you can put it on the agenda to check how the performance is measured, whether it's configured for cgroup v1 or cgroup v2, because ideally we need to start switching things to v2 and have v2 as the default everywhere.
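For the action item above, a minimal sketch of how such a check could look on a test node; NODE_NAME and the exact fields inspected are assumptions, not from the meeting:

```bash
# Check whether a node runs cgroup v1 or v2: cgroup2fs means the unified
# hierarchy (v2); tmpfs here usually means the legacy v1 hierarchy.
stat -fc %T /sys/fs/cgroup/

# The kubelet's own view can be read from the configz endpoint
# (NODE_NAME is a placeholder).
kubectl get --raw "/api/v1/nodes/NODE_NAME/proxy/configz" | grep -o '"cgroupDriver":"[^"]*"'
```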
A
Yeah, just from the current...
A
On this, I couldn't find any...
A
I couldn't find any related PR, so maybe I can move this to in-progress. Oh, don't, just move it.
A
This, it's just bumping the images of a few things, and yeah, I think that's it.
C
No, move it to Archive. It's typically, like, so global changes to the periodic testing, so we don't...
B
Yes, this was failing on an EC2 cluster. Anyone interested?
B
Yes, that's a CRI-O failure, I believe, yeah.
F
And that's the PR, right?
G
There was a very old issue linked in here, and this issue is similar. So the symptom is: remote volumes of the Pod are becoming local mount points after rebooting the node. And the issue is triggered by this: create a cluster, create a deployment with one replica, shut down the node, delete the Pod with the force option first, bring back the node. The new Pod is running, but it's actually using the node-local mount.
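For reference, a minimal sketch of the reproduction as described; the deployment name and image are placeholders, and the remote volume from the actual report is omitted:

```bash
# 1. Create a deployment with one replica (the real report mounts a remote
#    volume into this pod; that part is omitted here).
kubectl create deployment demo --image=nginx --replicas=1

# 2. Shut down the node running the pod, then force-delete the pod while
#    the node is still down.
kubectl delete pod -l app=demo --force --grace-period=0

# 3. Bring the node back and check where the replacement pod runs and what
#    it has mounted; the report says it ends up on a node-local mount.
kubectl get pods -l app=demo -o wide
```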
G
It was mentioned somewhere that the issue might be fixed, and they were asked to retry on 1.27 and 1.28, but apparently there is still that issue. That's why they created this new report.
C
The thing to note here is that nominally it's mostly handled by SIG Storage.
C
We will... Jason, Fran and...
C
And then say, like, SIG Storage kind of has this first.
C
Just, just a text comment.
G
Okay, this one: pods in ImagePullBackOff get stuck there if the image fails to pull for long enough. So they have a Pod that is stuck in ImagePullBackOff. Initially, this was caused because the pull Secret did not exist. They corrected that issue; however, by the time it was corrected, much time had passed, and now, even though the pull Secret exists, the image pull is still stuck.
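As a triage aid, a minimal sketch of the scenario; the registry, image, and secret names are placeholders, not from the issue:

```bash
# A pod referencing a pull secret that does not exist yet.
kubectl run backoff-demo --image=registry.example.com/private/app:latest \
  --overrides='{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}'

# The pod sits in ImagePullBackOff; the kubelet's back-off interval grows up
# to a cap, after which pulls are retried periodically.
kubectl describe pod backoff-demo | grep -A5 Events

# Create the missing secret afterwards and watch whether the next retry succeeds.
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com --docker-username=USER --docker-password=PASS
kubectl get pod backoff-demo -w
```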
C
The version is 1.24, which is out of support, but yeah, I don't remember any changes in this space. I...
G
I think somebody mentioned that they should ask the support of the managed GKE first, and then log it upstream in case there is a bug.
G
How can we reproduce it, to see what happened? I think those events are from the container runtime. They have just mentioned that it reproduced with whatever information is here, like not having the Secret first and then adding it.
C
I think there are three pieces: like, whether it's in open source; that we need a repro on a latest version, like later than 1.24; and that we need good, clear steps to reproduce. I don't feel that there are enough steps to reproduce, yeah.
G
You can ask for more information here, and ask for steps to reproduce, and also whether this is reproducible in later versions. Right, so is this just a text comment that I need to write? Yeah.
G
The next one is: propagated shutdown signals resulting in a kill after 30 seconds, although the termination grace period is 120. They have a Pod with processes orchestrated from shell scripts; when the system sends a kill, it gets propagated to these processes. They find that these processes get killed in 30 seconds, although the grace period is 120; the Pod itself is still killed after 120 seconds.
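To make the scenario concrete, a minimal sketch of such a pod, assuming a shell entrypoint that traps and forwards SIGTERM; the names and script are illustrative, not from the report:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sigterm-demo
spec:
  terminationGracePeriodSeconds: 120
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c"]
    args:
    - |
      # Forward SIGTERM to the child and wait for it, as the report describes.
      trap 'kill "$child"; wait "$child"' TERM
      sleep 3600 &
      child=$!
      wait "$child"
EOF

# Delete the pod and time how long the processes actually get before SIGKILL.
kubectl delete pod sigterm-demo
```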
C
...and wait for the graceful termination period to finish, so I think it's expected.
E
Yeah, interesting terminology. And the follow-up: like, they say that the process shut down, but then the Pod is not killed. So I think that they're expecting, like, some lifecycle tie to the rest of the Pod: like, if the single container dies, then the rest of it should be restarted or shut down, which, like, maybe they... I don't know, I'm not sure exactly what the Pod is.
A
All right, I've seen several cases where, if you send the, if you send the SIGTERM to a Pod and it doesn't reply, then they kill it, but sometimes the Pod is not cleaned up, and I don't know if this is, like, expected behavior by now, but yeah, I think you should kind of expect the Pod to be cleaned up after a SIGKILL.
C
I looked at the zip file, and the zip file only contains shell files running inside the Docker image, so I would suggest to ask for the Pod spec as the information that we need, and after the Pod spec we can decide what's next for us. But yeah, I agree with Mike: after we have a Pod spec, we need to understand what is going on in the termination of the Pod.
D
Potentially, yeah. So this is, like, this is something that we as a community would have to discuss and see if that's something we want to invest resources into, but I don't think there's any immediate action item from our point of view. If someone wants to actually pick this up, that would make sense. But maybe we can just say, like, it's a kind/feature.
C
Yeah, remove kind/bug, and, like, for requests like that, do we simply go through the KEP process, or we...
C
Yeah, and do remove this kind/bug.
G
Next one: Kubernetes postStart hook doesn't show the event since 1.25. So they upgraded from 1.19.5 to 1.26, and after that, a failing lifecycle hook doesn't show the error on describe, while older versions did show the error. Basically, for the new versions, the same code doesn't show the error the way older versions did. How can we reproduce this? Okay.
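For reproducing this, a minimal sketch of a pod with a postStart hook that always fails, to compare what kubectl describe reports across versions; the names and commands are placeholders:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: poststart-demo
spec:
  containers:
  - name: main
    image: busybox
    command: ["sleep", "3600"]
    lifecycle:
      postStart:
        exec:
          command: ["sh", "-c", "echo boom >&2; exit 1"]
EOF

# Check whether the FailedPostStartHook event still carries the hook's message.
kubectl describe pod poststart-demo | grep -A5 Events
```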
G
I'm not sure. So I think, like, the gist of this is the behavior changed: we were showing errors before, but not now.
C
Yeah, I... aren't these part of some PR? Can you look at the PR description? Oh, it's just normalizing lifecycle events. Okay, so I think what happened is we failed to include the message inside the event.
C
And we can do "PostStartHook failed:" and then the older message, like in previous versions.
C
I think we can close this one as expected.
G
I looked at this; they were mentioning that the parameter is not set while the config file specified it. But then the last comment says that it is working as expected. I think it's not reproducible.
G
It looks like, like they mentioned here: kube-proxy should be part of the node readiness check is what they are suggesting.
C
They're making this change as a feature request. I think this may be a duplicate of some older bugs, and...
G
Okay, I think what Antonio is suggesting here is that if they make kube-proxy a part of the node readiness check, then it would involve a lot of checks. I think it would require all the static pods as a part of the node readiness check. This would solve the scheduling problem, but it will impact node startup readiness.
C
So the issue with static pods and other pods' readiness is a well-known problem that we have had for many years. So I would say just accept this bug. We may need to de-duplicate it towards some other bugs, but I think at this stage it may not be a bug, maybe some feature request to change how we treat node readiness conditions.
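For context, a small sketch of what node readiness exposes today; the node name and the kube-proxy label are assumptions that hold in many standard clusters:

```bash
# Node readiness is a kubelet-level condition; kube-proxy and other
# networking pods are not part of it.
kubectl get node NODE_NAME -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

# kube-proxy usually runs as a DaemonSet with its own readiness, checked
# separately from the node condition above.
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
```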
C
Yeah, but it doesn't fix anything completely, right? So, right, and it really depends on how long it takes kube-proxy or anything network-related to start up, yeah.
G
I've added an actionable item where I can de-duplicate this to two other bugs that are related.
G
It returns failure until some condition is met; once the condition is met, it returns success. When upgrading to 1.23.6, the failure reason and message returned from the soft admit handler seem to block the Pod phase transition from Pending to Running, despite the soft admit handler subsequently returning success.
C
So many things going on with this bug. I think 1.23 is long out of support, that's the first message, and then we need to understand what the soft admit handler means here, because we don't allow any pluggability there.
C
Thank you very much, Mike and Dixie, for driving this session. Have a good rest of the day. Bye, thanks.