From YouTube: Kubernetes SIG Node CI 20230607
Description
SIG Node CI weekly meeting. Agenda and notes: https://docs.google.com/document/d/1fb-ugvgdSVIkkuJ388_nhp2pBTy_4HEVg5848Xy7n5U/edit#heading=h.2v8vzknys4nk
GMT20230607-170532_Recording_1840x1020.mp4
A: Okay, first item: do you see it?
A: So, first: Paco wanted to join, but he didn't. Maybe we go with the second item first and then come back. So, ARM and multi-NUMA. This is what you created: a pull request.
B: I did that today. I just added three jobs, for CPU manager, memory manager, and topology manager, and it basically uses the multi-NUMA test infrastructure. I believe we added that a few months ago when we were graduating topology manager to GA. So now we should have signals periodically. Just take a look. I wouldn't call myself the most proficient with tests and CI jobs, so I would appreciate it if people can let me know whether everything looks okay.
A: Great. We just went through the same exercise with the standalone tests, so it wouldn't be that... yeah.
B: Yeah, it was very similar. All the arguments and everything were similar, other than a few fields that had to be populated differently.
A: And I checked the ARM tests; they're still failing. Let me get the link. So yeah, something is broken with the startup. I know that a few people are already looking at it. We have an issue for this; if it isn't fixed soon, it needs to be... I will, I will try to find people.
A: Okay, so let's go to this uber issue. James periodically creates these issues on perma-failing jobs, and this time is no exception. So, to do, yeah.
A: Okay, something under SIG Testing, I'm not sure. Okay, let me put an action item on my agenda to follow up on ownership and why we have this.
A: Yeah, I think that's it. We only need to triage this one, and if Ryan cannot work on this, I need to reassign. Okay, let's go to the PRs, then.
B: Yeah, so recently we came across an issue where, after a kubelet restart, we were noticing unexpected behavior. That led me to use the sample device plugin (this is a sample device plugin that we have in-tree) for testing kubelet restart behavior, and I noticed that it wasn't re-registering itself after the kubelet restart, and that was because we were only allowing it to register once.
B: I made a couple of changes just to make sure that we have the ability to re-register the device plugin once the kubelet is restarted, or, you know, in any other scenario like a node reboot and things like that. So that's the rationale for the change, and we are currently looking into some of the changes that we made as part of the device manager, to make sure that from a kubelet restart perspective everything is in order.
B: ...is when there's a register control file specified, and that is like a trigger file in the sample device plugin that triggers registration. This is used for cases where we don't want the device plugin to register, and this was useful in some of the end-to-end tests that we had implemented. By default we want to ensure that the device plugin registers itself, so the second part, the else part of it, corresponds to that.
A: So what is happening here is the kubelet is disturbed, and then some actions happen, like some registration and de-registration of pods, and then the kubelet starts again. So...
A: There will be a period of time when the kubelet is not running. I remember Francesco was saying that there are some cases where the test is running with the kubelet as a system daemon, so the kubelet will restart even if we just killed it with a stop.
A: If not, I can just comment here.
B: Yeah, actually I'm not sure; I was talking in the context of device plugins, but it could be the same thing. This is... this is DRA.
A: Oh, this one is flaking.
A: It was switched to CRI-O. Okay, that's why we can't find it.
A: Good, I think we're done with the test side of things.
A: So if there is a pod that wasn't admitted in...
A: ...it races for termination; its pod is now deleted, and the DaemonSet can create a new pod, because... I think that one, this one, still exists.
B: If I remember correctly, at least in the device manager, when you have a checkpoint file there's no update at the time of deletion; there's only an update at the time of creation. So if we had a subsequent pod that was requesting CPUs, the checkpoint file would be updated later. That's my understanding. I can take a look at this.
A: Yeah, I wonder if we can check that the cgroup was updated.