From YouTube: Kubernetes SIG Node CI 20230614
Description
SIG Node CI weekly meeting. Agenda and notes: https://docs.google.com/document/d/1fb-ugvgdSVIkkuJ388_nhp2pBTy_4HEVg5848Xy7n5U/edit#heading=h.2v8vzknys4nk
GMT20230614-170440_Recording_2078x1340.mp4
A
First, our tests are now running, yay. Thanks to everybody who's involved; I know many people were working on that. It's a subset of tests right now, not all of them, but it's a good enough starting point. Now we can add more and make sure that we validate arm64 upstream, so everybody using arm64 builds, whether selling them or installing them, can be more confident that it's working. I know that some tests are failing. Francesca, do you think it may be related to the device plugin, like this device plugin thing?
B
A
It'll be nice, yeah.
A
We have a bug for that. I think we have an umbrella bug tracking arm64 tests in general, so far we've tracked it there. But if it's very specific to this test, we can create a separate bug.
A
B
A
Also for multi-NUMA, we had CI, we added new jobs, but unfortunately they are all failing. Like, this one is not failing, this one is not running any tests, so it seems that some definition isn't correct, but then these two are failing. This configuration, I'm not sure if it's right, yeah.
C
B
If there is, I think, yeah, investigation is due, but let's try to first fix the CRI, because all the resource managers will depend on that and it's a common cause of possible issues. And there is a test-infra issue filed by Parkway, if I'm not mistaken, about updating the base image, which is too old, and that should be relevant.
A
I think the next item is about it, right? Oh, it wasn't filed by me, but no, no, it's different.
A
Okay, Mike, you've been looking at this cos 93, right?
D
A
Now, this one was filed by me, but somebody... thank you.
E
Yeah, I added it. So actually, the DRA-related tests are failing in this one; after it was introduced it has never passed, since last week. So I think these are the same set of tests that are failing in the arm64 one also. Just now I saw it, so like the DRA plugin is not able to register with the kubelet, some sort of error like that.
A
E
B
A
Thank you, okay, thank you. Are you also looking at that item, or are you just asking?
A
And lastly, the perma-fail issue, I just wanted to check, so we got some arm fixes. We...
A
Yes, most of the focus was made on arm. We need to keep going, and this one Mike will send today. So hopefully we can address like two of them, and next address the other ones.
A
I have quite a few issues. I already cleaned up some of them; let's go through the rest.
A
Okay, yeah, we looked at it last time. Some tests are not running as fast as expected; they're running like 0.07 to 0.072 seconds slower than expected.
A
Still no takers, I guess. Okay, let's keep it for another week and then we'll try to find more people.
F
It may be related: when we first started running some e2e node tests on EKS.
F
There were several issues related to GOMAXPROCS. The nodes actually had more CPUs, and GOMAXPROCS detects the number of CPUs on the node, so it runs more goroutines, more stuff in parallel. But then it's still CPU-limited to like one. So it ends up basically trying to run a whole bunch of tests in parallel, and they end up taking longer than expected.
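A minimal Go sketch of the mismatch described above: GOMAXPROCS defaults to the number of CPUs visible on the node, so a test binary whose cgroup allows roughly one CPU of quota still schedules many goroutines in parallel. Capping it, either manually as below or with a helper such as go.uber.org/automaxprocs, avoids the oversubscription. The one-CPU quota value is an assumption for illustration, not the actual CI configuration.

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// By default GOMAXPROCS equals the CPU count of the node, e.g. 16,
	// even when the pod's CFS quota limits it to about one CPU.
	fmt.Println("default GOMAXPROCS:", runtime.GOMAXPROCS(0))

	// Assume a one-CPU quota read from the cgroup (illustrative constant here).
	cpuQuota := 1
	runtime.GOMAXPROCS(cpuQuota)
	fmt.Println("capped GOMAXPROCS:", runtime.GOMAXPROCS(0))
}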
A
G
To move as many jobs as possible without dependencies over to the EKS cluster, one of them coming from that referenced issue.
G
Dependencies, so if any, like, internal authorization is required for those pushes, we should keep those out of this issue. Does that make sense?
A
G
They're failing because the EKS cluster requires resource quotas.
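A hedged sketch of what creating such a quota could look like with client-go; the namespace, quota name, and hard limits are illustrative assumptions, not the actual EKS CI configuration.

package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetical quota for the CI namespace; values are placeholders.
	quota := &corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "ci-quota", Namespace: "test-pods"},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				corev1.ResourceRequestsCPU:    resource.MustParse("8"),
				corev1.ResourceRequestsMemory: resource.MustParse("32Gi"),
				corev1.ResourcePods:           resource.MustParse("30"),
			},
		},
	}
	if _, err := client.CoreV1().ResourceQuotas("test-pods").Create(context.TODO(), quota, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}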
A
I will move it to the original one, but it will likely go in as soon as it's approved. But it's already approved.
C
F
A
Yeah, it is, but yeah, I wonder what is going on. Yeah, I wonder why James is running e2 tests.
D
F
A
Okay, so this is still failing. I guess, I mean, all other tests are not failing, so I believe it is a flake.
A
Yeah, we, on Google, we don't use e2 as well; we want the n-series to be used, so it probably will be just good to go. I just want to understand the context here better.
A
And this one we just looked at, and we mentioned Ed, who added this test, so maybe that can help.
A
Yeah, oh yeah, I remember now. So it seems that during the bad state, at least, containers just keep accumulating and they never get terminated. The suggestion is, can we at least have a timeout and kill them?
A
Otherwise we run into this failure.
A
Is anybody interested in taking a look at how to add a timeout?
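A minimal sketch of what the suggested timeout could look like in a Go helper: poll for leftover containers and give up once a deadline expires, so they can be killed and cleaned up instead of accumulating. countRunningContainers is a hypothetical helper and the durations are illustrative; this is not the existing test code.

package main

import (
	"context"
	"fmt"
	"time"
)

// countRunningContainers is a hypothetical helper; in a real test it would
// query the container runtime (for example over CRI) for leftover containers.
func countRunningContainers(ctx context.Context) (int, error) {
	return 0, nil
}

// waitForContainersGone polls until no containers remain, or returns an error
// once the timeout expires so the caller can kill them and fail cleanly.
func waitForContainersGone(parent context.Context, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		n, err := countRunningContainers(ctx)
		if err != nil {
			return err
		}
		if n == 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("containers still running after %s: %w", timeout, ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	if err := waitForContainersGone(context.Background(), 5*time.Minute); err != nil {
		fmt.Println("giving up:", err)
	}
}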
A
So I'm going to double-check that this to-do is still there. I am pretty sure it is, but let's be on the safe side.
E
A
And since it's about container start, maybe we need to associate it with promoting this to beta as well. Maybe I'll mark this as triaged, no?
B
Probably once we agree on the best way to fix it, which is still being discussed.
A
C
A
Yeah, since it's an issue with an alpha feature, I will put important-longterm, not critical, and yeah, it seems to be known.
A
Yeah, it's the half-hour mark, so we have a little bit more time. I suggest we just go through a little bit of the needs-information ones. Maybe we can, yeah; at some point we need to clean it up.
A
Okay, David, he was looking into that. So David said that he cannot update.
G
B
Just for context, because I'm looking into similar stuff. I don't know how GCA works, but I know that when the kubelet is restarting, so initializing, if admission fails, it kills the container intentionally. This is the bug I'm looking at, the one I mentioned previously. So, a hypothesis, I'm really just brainstorming now: if GCA runs an admission and that admission fails for whatever reason, then the kubelet will terminate running pods. I'm not sure if this is relevant or what is happening here, but just giving an idea.
B
A
Anything special? So you're talking about this issue about the device, like any pod with this device being... yes?
B
Yes, yes, that could be the common factor, but I'm really just looking at this comment now, so it's really guesswork: when the kubelet restarts, if any admission fails for any reason, then yes, it intentionally kills the pod, which could be surprising. I was a bit surprised when I learned it myself, so that is the reason why I'm mentioning it here. Could be, maybe not, but it's...
E
G
A
That seems to be a regression. Okay.
A
Okay, so it's in-place scaling related, and we asked to check cgroups. I remember, yeah, cgroups.
A
The issue is that it wasn't applied, like the CPU limit, 200 milli, and here is Zach, and...
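A small sketch of the arithmetic behind the cgroup check mentioned above: a 200m CPU limit should surface as a CFS quota of 20000us per 100000us period (cpu.max "20000 100000" on cgroup v2, or cpu.cfs_quota_us=20000 with cpu.cfs_period_us=100000 on v1). The helper mirrors the kubelet's milliCPU-to-quota conversion in spirit; the names and constants here are illustrative.

package main

import "fmt"

const (
	defaultCFSPeriodUs = 100000 // default CFS period: 100ms
	milliCPUPerCPU     = 1000
)

// milliCPUToQuota converts a CPU limit in millicores into the CFS quota
// (microseconds of CPU time allowed per period).
func milliCPUToQuota(milliCPU, periodUs int64) int64 {
	return milliCPU * periodUs / milliCPUPerCPU
}

func main() {
	quota := milliCPUToQuota(200, defaultCFSPeriodUs)
	// Expected output for a 200m limit: "cpu.max: 20000 100000".
	fmt.Printf("cpu.max: %d %d\n", quota, defaultCFSPeriodUs)
}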