From YouTube: Kubernetes SIG Node CI 20221019
Description
SIG Node CI weekly meeting. Agenda and notes: https://docs.google.com/document/d/1fb-ugvgdSVIkkuJ388_nhp2pBTy_4HEVg5848Xy7n5U/edit#heading=h.2v8vzknys4nk

A: Hello everybody, it's October 19, 2022, and this is the SIG Node CI subgroup meeting. Welcome, everybody — let's get started. We have a few items on today's agenda, so let's get going. The first item I wanted to discuss is what we prepared with Swati. Swati is not on the call, so I will represent her.
A: Yeah, we prepared this report and presented it yesterday. We didn't get any feedback during the SIG Node meeting — everybody was quiet, just listening — but I wonder if anybody on this call has feedback on what we made here.
A: Yeah, that was one of the messages I wanted to deliver. One message I found missing from this report is whether everything is okay or not. Knowing all the details of these failures, I would say the status is always green — maybe yellowish green — but looking at this, and looking at TestGrid, you are right: you see how many tests are failing and how many different problems we have.
A: Yeah, so I think one thing we can do is highlight the status — maybe saying that all the critical tests are passing and that many of the failures are known infrastructure problems. That may be one piece of feedback, but yeah.
C: I'd like to give one additional opinion, because this is going to discourage people from trying to fix these things. I was looking at things that used to fail in a particular way; they fail for a totally different reason now, and it looks like there's a fairly broad swath of infrastructure problems masking the original problems that were there some weeks back. It's hard to figure them out now.
A: Yeah — when it's green, it's better, right? You immediately know when something stops being green. For instance, the swap test was red for a very long time because of an infrastructure problem, and now I wouldn't commit to it working on Fedora. We didn't change anything in the logic, and it worked on containerd — and on Ubuntu, I believe, we're testing it — so it's supposed to be working functionality.
D: I think this is really good. I saw some of the tests for COS that we want to deprecate, I believe. I think if we went through the failing tests and figured out which ones we really care about, we could highlight those, and then we can dive into them.
A: Yeah, and I asked Mike to add this item for today's discussion. This PR highlights some of the things I wanted to discuss, and it's exactly what you're talking about: some tests we don't really care about, and they are red. Okay — let's discuss it when we get to that. Okay, okay.
A: I also wanted to call attention to this one. I was very generous here by saying many issues are one month old; in fact, many issues are multiple months old. I just figured one month is old enough to make people worry.
A: Some issues are from last year — quite a few, actually — so many issues are at least six months old. I also found that a few issues went rotten, which means that during the summer, when many people were on break, issues rotted out and we didn't freshen them. The underlying problems still exist, but we no longer track them as GitHub issues. That is...
A: Perma-failers, right. This eviction test, and this one, have been a sore eye for a very long time. I think somebody at Google is looking at this right now — Pangea is not on the call, but yeah, he's looking at that. But again, it's not top priority.
A: So, issue tracking. I think these are quite easily collected statistics, and if we start doing these reports once a month, we can get some sense of what we're doing — maybe combined with how many tasks were closed. Yeah, just...
A: That may be those PRs — so for tasks, it's just one here and 11 here. So...
A: Okay, another interesting statistic: I think we're doing reasonably well on bugs from a triage perspective, and this is what we wanted from this group. We really want bugs to be fixed by the main SIG Node — by everybody. From the triage side we're doing amazing: we caught up on all the bugs, and I think we react — people provide more information and move issues into the right homes. So we know all the issues we have; we're just growing our debt by accumulating these bugs and not fixing them.
A: Yeah, that would be an interesting piece of information. Lots of work goes into that, but it's hard to demonstrate.
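[A minimal sketch of the kind of monthly statistics discussed here, assuming the counts come from GitHub's issue search API. The label set (sig/node, kind/failing-test, lifecycle/rotten) and the one-month cutoff date are illustrative assumptions, not the report's actual queries.]

    package main

    import (
    	"encoding/json"
    	"fmt"
    	"net/http"
    	"net/url"
    )

    // openCount returns the number of issues matching a GitHub search query.
    func openCount(query string) (int, error) {
    	resp, err := http.Get("https://api.github.com/search/issues?q=" + url.QueryEscape(query))
    	if err != nil {
    		return 0, err
    	}
    	defer resp.Body.Close()
    	var out struct {
    		TotalCount int `json:"total_count"`
    	}
    	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
    		return 0, err
    	}
    	return out.TotalCount, nil
    }

    func main() {
    	base := "repo:kubernetes/kubernetes is:issue is:open label:sig/node label:kind/failing-test"
    	for _, q := range []string{
    		base,                             // all open failing-test issues
    		base + " label:lifecycle/rotten", // went rotten (untouched too long)
    		base + " created:<2022-09-19",    // older than one month at meeting time
    	} {
    		n, err := openCount(q)
    		if err != nil {
    			fmt.Println("query failed:", err)
    			continue
    		}
    		fmt.Printf("%4d  %s\n", n, q)
    	}
    }
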
A: Okay, this is it, yeah. If you have any other ideas for what we can cover in this report, that would be great. For the next report we may start doing some deltas — not only the current status but the delta as well; that may help. My hope when I started this sheet was that we would have a clear picture of what went from green to red and back to green, but it doesn't provide enough information — it's not presentable.
A: We cannot present this and say, "this is the status of the tests" — it doesn't work, unfortunately. So maybe this can be improved as well. If you have any other ideas and feedback, or want to be involved in the next report, let's discuss that.
A: We can try now — try today. Okay, Brian, you said that you want to discuss your PR. I noticed it didn't have a SIG label — that's why I hadn't seen it. So, it's a very simple command here, and after that it immediately got onto my radar.
C: Yeah, it might be a bit verbose — I'm here to pitch it, and I'll try to summarize for you. A previous PR went out; it wasn't good enough. This one I want to explain, because I want to make sure you know that due diligence was done to try to get the arm64 tests working for TensorFlow. I documented, quite a bit further down, the efforts I made to make that work —
C: — to actually build TensorFlow properly anyway. I won't go on too much about that; I put some data in the ticket. The proper solution is to not try to do that anymore. I even tried to bump it up to TensorFlow 2, and the amount of work it would take to get a similar or equivalent test going with TensorFlow 2 was way too large to justify — including pulling in Bazel and building stuff, because those images don't exist for arm64 either.
A: Doing that, I see — oh, it doesn't, yeah. Okay, the images...
A: It doesn't count — my approval doesn't count here — but do something.
D: Mike, let's —

A: — go to this one.
F: So, for anyone not familiar, this PR is to delete some jobs that are on the COS tab. These jobs have been red for a while, and nobody looks at them. I asked internally on the COS team whether they are using them, and they mentioned that they don't look at these either. So...
A: This tab is called "cos" — I mean, the name of the operating system — and when I looked at this PR, it didn't look COS-specific; it doesn't look operating-system-specific. For instance, this test is a soak test, meaning the VMs are never deleted after the test executes. Unfortunately it's all red right now; it used to be green before my parental leave.
A: Strange — it was green, green, red, green, green, green, red. The red runs were when the VM ran out of memory or out of hard drive, and then everything crashed; testing it again creates a new machine, so yeah. That is a very interesting test to keep, and I think that was my first suggestion. I just quickly glanced through, and I think what Mike did was move this tab somewhere, but then, when I did another review, I came back to Ryan's point.
A: Was this test important? I noticed that this one is flaky, and the flaky tab is, I think, the only tab that specifically queries for flaky tests. The idea with the flaky tab was always that if tests start being flaky, and we don't want to make our whole TestGrid red, we mark the specific test as flaky and move it into a separate tab. Then you look at that tab, clean it up, and un-flake the tests.
A: So that was the process created before. I looked at the other tabs and didn't find any other flaky tabs, so I think it's the only flaky tab that we have. Thank you.
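[A minimal sketch of the quarantine flow described above, assuming the usual Kubernetes e2e convention that "flaky" is just a tag in the ginkgo test name: the regular jobs skip it via --ginkgo.skip='\[Flaky\]', while the flaky tab's job selects it via --ginkgo.focus='\[Flaky\]'. The test name below is illustrative, not the real suite.]

    package e2enode

    import "github.com/onsi/ginkgo/v2"

    // Tagging a test [Flaky] moves it out of the main tabs (whose jobs
    // skip the tag) and into the flaky tab (whose job focuses on it).
    var _ = ginkgo.Describe("[sig-node] Eviction [Flaky]", func() {
    	ginkgo.It("should evict pods under memory pressure", func() {
    		// body elided; un-flaking means fixing the test and deleting
    		// the [Flaky] tag so it returns to the main tabs
    	})
    })
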
A: Yeah, because of Ingress — I think that one is very specific to Ingress tests: it runs Ingress tests, and I wonder if we run them in other places. If we do, then fine, let's remove it. But I think it has a very specific configuration for the Ingress tests and runs them in that specific configuration, so you may lose a signal. I mean, it's a red signal right now, but we might lose it nevertheless.
A: For this one, maybe we need to involve SIG Network — we have a networking tab, with the SIG Network things, somewhere in a different place — and then they can decide whether they need it. I don't think we want to just blindly move it. And reboot is another thing: I just don't know if we run a reboot test anywhere else. Reboot is such a disruptive thing that I would assume —
A: — we disabled it everywhere. But it's interesting: I don't remember any other test that would intentionally reboot the machine. So maybe that's the only one, and if it is, then we need to decide whether we want to keep it around.
A: Ryan, to your point about the importance of a test: I was thinking about things like reboot, or something very specific like Serial on Fedora, or memory swap on Fedora — would you say they're critical enough that we need to care, or do you think they're kind of a secondary tab?
D: Maybe — I guess my point was that if we were going to deprecate the COS tests, then we should remove them from the, you know, build tree, remove them from the document, and just prioritize which ones we want to look at first.
F: I don't have the full context on this, but when I saw the issue, I think it was specified that they had been failing at least for as long as TestGrid keeps track. That's why I thought that maybe they had been failing since the beginning of time and don't really offer any value — there's no point in keeping red tabs if they have always been red. But if at some point they were passing, then that changes everything. But yeah.
A: And I'm asking you because you brought up this point about important tests and less important tests, and I get this all the time from inside Google as well: "let's just consider the important tests." I'm trying to come up with a dimension for that, and I cannot. Is an alpha feature not important? I mean, they're about to become beta, right? They are important.
A: Yeah, and some functionality we may not use at Google, but it's still important for open source, so we still need to invest in it. I just cannot come up with a dimension — I don't think we run any unneeded tests, and that bothers me. Even the performance tests that, Brian, you're helping fix — I think we need them, because we don't run any stress besides this performance suite on the kubelet, and without it we wouldn't catch any real degradation in how Kubernetes works.
A: Yeah, but I welcome any idea of how we can categorize tests so that this question disappears — either make people realize that all tests are important, or actually categorize the tests.
A: Okay, sorry for my rant. I've been trying to get this sorted out for some time, and I just cannot come up with any plan — any categorization that is easily explainable.
A: Yeah, I mean, we run some tests a little too often — some every four hours that could run once a day — but besides that, I think all the test runs are quite needed, and the fact that a signal is red is not an indication that we don't care. It's an indication that —
A: — something is broken, and we don't have enough people to look at it. Yeah, right. And again, I think the knowledge that everything is green lives very much in the heads of a few people. People who attend this meeting know that everything is pretty much stable: for the failures, we know this one is infrastructure, this one was failing a lot but we know it's working. It's the kind of tribal knowledge we carry from the early days, but I hope we —
A: — can eliminate this human factor from the equation.
A: Okay, let's move on to triage, if you don't mind. We have —
A: Okay, the one we just looked at.
A: Maybe it's done now — the image validation on COS is waiting.
A: Yeah, there's a very interesting tab there — I really enjoy what they're doing with the jobs these days. I would definitely look myself, but if you're interested, take them as well.
A: Okay, so I think this person, Shingo, came to this meeting last week and can give us an overview of some functionality that is not working as expected.
A: Yes, okay, so the new image here — there's a base image, or the BusyBox ones. Okay.
E: I think I already — yeah, I think I already moved it through triage, accepted it, and reviewed it.
A: Okay, and we just looked at — perfect, nothing else. Okay. I think last time on these issues we looked at the tasks assigned to everybody, and everybody had tasks, so I don't think we need to do that this time, unless people want to discuss some items. Any takers? Okay, let's go into bugs really quick, and then — okay, let's do what you suggested. Seven, plus this — I think six bugs triaged.
A: It looks like an existing limitation, but I cannot say for sure without digging into the details.
E: It says 1.21. Do you think — you know, maybe we can ask them if it's still happening.
B: I think typically, with the CPU manager, you're supposed to remove the checkpoint file for the CPU manager before you restart the kubelet, I believe. Maybe that's what you're supposed to do here as well. But I think it makes sense — let's just ask them first to check whether this is the behavior they get on the latest version, and then we can look into it.
A: Oh, I see — it's not about the image. It's about the pause container, dockershim.
A: Yeah, I know that similar behavior exists on containerd — we have some known issues with the sandbox being stuck sometimes. I doubt anyone will ever touch it for 1.23.
A: It is — I know it sometimes happens. If somebody wants to look at dockershim, please be my guest. It may be a little hard from an archaeology perspective to go there, and the code base is not on master anymore.
A: Okay, this is another time issue. How far in the past?
A: I remember about that one: the node was still in good shape; it just couldn't schedule any more pods, because it thought it already had a newer pod around, or something like that. So —
A: Time skew happens, and typically, if you have a time server around, it will update pretty fast. I think most of the bugs reported about time skew were from edge devices — some telecoms are running a node on a low-power device, and those may have very, very high time skew.
A: So we don't support those devices very well, and I think this would be the smallest problem: if a node becomes unknown, you can at least just restart the node and be done with it. So it should be fine, but yeah, I think we have more serious issues with timing. How much skew are we talking about here? If it can be represented in a few seconds, we may need to look into that.
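[A minimal sketch of a coarse skew check along the lines discussed — compare the local clock with a trusted server's Date header. Second-level resolution is enough to flag the multi-second skew seen on low-power edge nodes; the URL is an arbitrary assumption.]

    package main

    import (
    	"fmt"
    	"net/http"
    	"time"
    )

    func main() {
    	// Ask any trusted HTTPS endpoint for its Date header and compare
    	// it with the local clock; resolution is about one second.
    	resp, err := http.Head("https://www.google.com")
    	if err != nil {
    		panic(err)
    	}
    	resp.Body.Close()
    	remote, err := http.ParseTime(resp.Header.Get("Date"))
    	if err != nil {
    		panic(err)
    	}
    	fmt.Printf("approximate clock skew: %v\n", time.Since(remote).Round(time.Second))
    }
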
A: Yeah, I think this is being looked at.
A: Yeah, I think with the promotion of CRI stats — do we do it this release? It's the next release. But if we're about to promote it to the next stage, it will be interesting to fix it.
A: Okay, thank you. Yeah, I think we're done with the bug triage, and we have 11 —