Kubernetes SIG Node, 30 Jun 2021

From YouTube: Kubernetes SIG Node 20210630

Description

Meeting Agenda:

https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU

A

Hello, everybody, and welcome to today's edition of the node CI subgroup and triage session. It is Wednesday, June 30th, 2021, and we have one agenda item from Francesco, and I suspect more agenda items will trickle in. Do we have Francesco?

C

No, he said he cannot join today.

B

That's here he just.

C

He added items under the agenda, and from those you can see the progress.

A

Yeah, I've been reviewing his PRs. I know that there's a bunch of different problems that we have on the node serial tests, so that one looks like it got approved, and then here's another one helpfully fixing this test, and I think I looked at this. This one was funny. I was like, there's a sleep in this random bash command, what's going on there? Looks like the job is still failing, but I guess there's probably more asynchronous work to be done here.

A

But I'm very glad that we're making progress on this so yeah.

C

As I understood from him, we probably have only one test that is currently failing persistently under the serial job. So that's awesome; it's very close to passing again.

D

The image download.

C

Yeah, it was one for NFD, I think, node feature discovery.

E

This one looks like.

A

And I guess, as far as the board goes, it looks like we have a new PR, but other than that, what do we want to take a look at today? We have a bunch of flakes. Let me take a look at kubernetes-dev.

A

So the release team has been sending out signal reports. I usually look at them, so you can see it's red, and we had a few SIG Node things in here that are, I guess, marked as in progress. Luckily, no new things, but there are, I think, five in here, and these should probably be.

A

I don't know if the issues currently have priorities on them, but if not, we might want to mark these. We've got five things, and I think they're going to start marking these as flaky and excluding them if we don't fix them; that's the release team's lever. So let's see, we've got the startup probe one.

A

I have not even seen this one. It doesn't look like it's assigned to anyone; well, assigned to Francesco, apparently. So I guess we've got this one, which I'm unfamiliar with. This one looks like Matthias is on it. This one is me; I haven't looked at it. And Artyom. So do you want to go through them? Do you want to start with this one?

C

Yeah, probably yes. Again, sometimes in the stats that we get from cAdvisor, for some reason we get zero for some specific field, like CPU.

C

Something like that. We had a similar issue under the same test, and the fix was just to assume that zero is a normal value for it. But probably it's not.

A

I agree that it's probably not normal, and I think this is catching an actual symptom, because I know we've had some bugs in OpenShift reported along these lines, where cAdvisor just isn't returning stats sometimes, and I think that Harshal had a fix for that. I remember there being an OpenShift bug. David Porter is the cAdvisor maintainer; should we cc him?

A

This has been flaking on master for a long time, and I know I've seen this in, certainly, the latest OpenShift. I think we've had complaints from customers that metrics just disappear, and because most of the software treats zeros as actual zeros, they're like, "There are these drops in my graphs, and this doesn't make any sense." So I think that this one probably should get some eyes.
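
As an aside on the zero-valued stats discussed here: a minimal sketch, in Python, of how a consumer might tell a dropped sample from a real zero. It assumes the field is a cumulative counter (for example, total CPU time used), which normally only stays flat or grows while a container runs, so a zero that follows a non-zero reading is treated as missing rather than plotted. The function name and the sample data are hypothetical and are not taken from cAdvisor or from the test under discussion.

```python
from typing import List, Optional


def mask_suspect_zeros(samples: List[int]) -> List[Optional[int]]:
    """Replace zeros that follow a non-zero reading with None.

    Assumption: `samples` are readings of a cumulative counter, so a zero
    after a positive value indicates a missing sample, not real usage.
    """
    cleaned: List[Optional[int]] = []
    seen_positive = False
    for value in samples:
        if value == 0 and seen_positive:
            cleaned.append(None)  # treat as "no data" instead of a real zero
        else:
            cleaned.append(value)
            seen_positive = seen_positive or value > 0
    return cleaned


# Hypothetical cumulative CPU readings with two dropped samples.
readings = [0, 120, 250, 0, 410, 0, 600]
print(mask_suspect_zeros(readings))
```

A container restart legitimately resets such counters, so a real check would also need to account for that case.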

C

From my side, I can continue to dig into cAdvisor, but I'm not sure how much time it will take.

A

Yeah, let me just bump this one to the top, to put some sort of priority order on these things. I don't know that this one is at the point where it's critical-urgent, but I know that this is constantly an annoyance. I see it flake on my PRs, and I think that there's an actual underlying behavioral bug, so we may be able to link some of those if we go through a bunch of other bugs. But yeah, I think this is.

A

We should just try to keep making progress on this. So thanks for taking a look at it, and then hopefully we can pull in David Porter or Harshal or other folks. Let me take a note for myself to make sure that I pull in some OpenShift bugs.

A

What is this URL? I'll try to remember to do that, just so we have them all in one place. Okay, so I think that takes care of that one. And then this one is apparently assigned to me; I haven't had a chance to look at it.

A

I think I looked at this and there were like not a lot of flakes. It was like very minor.

A

So, let's see. Oh, that looks maybe pretty frequent.

A

So it looks like it's flaking a lot in ci, so.

A

Oh sorry, apparently my internet was sad.

A

I don't know how much of that came through, but initially, when I looked at this, it was not flaking very often. It appears to be flaking a lot more often now, and I haven't actually done a deep dive into this. Should I remove myself? Does anybody want to take a deep dive into this?

C

I saw it's probably mostly failing under the Windows lane, so is it on Windows? I think yes, I saw Windows somewhere.

A

Oh, yeah, you're right. This is failing on Windows. That's a great point. Okay, let me add that.

D

Should you ask Mark to take a look at this?

A

Well, so Mark, I think, might be on leave still, so I'll paste this in.

D

The name of somebody who can work with.

A

Yeah, technically it's a SIG Node test, but because it appears to only be failing on Windows, I think that we should just assign it to SIG Windows.

A

And I will send them a note. Save.

A

Okay, I have poked them, so hopefully they'll take a look. Oh, that makes that easy. Oh, I guess I should remove it from our board.

A

I can call it done.

A

I'm not gonna delete.

D

It or anything I'm just.

A

Moving it into done, because I don't think there's anything else for us to do. If they want to send it back, then so be it. Okay: "Pods should run through the lifecycle of Pods and PodStatus".

F

Yeah, this one does not seem to happen so often since one of the commits that I included in the comments, so just have a look.

A

Oh, that looks pretty regular.

F

Yeah, like a metronome, but yeah, strange.

A

Oh, when was that commit? Because it looks like it stopped happening for at least this job, but there's a bunch of other failures that are still happening.

A

Given the regularity of this, it looks like it's failing on almost every single one of these runs. I wonder if we can get, where's the TestGrid link? Does it have a TestGrid link? CI kubelet node? I don't even know what this job is.

A

Is this a TestGrid link? No, but I might be able to click on one of these and get to TestGrid.

A

There we go. Yeah, this isn't a flake. This is a failing test. Let's see.

F

Yeah, maybe it's not included in any other project.

A

And it's only happening on node kubelet orphans, and it's a conformance test, which concerns me a little bit. I don't know what this job is. So here, let's update this: remove kind/flake, add kind/failing-test.

F

Yeah, maybe that's the thing: I didn't run the right config for kind. So maybe that's why I was not able to reproduce it locally.

F

I tried really hard, like I.

A

I trust you for.

F

One day, and, you know.

A

I certainly trust you. I'm going to up the priority, because this is a failing test and we're getting pings from SIG Release to look at this.

F

Yeah yeah yeah yeah.

A

Soon, because I don't know if this one is release-blocking, but it might be release-informing.

C

Yeah, probably it makes sense to check the Prow configuration for this job; maybe it's just some memory or CPU stuff. I don't know.

F

Yeah, okay. That's one thing I didn't look at: I looked only at the frequency, not at whether it was always the same job. So I will have a look tomorrow.

A

Perfect. Okay, I have the next steps on here, so I think we should be good to go, thanks. Okay, let's look at this one: "Pods should support pod readiness gates". I don't know anything about this feature, and apparently it is flaking, and here's a TestGrid link. Oh, that did not work.

A

sig-release-1.14-blocking? That seems very old. Oh, I guess this is from, what, 2019, so that makes sense.

A

This is super, super old. Looks like I have not seen this failure at all, so maybe let's check it in triage. Not loading. Close these.

A

Tests. What's the failing test? Oh, "timed out waiting for the condition".

A

This thinks it's network and not us, but it's a SIG Node thing. Oh, it probably thinks it's network because it's timing out.

A

Okay, so we actually have it, and it's failing on CRI-O and containerd. So let me add this more recent triage.

A

Okay, hopefully that will be enough for Francesco to work with. Let me remove Derek and Dawn as.

A

And I think that's priority/important-longterm. I probably should make this priority/important-soon.

A

Okay, last one. I know this is the worst. Matthias, what's.

F

Going on with this? For this one, we have a plan, and we're still waiting for one commit to land.

A

Okay, where is the, what's the PR that we're waiting on? Oh, it's Clayton's pod lifecycle one. Yeah, okay, that's fine. I think we can just keep waiting while we wait for that to land. I think we're currently waiting on [unclear] to give that the LGTM.

A

So, okay, no update for that one. That sounds good to me. We've looked at all of our tests. We've done our duty. So is there anything else on the board we want to go through today?

A

Let me try to also maybe pull those up yeah.

D

If you filter by no assignee, there are a few PRs that need a reviewer.

A

If I do what?

A

No assignee, sorry, your audio is breaking up a little bit.

D

And it's a mask yeah. There are.

D

Needs somebody.

A

Okay, let's take a look and make sure we assign some folks. "Ensure images are pulled after eviction tests": who wants to review this?

E

I can review.

A

There's no, sorry, I'm just looking at this. There's no issue attached to this.

D

Yeah, it's similar to the serial tests.

G

Claims that this will.

D

Make uh that's better.

A

Okay, then, probably a kind/failing-test and.

A

Oh, I should probably look at this one.

D

Let's uh merge, commit and still cool like.

A

Yeah, this one, I guess, is waiting on the author.

D

Yeah, in the comments it looks ready. I just glanced over it and it looks ready to review, but I don't want to rush things, and I.

A

Need to maybe check out what's going on with all of Clayton's stuff today, so I will try to chase him.

A

He answers my Slack messages, amazing. So, let's see: "e2e node: fix the device plugin test". This is a Francesco PR.

A

This one I should probably review, because I've been looking at a lot of, yeah, I already did a review on this one. So let me just assign myself.

E

Assign me as well. I'd like to take a look.

A

And then "e2e node: memory manager: automatic kubelet restart".

C

Yeah, it's mine.

A

What priority is this? Is there an issue that this fixes?

C

We still have some races. We don't have a lot of them, but we still have some. Because of the recent introduction of the new test context flag to disable the automatic restart of the kubelet, I just use it.

A

I'll throw priority/important-longterm on here. And who is the right person to review this?

C

Probably I can ask Francesco to review it; he should know. But again, it's a pretty small one, so anyone is welcome.

A

Anyone welcome. Anyone want this one?

D

I can do it.

A

Cool, okay. I think we've got folks assigned to all the things that we need people assigned to now. So that sounds good to me.

D

Yeah, I think one thing from bug triage: the bug scrub was great, because we found so many old issues where we need to have test coverage, and we have these issues on the board now. So whenever you have a chance, take a look. They weren't properly attributed; that's why we didn't find them before. The bug scrub really helped.

A

Are they in the to-do list here?

D

Yeah, okay, so one.

A

Of them is very interesting.

D

Adding the soak test; the soak test should be quite interesting. I think somebody already commented there that they want to take it, but they didn't know how to start, so I gave some pointers.

A

Great. Do we have anything else for CI-related stuff? Because, if not, I would love to talk more about the bug scrub, because I think that it has unlocked and/or created some work for the triage second half of the meeting.

A

Okay, hearing nothing more on CI. So basically, we've now gone and scrubbed all the bugs, which is great. We started with like 450 bugs and we closed 130 of them. So we still have an absolutely outrageous number of bugs, but they're almost getting into manageable numbers. So a thing that I think we should consider doing, and I'm going to start.

A

I didn't have time quite yet, but I'd like to start with some proofs of concept, is to create some node-specific bug management boards: one for features and everything else, and one just for bugs, and ensure that we're looking at that stuff weekly. We're not doing that right now, obviously. And I'm not entirely sure yet how we want to categorize them, because they don't necessarily have the same workflow as we have for PRs, where you initially send the PR, then you assign a reviewer, then it gets reviewed, then you have an approver, and so on. With issues, I think there are probably going to be more states, like needs to be triaged.

A

We're waiting on information; we've confirmed the bug and it's in progress; that kind of thing. But I haven't really done any proofs of concept for that yet. So I'm going to try to make some boards and play around with it, and I think we probably want to start with bugs and then move into our feature backlog, which is quite big. Is anybody interested in working with me on that?
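
The issue workflow described here could be sketched as a small mapping from labels to board columns. This is only an illustration in Python: the column names paraphrase the states mentioned in the meeting, and the label names and rules are assumptions about what a board automation might key on, not an existing tool or an agreed process.

```python
from enum import Enum
from typing import Set


class Column(Enum):
    NEEDS_TRIAGE = "Needs triage"
    WAITING_ON_INFO = "Waiting on information"
    CONFIRMED = "Confirmed"
    IN_PROGRESS = "In progress"


def column_for(labels: Set[str], has_assignee: bool) -> Column:
    """Pick a board column for an issue from its labels (hypothetical rules)."""
    if "triage/needs-information" in labels:
        return Column.WAITING_ON_INFO
    if "triage/accepted" in labels:
        # A confirmed bug that someone has picked up counts as in progress.
        return Column.IN_PROGRESS if has_assignee else Column.CONFIRMED
    return Column.NEEDS_TRIAGE


print(column_for({"kind/bug", "triage/accepted"}, has_assignee=True).value)
```

Keeping the rules in one function makes it easy to change the categories once the proof-of-concept boards show which states are actually useful.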

A

Or do people have feelings or ideas about adding that as part of this meeting? Because I know that one of the things we're not currently doing is getting together as a group and looking at all of the incoming bugs.

D

So yeah I'm up for helping you out in that and I think it's definitely needed.

A

Great okay, so.

D

I believe we need to keep the number of bugs manageable. Almost a year ago, I suggested doing that, and the feedback from the community was that we were not ready to take on new work.

D

We're now in better shape.

A

Yeah, that's why I wanted to do that scrub: I figured at least after we do that.

A

Then things are mostly up to date in terms of labeling, and so we should be able to pull them onto boards and not have to look at every single thing while we're doing it. And I know, unfortunately, we still don't have a lot of automation for auto-categorizing, you know, which columns on which boards things need to go to, but I think at least we're in the sort of state where we can start looking at incoming stuff every week, and that will be, I think, a big improvement too.

A

So I think having a bug board plus an everything-else.

A

Board and we need to figure out the.

A

But yeah, I'm looking forward to doing that. Hopefully moving forward.

A

Cool. Okay, other subjects? Anything else on bug scrub follow-up? I'm hoping to do that soon. It's really painful without automation, I've got to say. It's like, yes, I would love to click and drag 200 issues onto a board. That sounds like a great time.

F

Do we know which tools we could use for automation?

A

Unfortunately, there aren't any. So GitHub is currently announcing a bunch of stuff happening with their triage boards and whatnot, but as far as I can tell, there are a lot of new features that they're adding, but none of it solves the problems that we need solved. I've been talking to contribex about it. There exist some bots out there, so we might end up asking contribex to set up a bot for us, but it's going to be a lot of manual configuration and work, and we'd have to keep the bot running.

A

So we'll see what happens, but I think at some point that may be what we need to do in order to maintain steady state, so we're not staring at.

A

Other stuff.

A

Oh, where are we at with, who's working on the node conformance stuff? Do we have anyone assigned to that?

D

I do, but with the bug scrub going on I lost track of it.

A

Okay, uh that's that's.

A

And then, similarly for the node feature stuff, is that also in the same boat?

D

So Mike is making progress. There is a PR from Mike.

D

That problem, so it's multiple stages. Ah, here it is.

G

Sure. So this is the first part of the work in progress to remove these tags from the tests. First, I created the first PR, which basically duplicates the tags with the new syntax, and eventually we'll remove the old tags. Right now, this PR is safe because it doesn't modify any behavior, and there are no tests that have the new tag yet, so it shouldn't modify any pipelines in the wrong way.

A

Awesome. So how does this work exactly? We're adding the thing that just marks things as a node feature?

G

Yeah, if you look, the tag is very similar, with the main difference that it doesn't contain the colon and then the rest of the description. The idea is to have these tags.
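
A rough sketch of the first stage described here, in Python: for each test name that carries a tag with a colon and a description, add the bare tag alongside it so that both spellings match. The regular expression, the function name, and the example test name are hypothetical; the actual PR may do this differently.

```python
import re

# Matches tags such as [NodeFeature:Eviction]. The three prefixes are assumed
# from the names mentioned in the discussion; the pattern is an illustration.
TAG_WITH_DETAIL = re.compile(
    r"\[(NodeFeature|NodeAlphaFeature|NodeSpecialFeature):[^\]]+\]"
)


def add_bare_tags(test_name: str) -> str:
    """Append the colon-free form of each tag unless it is already present."""
    result = test_name
    for match in TAG_WITH_DETAIL.finditer(test_name):
        bare = f"[{match.group(1)}]"
        if bare not in result:
            result += " " + bare
    return result


print(add_bare_tags("Eviction test [Serial] [NodeFeature:Eviction]"))
```

Because the original tag is kept, anything that still selects on the old spelling keeps matching, which is the sense in which such a change does not modify behavior.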

A

And then, I see, yeah. I see what Aaron is saying here; that makes sense to me. And I think, yeah, having it as NodeFeature versus, did we say that we just want to call it NodeFeature instead of NodeAlphaFeature?

G

Oh, we're using the three of them.

A

Oh, okay. Do we even want NodeSpecialFeature? I don't even know how that's distinct from anything else. Where is that, what job is this?

A

Oh, that's, the weird job that has the failing test that nothing else is failing on. Maybe that's why.

D

About this, we needed to take it in stages, and yeah.

A

It's also under the node serial job, okay. So yeah, I think this is great. Here, let me LGTM this. Yay, thanks for making progress on this.

A

I say that having not looked at every line, but I trust that Sergey did. We need to do this, so woohoo, looks good. Awesome, I'm very glad that that's in progress then.

D

So there's one topic about soak tests. I don't know if you saw the discussion; there is an issue that we found during the bug scrub. Somebody suggested doing soak tests to detect kubelet leakages or breakages.

D

Yeah, the test is easy. The only problem is how to detect the failures and leakages. So one idea for me is to use NPD to detect the problems, because NPD can find when the kubelet was restarted or such, and it has some metric collection stuff. But I don't know how easy it is to use during the test. I wonder if anybody has any other ideas for how to detect crashes and leakages.

A

So for crashes, I.

D

To do it without building a huge infrastructure.

A

Yeah, so for crashes, I'm not entirely sure. For leakage, a problem that we have right now is that the node tests aren't, or I should say, the kubelet's not being scalability tested. Is there anything in here about scale? Let me see if I can find this in the scalability repo.

A

They have a repo, let's go to their charter and find.

A

I think it's this.

D

Tests, yeah. I know we closed one of the issues here for the kubelet. You can search for.

A

This one, yeah, this is the item.

D

Yeah, there's a link. At the moment I checked it, it was like 15 days ago, and it didn't have much information. Maybe it does now.

A

Who added this?

A

Oh yeah? This definitely looks new.

D

So I think it only runs on my oh.

A

Yeah, it looks like 1.20 versus.

A

So that's super interesting.

A

This is awesome, oh my god, I'm so excited. So you think this only works on 1.21?

A

Should we pester them. I see.

D

And I don't know what this drop represents; maybe it's a latency thing, but I think.

A

No, it looks like it's.

A

Oh, that's scheduler.

A

Here, let me go back to the master job.

A

Where is the kubelet?

A

Ah, look at that. I wonder if that was the, what date is this? 6/18? I wonder.

A

The runc regression fix. Yeah, that's wild. This is great.

A

So, let's see, I guess it's probably.

D

Sure, because I'm pretty sure it was added after the regression was fixed.

A

Well, apparently there was something else. Yeah, so it looks like 1.21 has it; 1.20, I think, does not. So should we ask them if they want to add it for older branches, or is the assumption at this point that it won't matter, because we're hopefully not adding scale regressions in backports? But who knows.

A

I guess, let me add this to the.

D

I think my point is that this is not a soak test; it just runs for a while and aggregates over the test run.

A

Okay, well, I'm so happy to see that. Looks like we now have quite a long agenda compared to what we started with. Do you have anything else for today? I don't think I have anything.

D

My question still stands: does anybody know any way to detect leakages? Anything people have seen in other tests that we can reuse? Otherwise, I can keep digging in the NPD direction.

A

I don't know of anything specifically. In general, when I've seen leaks debugged, it's all been manual, using the stuff that comes off the pprof endpoint for memory utilization.
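
One way the manual inspection described here could be approximated in a test is to sample memory usage periodically and check for a steady upward trend. The sketch below, in Python, fits a least-squares slope to the samples. The threshold, the function names, and the sample values are all hypothetical, and how the samples are collected (for example, from a profiling endpoint) is left out.

```python
from typing import Sequence


def growth_per_sample(values: Sequence[float]) -> float:
    """Least-squares slope of values against their sample index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def looks_like_leak(values: Sequence[float], max_growth: float) -> bool:
    """Flag a run whose average growth per sample exceeds max_growth."""
    return growth_per_sample(values) > max_growth


# Hypothetical memory samples in MiB taken at a fixed interval.
steady = [100, 104, 99, 102, 101, 100]
growing = [100, 110, 121, 130, 142, 150]
print(looks_like_leak(steady, max_growth=2.0), looks_like_leak(growing, max_growth=2.0))
```

A trend check like this only hints at a leak; a long enough soak run and a sensible threshold would still have to be chosen per workload.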

A

Cool okay, anything else.

A

Great I'll stop, sharing and stop recording.