From YouTube: Kubernetes SIG Node 20210519
Description
Meeting Agenda: https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A: So hello, it's May 19, 2021, and this is the CI subgroup of SIG Node. We just discussed the meeting time, and it feels like we're okay with the time; let's see how it works for Asia contributors.
A: Yeah, I think I only have one follow-up from the last meeting. We have some internal dashboards on code coverage at Google, and I was looking inside, trying to understand how people would check code coverage improvements. This is for the "good first issue" kind of issues that we discussed last time. But then I found this dashboard in open source: we also have a code coverage dashboard. Oh, I'm not sure. Let me share.
A: Okay, yeah, so apparently there is a code coverage dashboard, and we can see improvements here. I've already filtered on the kubelet, and it doesn't seem too bad; the numbers are reasonably high.
A: I found a couple of examples. If you look closely, I'm using 22.6 here: a fairly low number that may be worth investing in, and it could make a good first issue to contribute to. So I listed two examples that I found interesting, like runtime.go. It only has functions to convert one container ID to another container ID, or a string status to a pod status, things like that. So it's super trivial to cover, and it has very low coverage.
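Conversion helpers like the ones described are usually easy to cover with a small table-driven test. A minimal sketch only; `toPodPhase` and its mapping are hypothetical stand-ins, not the actual kubelet runtime.go code:

```go
package main

import "fmt"

// toPodPhase maps a container-runtime state string to a pod phase string.
// Hypothetical stand-in for the kind of conversion helpers in runtime.go.
func toPodPhase(state string) string {
	switch state {
	case "running":
		return "Running"
	case "exited":
		return "Succeeded"
	default:
		return "Unknown"
	}
}

func main() {
	// Table-driven check: each case is one line, so adding coverage is cheap.
	cases := []struct{ in, want string }{
		{"running", "Running"},
		{"exited", "Succeeded"},
		{"bogus", "Unknown"},
	}
	for _, c := range cases {
		if got := toPodPhase(c.in); got != c.want {
			panic(fmt.Sprintf("toPodPhase(%q) = %q, want %q", c.in, got, c.want))
		}
	}
	fmt.Println("all cases pass")
}
```

In a real PR this would live in a `_test.go` file using `testing.T`, and `go test -cover` would report the coverage change.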
A: It does have its own test file. And this one is the certificate bootstrap; it's just reading certificates from a file. It has some code coverage, so there are functions already written that set up everything, like writing the file and such. Once you have this setup complete, adding new test cases is typically not that complicated.
A: Okay, and then... yeah, that's what I thought.
B: I don't know if you've seen some of these, but, for example, when I did the structured logging migration, I had this giant list of files, and every time someone migrated one I would check off a file. There are similar ones for linting migrations and that kind of thing, with a big umbrella issue, and as things get completed we check off the boxes.
A: I see, that sounds cool, good. Yeah, I'm thinking: do we need to make an issue? We should probably declare some threshold for code coverage; maybe fifty percent would be a good threshold.
B: One of the things I'm a little bit concerned about, having run through a bunch of the kubelet code, is that there's probably stuff that has coverage right now in the unit tests, but it's effectively integration coverage. So it's possible that all of the code paths, or many of them, are being exercised, but not in an isolated, function-to-function way.
B: So it's hard to make targeted changes and see what happens. I'm thinking of kubelet_pods.go in particular, where there's some very basic phase-change-y stuff.
B: But there's not a lot for... for example, there's nothing for converting API statuses, which I have found frustrating when I've been poking around in there trying to make changes, because now do I have to go back and write all of this coverage for these things that are currently untested? I don't know.
C: Yeah, I just wanted to add that some parts of the code are difficult to write unit tests for, because you have all those signals coming from everywhere, and then, if you really want to exercise...
B: Well, really, I think the problem is that the code was not written in a way that was unit-testable, and now, in order to unit test it, we would either have to re-architect it or do a huge amount of work to add mocks and all that sort of thing. I mean, this is what happens when you inherit technical debt; it's not a great status quo.
B: It does make me wonder whether adding unit tests to the existing stuff is the right thing, or if we should look at some of that re-architecture. But sometimes you need the tests first, so, yeah, I don't know; there are open questions.
A: Yeah, so one thing about forcing the architecture: I remember we've been fixing how fast the tests run, and every time you don't use a fake clock, you have a problem, because you need to have timeouts. So that kind of forces you to use fake clocks, and it's a better architecture, because in production the code path is exactly the same. So it's not that dangerous. But yeah, I agree. Okay, I'll create a mega issue to try to consolidate things.
E: Sorry, hey, sorry, a process question. We understand that increasing the test coverage is important, but I already see this work falling into the backlog area, so I wonder, process-wise...
B: I think that's mostly... so it's true that anything that's improving test coverage would be considered a sort of backlog thing, because it's not release-blocking, right? It's important, it's good to have, but it's not release-blocking. In general I don't personally discriminate as a reviewer against stuff that's marked priority/backlog, but if I see an XXL backlog PR, I might leave that one in my triage backlog until later.
B: Part of this, I'm hoping that we can get reasonably sized things. One of the things that I tried to do for the structured logging migration was to really strongly encourage people to cap their PR sizes at no more than about 40 lines of migration, because otherwise it's just impossible to review. I would say that might be a useful thing to include in our umbrella issue: say, yes, we want this unit test coverage.
B: Yeah; unfortunately, though, it's really hard to tell if a test is flaky if you're only doing the one-PR thing, and then we have to go back and catch it later. There's also the risk that somebody goes and adds tests, and through those tests they document behavior that we didn't actually want but that happens to be the current behavior, and then somebody comes back later and says: well, this is canon.
B: It's been documented in the tests, and then it's really hard to fix. So I think that's why backlog is appropriate for this sort of thing, and if there are specific things that would potentially be blocking for some reason, we might want to set the priority higher. But I don't think this will be a problem, especially if we have a tracking issue where we say: yes, we want to make sure this is happening as part of this release cycle.
B: I don't think we'll have too much difficulty getting reviewer and approver eyes on it. I know Mrunal, for example, is approving backlog stuff all the time for all sorts of cleanup-y things. It's definitely true that the super big, drawn-out backlog stuff that needs a lot of reviewer brain power is not going to be as easy to get in, but I don't think we're necessarily at risk of that here.
A: Yeah, and I think the main driver for creating this is that our team brought up that some contributors, new contributors, wanted to have easy-to-start-with work items, so this is targeted at those. I don't think we will forcefully assign anything in this meeting; it's just a backlog.
A: So I don't have any other agenda items; if anybody else has one... otherwise we'll go triage the board. Anybody has something? Okay. Is Artem around? Artem, I think there was an item on you from last time, yeah.
B: Oh yeah, actually, as an agenda item: do we have a plan for how we're going to deal with the failing serial tests? That came up at SIG Node, and it sounds like it's a problem right now.
G: Yeah, it hits the state where it fails after the timeout; it runs around five hours, and that's not enough to finish all the tests. So probably someone needs to take the responsibility to run them locally and start testing the tests one by one to see where the issue is. Additional stuff that we can check: the timeouts on the tests, not the global timeout, but the timeouts on individual tests, because some tests take a large amount of time, and in total it fails because of the timeout.
G: So that's an additional action item that we can take, but yeah, someone should take it. I personally want to take it, but currently I don't have time, because I believe it will take one or two days to dig into it. I believe it's not an issue that we can fix at a glance, you know, because we already fixed such issues as the CPU manager problems, or the out-of-memory problems, but we still have a problem with the serial job.
B: Do we have any volunteers who would want to dig into this as a sort of multi-day project? I think what we'd want as an outcome of this is... we know that the tests are timing out and/or failing, and/or failing because of timeouts. So, I guess, picking up this issue, which will get filed, and then trying to determine: go through the list of everything there and see.
B: We do have a few new faces on the call. Would any of our newcomers be interested in some of this stuff? You're new to me? Possibly; I missed you last week.
H: I mean, I'm not sure how hard this is for anyone, but yeah, I can give it a shot as well, if that can help with anything here.
B: Yeah, I don't know what Matthias and DD have on their plates, but yeah. I...
C: So I volunteer too, to add some coverage actually, and also some end-to-end coverage as well; I think we discussed that last week with Savvy. Okay. I...
C: I think it's also important to pick people who already run end-to-end tests locally, because you need to set up some things in order to do this, and you need a reasonably powerful computer as well, one that is on all the time.
G: In general, it's impossible to run the e2e node tests locally without a GCE instance, stuff like this, and if I remember correctly...
A: Okay, then, thank you, everybody, for this.
A: Yeah, this is now just working. I think we need to assign it to Ben, just to include it in the presubmits, right?
D: Yeah, so my question: there are two things in the kubelet with which you can provide configuration, the component config and flags. From Cluster Lifecycle and kubeadm we have had feedback that the kubelet was the first one to adopt component config, and there are lots of flags that we have deprecated, and we are adding new flags. So what should our approach be here? Should we avoid adding new flags to the kubelet, or...?
B: Yeah, I think we want to avoid adding new flags to the kubelet; those should go into the kubelet config. This is probably a better topic for the wider SIG Node meeting, because...
D: Okay, then I will summarize things in a doc and then discuss it in those meetings.
A: Okay, I think we're done with this triage, and I suggest we switch to the PR triage. Whoever is not interested in the PR triage can drop off, and I will stop recording now.