From YouTube: Kubernetes SIG Node 20210708
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A: So yeah, it was a short week for us, because we had, like, Independence Day or something happening, plus we have code freeze tomorrow, so I would assume that many people were reviewing PRs and trying to merge big-ticket items into 1.22.
B: Yeah, the first two items are both me. Because it's code freeze today, I have an item in here for the stuff that the release team has been tracking for CI signal. We should probably make sure that we look at all four of those tests, which they are saying are potentially release-impacting.
B: The first item was a report on the node conformance tests, from bringing them up at the last SIG Architecture meeting. That was last week, and so we had an item on our backlog to discuss.
B: Basically: should we rename node conformance to CRI conformance? How do we define the scope? How can we better document this? Previously, from a Slack conversation, dims had said this is mostly about ensuring that the various different CRI implementations all work with node things. So I went to a SIG Architecture meeting, and because a lot of the conformance folks were there, I asked them: does this sound accurate? Is this reasonable?
B: Should we go ahead and rename them? A lot of the original people who worked on the node conformance tests were at this meeting, and they said: no, it is not merely a matter of CRI conformance.
B: So specifically, everybody agreed that we should probably take "conformance" out of the name of these tests, because conformance has a very specific meaning and having it in the name is confusing, but the suite is not merely limited to the CRI. I linked the old issue there, which was basically the original issue tracking when we added node conformance tests to begin with. One of the things discussed when we went through this was that, right now, running the conformance suites requires you to spin up an entire cluster, and there was a goal for the node conformance tests that we would basically be able to run conformance tests that don't require a full cluster, just a kubelet.
B
And
so,
given
that
context,
I
think
it's
a
little
bit
more
complicated
than
just
saying.
Oh,
you
know,
we've
got
these.
These
node
conformance
tests,
they
will
check
the
cri,
maybe
we
can
call
them
cri
conformance
or
something
like
that.
It
is
a
little
bit
more
complicated
than
that,
and
so
I
think
basically,
we
need
to
sit
down
as
a
sig
and
discuss
what
we
want.
A: Kind of for when you don't want to bring up an entire cluster and still want to test what's happening.
B: Yeah, and I know that the conformance name has been very confusing, because from what I can tell, it's certainly not conformance as in the Kubernetes Conformance project; that is really only focused on user-facing APIs, and a lot of the kubelet stuff is not that. The kubelet conformance tests are basically checking behaviors on a kubelet, which may or may not have anything to do with outward-facing API things. So.
A: Yeah, one thing about node conformance: the node conformance tests are run as pre-submits. That is one of the reasons to have node conformance. It's a basic set of tests that you want to have be stable.
B
Yeah,
so
I
think
that
we
need
to
think
of
probably
a
better
name
for
node
conformance,
and
I
know
that
we're
also
right
now
sort
of
going
through
all
of
our
weirdly
named
node
specific
test
selectors.
So
I
think,
like
now
is
probably
the
time
when
we
go
through.
We
say:
okay,
is
this
test
in
or
out.
B: There were not; I mostly just got a bunch of historical context to take back, which was different from what I had initially heard on Slack from dims. I think we still have the item going forward that we need to improve the documentation around this, but I think the reason the documentation hasn't been improved is that nobody is really in agreement on what node conformance means right now. So.
A: From what I understand, most of the tests are marked as NodeConformance: most of the primary tests, like the probe tests or the config tests.
A: Okay, I suggest an action item. I think it's still on me; the item is on me, so I will put something down and we can discuss.
B: Sounds good, yeah. I think that's good, and mostly I just wanted to bring you more information, because I think we were working on incomplete information before. So now I think this has gone from "oh, this is an easy thing, we just need to change the name and update a few docs" to "this is a bit of a refactoring project," and I think it aligns well with some of the work that we're already doing.
B: Yeah, and for the next item, I don't think you need me for any of these. I think these are the four flaky tests that are being tracked right now for CI signal, and it's just about making sure that we get an update on all of them. I think the test freeze is next Wednesday, so we've still got a little bit more time, even after code freeze today, to keep looking at these.
B: The reason I ask is because that was what I did when I put together the probe tests, for the liveness probes: I basically ran a loop for an hour, did the math on the things that were potentially flaking, and picked the threshold such that the tests still demonstrated correctness but basically never flaked. Even in the worst-case scenario, if things raced in a way that made it the slowest possible run, it still demonstrated the correct behavior.
B: But the test wasn't flaking because things were too slow. So if there's no risk of a correctness issue, I would say yeah, increase the thresholds. We may have just not written the test in a very good way, which is fine.
D: I still think that if the thing from clayton merges, it will reduce the time to propagate the status, but yeah. I suggest we just...
B: Do both. And I mean, I hope that clayton's thing merges; I don't know what the definitive answer is on that, just because there are so many bugs like the ones that change is going to address. But I think it's also probably a good thing to just increase the threshold, because some weird race condition could still happen where it gets really slow this one time and the test fails.
A: Yeah, I wonder what that means for the product. If we cannot write a test that ensures that once the startup probe succeeds the container is ready, is it actually not ready in many cases in production? Or is it just that our test environment is so bad and so slow that no production environment will ever experience this problem?
D: The thing is, locally, with the current master, I cannot make it fail, so I just have to rely on CI and then try to understand what's going on, because locally on kind it never fails.
D: Yeah, I think it's the propagation of the status: you have the container statuses and then the pod status, and then we propagate everything. So my guess is that maybe we run many tests in parallel and the node gets, like, overloaded or something, and it takes more time to propagate.
B: I haven't looked at this one specifically, but I know with the probe terminationGracePeriodSeconds tests, sometimes it's just a matter of timing. For the startup probe...
B: I think it's a little bit different, since the startup probe is only really probing at the beginning of the container's life cycle. But at least for terminationGracePeriodSeconds, it was basically possible, if you timed it really poorly in terms of the interval, depending on when the jitter hit, plus basically another random startup delay for the worker, that when the first failure hit it could sometimes take up to roughly 2x the probe period to fail, or something like that, given the threshold.
B: Okay, he gave a bunch of updates. I don't know; if he's busy, we might want to reassign that one.
A: Sorry, yeah. Does anybody else want to take a look?
D: You can see it better if you ignore whitespace, because I have, like, a bunch of code that was just put inside the... yeah, in the...
C: Yeah, I think lana tried to ping someone (if you check below), but we still don't have any answer. So again, from my perspective it looks like some cAdvisor problem, because currently our CI is running with cAdvisor, not, like, the runtime stats.
A: I think david.
A: Okay, I will ping david in case he has some ideas. And finally, on the board: I already triaged a few items and just added them to the board.
B: So that was a failing test, but it doesn't affect any of the pre-submits.
B: Yeah, oh, apparently it needs a rebase already. I submitted this one yesterday; mostly I was just looking to add periodic jobs for the kubelet swap stuff now that it's merged.
A: So yeah, I hope... does anybody else want to take a look? I think it's... all right.
B: Basically, I just renamed the jobs to all be consistently named.
A: Yeah, I'll take both as well, but I don't really want that one; maybe someone else can take it. It's all node, and that one's for ephemeral containers.
A: Let's just quickly take a look at what needs an approver. Yeah, zero still, okay; it seems this one just lost its lgtm.
A: This one is interesting: parker is following up on some pull requests I reviewed yesterday. In the past it used to wait for events and expect no events... where is it? Oh yeah, this one: wait for events and expect no events, and then check for both being...
A
So
it's
changed
to
wait
for
events
and
now
event
supposed
to
come,
but
the
problem
is
that
we
have
an
issue
that
saying
that
events
are
highly
unreliable
on
tests
and
there
are
many
cases
when
events
just
wouldn't
show
up.
So
that's
not
supposed
to
fully
rely
on
events
ever
and
that's
a
comment
I
made
and
yeah.
A: Yeah, I don't know whether we want to take it and see whether it flakes, but I think we already know that many tests that rely on events will eventually start breaking, just because events are highly unreliable.
D: Is it the case in, like, real-world clusters that we lose events?
A: Yeah, in some high-throughput systems, when we have a lot of pods and many events, a whole bunch of throttling kicks in and you can start losing events.
B: Yeah, events are lossy. You may have seen, for example, that the kubelet has an event manager and tries to send events; if there's a network partition or something like that, the kubelet may retry a few times, but it may give up trying to sync an event. You'll have seen this if you've been working on particular kubelet issues. Now, granted...
A: And the interesting thing is that the test used to rely on no events coming, so even if events are unreliable and just didn't show up, the test would still go forward, right? So changing tests to start relying on events existing is troublesome.
A: Okay, I will put it on me as reviewer, but I think something is going on there, so soon it will go back to... yeah. I think we have plenty of things that... oh wait, this needs approval.
C: And probably we have a pretty urgent bug fix for the dynamic kubelet configuration: it was set to deprecated, and so a lot of our serial lanes started to fail. Memory manager, CPU manager and the like all started to fail, because it failed to parse the deprecated version of the feature.
A: Oh, interesting. Yeah, I would be interested to take a look, but I cannot approve. Can you paste the link in the chat?
A: It was here for a long time, but the deprecation of dynamic kubelet config happened yesterday; it was merged yesterday.
B: And I guess those end-to-ends didn't get exercised in the PR where they made the change, so I think that one's a separate thing. Mm-hmm.
A: So the last item here; well, two more. This one we just discussed. francesca, can you join?
A: He wants some help. Okay, if anybody has time, maybe; I don't know whether you'll get to it this week.
A: Okay, any more agenda items for today?
B: Yeah, I feel like the problem right now is not reviewers; the problem is blocking on approvers. But here, let me very quickly share my screen. I think we should keep this on the recording, and let me just take a quick look at the...
D: I did the plus-one on your PR to get one more approval, but...
D: And I spoke to bob killen as well about this. He knows, and he spoke to don the other day. So.
B: And hopefully this is the right tab; I have two of them open. Okay, good. So, as you can see, we've got, like, 42 things right now, I think, that need an approver, so let me make sure those are all accurate. Yeah, it looks that way.
B: 43 things that need an approver. So, like, all of the ones out here... never mind, here's another one.
B: So, as you can see, we've got a lot of things waiting on approvers right now, and there's some stuff that still needs review. In terms of anything else that needed review, I've only been looking at stuff marked critical-urgent or important-soon. This one is a Windows thing, so I looked at it, but I couldn't really do anything with it.
B: I think the Windows people need to lgtm it, and if they need an approval from node, they should ask us last. In terms of importance and stuff, there's also not a lot here. paco mentioned this one: disable the kubelet read-only port by default. I think we need an approver on that one, so if somebody wants to take a look at that, that would probably be good.
B: Yeah, I think that was part of the concern, and I know a couple of these are related to KEPs that we're trying to land. So if we get eyes on those as well, that would be great, but other than that, I don't think anything else on this list is, like...
B: I think all of these are just... for the most part, I'm not sure what's going on with this one, but that one is controversial, and I think the rest of these are all related to KEPs, so they definitely need eyes. But we're really just blocked on approvers right now. So.
A: Yeah, I noticed that. Even for ones with lgtm, if you're interested, take a look; it helps to have a second look on some PRs, because you leave some comments and then it's easier for the approver to make a final decision.
B: I can do feature approvals, so I've been trying to make sure that, for a lot of our KEPs where we want another set of eyes on the approval, I've been going through and double-checking, basically, anything that touches the features package. That is actually a lot of work, because you have to make sure they've met all of the KEP criteria in order to make changes to feature flags.
B: So you've got to go back and read all the graduation criteria, and then I just go and check things off my list. But yeah, as far as stuff that needs node reviews or stuff that needs API reviews, there's still quite a lot on the backlog. So just make sure that folks are aware, and make sure they're looking at stuff marked important-soon, and critical-urgent especially, and hopefully we'll get as much merged as possible.
A: Is that on SIG ContribEx, for the labels to be applied automatically and fast, without the need for a comment?
B: Oh yeah, I did follow up on that one. That was actually a SIG Testing thing, and it was supposed to be fixed quite a long time ago. There was basically an issue with GitHub API throttling, and so they made some sort of change that should have fixed it, and for a while it was relatively fixed.
B: So if the label is not being applied right now without having to comment on things to basically kick the bot, then feel free to reopen that issue, because that was quite a long time ago, I think.
A: I noticed yesterday that only after my comments did it apply needs-review... so I thought: that's not fixed. Things like that are extremely important, because they help detect things faster.
B
I'm
trying
to
see
oh
here
we
go
yeah
I'll
show
you
the
I'll
link
the
issue
in
the
chat
for
the
notes,
and
if,
if
you
want
to
reopen
this
and
say
oh,
it's
still,
a
problem
again
feel
free,
but
it.
B: The issue was basically that the GitHub search size limit was too big, and they were having some issues with pagination and API throttling.
A: Yeah, unfortunately nobody from APAC showed up. Let's keep doing it once a month, and if it doesn't work we can try something new.
B: Yeah, I don't know; maybe it's just because it's, like, the day before code freeze there, so they were all pooped because it's the end of their day, or... I'm not entirely sure. But yeah, maybe the other thing too: we might be able to set up a reminder or something like that. That might help.