From YouTube: Kubernetes SIG Node 20210610
Description
Meeting Agenda: https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A: So we had something on the agenda here for this item, but it is crossed out. Do you want to talk about this?
A: Yeah, so this is crossed off, but nobody has merged this. I did talk to Dims and he was hesitant to merge it, because his concern was that we might be losing tests by basically getting rid of this. I'm pretty sure we're not, because I went and looked at the test selectors, and the only difference that I could see was that this one does not select for, I think, serial tests; but from what I can tell it's not actually running anything else. It's just a duplicate.
A: You know, we have the eternal serial job, so we should probably talk about that. I don't know, Francesco, if you've seen the latest.
A: Yeah, so I guess as long as we don't mess up the test description to actually run it on the branches, and we exclude the eviction tests (and it seems to just be the eviction tests), everything runs fine within about an hour and 30 minutes. So I submitted a PR basically to split the two jobs, and then Odin was skeptical that that would fix it.
A: But I think that we should try to divide and conquer, because there were still definitely lots of failing serial node jobs. So I think we should try to fix those separately from the eviction jobs, which are slow.
E: I just want to go on record saying that, in general, I believe splitting the huge serial job is worthwhile per se, even if things were working well. I think that one huge serial job was too much.
A: I mean, this looks to me like it's failing constantly, although we can see, I think, from the PRs where it was run that we had some greens.
A: Yeah, so, I mean, I really want to unblock the periodic so we can get signal on which tests are failing. So if folks on this call don't have any concerns (Dims was happy to approve this one), the way that it is, it's not getting rid of any tests; we're just splitting them into different jobs. If folks think that's okay, then hopefully we can get this merged, and then we will have the eviction tests, I guess, running and timing out, and we can file that as a separate issue and someone can work on it. And then we will have the rest of the serial tests, which, according to Testgrid (this is from the PR job), look pretty flaky, but there are some of them that are passing. There are a few that have been failing, and they do look like they're some of the disruptive tests.
A: We have a much larger group, mostly. I think Odin linked this tab; this is the PR tab. This will not run continuously unless we go and manually trigger it on every single PR, and I don't think we should do that. I think that, you know, this is based on the two commits where we disabled the eviction tests, so for any other PR that doesn't do that, it's still going to keep failing. So we really need to split that out.
A: So mostly I was just confused by what he was saying, in terms of...
A: Do we need to let this soak? Because nothing's merged that would need to soak right now. There's nothing that's merged in kubernetes/kubernetes, and the only thing that's merged in test-infra is that we fixed it so that it's actually running against PRs and not just against master. I missed that when I did the copy-paste on the first one.
G: I took a look at these; yes, I have a vague idea.
A: So the TL;DR is that we have this node kubelet serial job, which is supposed to be running all of the serial tests for the kubelet, and it has been failing for months, but it has a bunch of really important tests. So we wanted to get this job working, and we started with trying to tweak the timeouts: previously it had a three-hour timeout, and we tried increasing it to seven.
A: It was still failing, so then Francesco found that it was the eviction tests that were timing out, so he submitted a PR to stop running those, and then it still timed out. But then we realized that the job for the pull request was not actually running on the pull request; it was just running against master. So we fixed that, and then we found that indeed, when we disabled the eviction tests, it was passing in a reasonable amount of time.
A: So I submitted a PR to basically split the two, so we can work, on the one hand, on getting the eviction tests fixed so they're not timing out (it sounds like they're timing out even after 24 hours, so there's something wrong with those tests), and, on the other, keep the rest of the tests in the same job, so that we can get some signal in terms of what's flaking and what needs to be fixed, and hopefully add them back to release-informing or maybe even release-blocking.
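
(Not part of the discussion, but to make the split being described concrete: a minimal Go sketch, assuming the two jobs would select tests with ginkgo-style focus/skip regexes. The test names and regexes below are illustrative only, not the actual test-infra configuration or the regexes used in the PR.)

```go
// Minimal sketch of the idea behind the split: one job focuses on the eviction
// tests, the other keeps the remaining serial tests, so no tests are lost.
// Test names and regexes are illustrative, not the real test-infra config.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	testNames := []string{
		"[sig-node] MemoryAllocatableEviction [Slow] [Serial] [Disruptive] ...",
		"[sig-node] Density [Serial] [Slow] create a batch of pods",
		"[sig-node] Restart [Serial] [Slow] [Disruptive] Kubelet restarts",
	}

	evictionFocus := regexp.MustCompile(`Eviction`) // hypothetical focus for the eviction-only job
	serialFocus := regexp.MustCompile(`\[Serial\]`) // focus for the remaining serial job

	for _, name := range testNames {
		switch {
		case evictionFocus.MatchString(name):
			fmt.Printf("eviction job:         %s\n", name)
		case serialFocus.MatchString(name):
			fmt.Printf("remaining serial job: %s\n", name)
		}
	}
}
```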
A: So yeah, I guess we'll leave it, to give Odin a chance to chime in on that PR. But I think that we should proceed with splitting, because if we don't do that, we're not going to get more signal.
A: I have no idea, so that would be, I guess... do we have an issue for that? Probably not. That's probably something we should file right now.
H: I think what was happening was that we actually did kill the pods, but there was a different asynchronous thread going on inside the kubelet that was getting the stats for which pods are running. So they were seeing that, you know, a pod that had already been deleted was still running after the pod was deleted, because of the asynchronicity of the stats they use to get which pods are running.
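
(As an aside, the kind of fix that usually follows from a race like this is to poll until the asynchronous stats catch up, rather than asserting right after the deletion. A rough sketch, assuming a hypothetical listRunningPodsFromStats helper; the real node e2e code may do this differently.)

```go
// Rough sketch (hypothetical helper name, not the actual node e2e code):
// because the kubelet's stats are collected asynchronously, a just-deleted pod
// can still show up as "running" for a while, so a test should poll until the
// stats catch up instead of asserting immediately after deletion.
package e2enode

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// listRunningPodsFromStats is a hypothetical helper standing in for whatever
// the test uses to read the kubelet's pod stats.
func listRunningPodsFromStats() ([]string, error) {
	// ... query the kubelet stats endpoint ...
	return nil, nil
}

// waitForPodGoneFromStats polls until the deleted pod no longer appears in the
// asynchronously collected stats, or the timeout expires.
func waitForPodGoneFromStats(podName string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pods, err := listRunningPodsFromStats()
		if err != nil {
			return false, err
		}
		for _, p := range pods {
			if p == podName {
				// Stats are still stale; keep waiting.
				return false, nil
			}
		}
		return true, nil
	})
}
```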
A: I thought I opened it. Let's see... I don't know if there's anything other than that one that we're missing. I'm going to assume that folks can assign themselves, and then I'll leave this in the in-progress column. Let's see if there are any flakes... no? Yay, okay.
A: So if somebody wants to just go in and triage, except that one, right now, that would be great. And I'm not sure what the state of things is, or if there's anything that we need updates on. Oh, we've got this "kubelet node conformance tests are broken" one that I'm working on, so I should maybe /assign this one.
A: Yeah, so I think that this will probably be handled by whoever is working on these two things; both of those I'm assigned to, and there is a PR up to just remove it.
A: So yeah, it seems like for this one Dims had some concerns that we might be losing tests by removing this job, but I'm pretty sure that we're not actually losing anything, because I did a search through. I believe we have this job, which is running fine (I believe it's the kubernetes node-kubelet one), and then we have the kubernetes node-kubelet-conformance one, which runs the exact same thing, only it does not skip serial tests, whereas this one skips serial tests; but I couldn't find anything marked node conformance.
A: Maybe I can pull it up in Testgrid. So here's the node kubelet conformance one; there's just some sort of infra issue where the job is not starting. And then what is the other job called?
A
So-
and
I
feel
I
mean
I
don't
know-
maybe
there's
one
of
those
eviction
tests-
that's
tagged
cereal,
that's
causing
this
one
to
fail,
but
that
I
don't
know
if
that
would
be
the
thing
causing
this,
then
we
can
look
at
one
of
the
test
runs.
Maybe.
A: Is there anybody that wants to pick this up? I mean, I put a PR up to remove the job. Does anybody want to confirm that they're not duplicates and see why this is a CI failure? Because I know that Dims indicated some interest in making the job work first and then confirming it's a duplicate before we get rid of it.
A: And Adidi, you're assigned to this one; is there anything that's been happening here? It looks like you did some research, yeah.
J: So actually, one or two months back, I guess Dims raised multiple PRs on cleaning up the containerd tests, and yes, I'm linked on the issue, and I have also raised one or two PRs. So now, if you look at the containerd board on Testgrid, it looks much cleaner.
A: And what exactly is the follow-up that we still need to do? I think that...
A: Okay, looks like we have an approach for that from yesterday, so that's...
A: That's okay. Does anybody feel passionate about this test and want to jump in? I know this has been a flake for quite some time.
A: Yes, well, don't we all? It would be good to at least get somebody looking at this now, because I know I've seen this flake.
A: But if nobody else can get to it, then certainly you can look at it later. Let's just... you can't do everything.
A: So this failure that Rob Kielty reported here, that's a duplicate of the other issue; it's not a timeout.
A: So perhaps, given that that summary is the same, I will just close that one as a duplicate of this one, because I think this one has a little bit more detail.
A
And
actually
it's
hard
to
say,
but
I
know
that
the
the
ci
signal
team
is
looking
at
this
issue
and
not
this.
A: So, "Pods should run through the lifecycle of Pods and PodStatus."
I: Yeah, I was looking into this yesterday, kind of did some digging, so basically there's still a test that is failing because of some timeout issue, and we need to get to the bottom of it. So I just put in some of my findings, for me to, you know, remember when I come back to it. Yes.
A: No, that's a great update. And yeah, I know with this one it's just that the pods for the tests initially are not starting, which, I don't know if that's something we can do anything about; that's more of an API machinery thing.
A: Then, I guess, this one looks like it's also being looked at by the CI signal team, so it'd be good if we had someone actively assigned to it. I'll unassign you, Francesco, just because you've got so much on your plate.
A: And I know you also have the issue of doom with the flakes on the probes.
A: I should probably take a look at that one today: the flaky test "Pods should support pod readiness gates."
A: "Potential race" from one of the winters. Adidi, you are assigned to this one: do you have any updates on it?
J: Yeah, give me a second. So on this one, if you go... I have mentioned it in the comments here. If you go down, next one, yeah. From the discussion, I found that what the lender is doing, that part we can remove; it won't have any...
J: I will create a PR to remove it, and then we can have more discussion on it, possibly.
A: This is more of a sort-of-topic thing, and yeah, I agree with this: we should not have node features and also features; it's just kind of all over the place, so I'm not sure.
A: Okay, last one. I think this is Matthias's favorite.
A: It flakes less; yeah, that's good news. I'm not sure how to fix this one. I know when I added the tests for changing the grace period, I did a thing where I ran the test on my local machine for like two hours, then I looked at the distribution of timings, and then I based the timeouts on that. I doubt anybody has done anything like that with this one, so I wonder if maybe that's the sort of thing that we just need to do here.
A: I went and ran it, I collected 100 data points, I did the quantile analysis on it, and then I added a little bit of room. Sometimes the problem with these things is that, because you know there are possible races and whatnot, the way we set the timeout thresholds is that we have to set them in such a way that we know for a fact we're getting the behavior we want isolated in the tests, and we're not accidentally timing out on something else.
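
(For reference, the timing-based approach described above, namely collect samples from repeated local runs, take a high quantile, and add headroom, might look roughly like this sketch. The 0.99 quantile and the 1.5x headroom factor are illustrative choices, not values anyone settled on in the meeting.)

```go
// Rough sketch of quantile-based timeout selection: collect timing samples from
// repeated local runs, take a high quantile, and add headroom so unrelated races
// don't masquerade as this timeout. Quantile and multiplier are illustrative.
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// quantile returns the q-th quantile (0 < q <= 1) of the given durations.
func quantile(samples []time.Duration, q float64) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(math.Ceil(q*float64(len(sorted)))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

func main() {
	// e.g. ~100 measured durations from running the test locally in a loop.
	samples := []time.Duration{40 * time.Second, 42 * time.Second, 55 * time.Second /* ... */}

	p99 := quantile(samples, 0.99)
	timeout := time.Duration(float64(p99) * 1.5) // headroom on top of the observed tail
	fmt.Printf("p99=%v suggested timeout=%v\n", p99, timeout)
}
```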
A: ...will necessarily make a big difference on this one, but I think possibly reducing the jitter, to ensure that it's... I mean, I guess it'll still just be some multiple of, like...
K: Yeah, so...
K: I need to look at the source code again, because at least the way the test is written now, we no longer have the issue of the sleeping at the beginning, the jitter that was interfering with the sleeping. So now, by the way it's written here, that's out of scope, but there might still be something in propagating the statuses, maybe from the container to the pod, or I don't know.
K: I have like two questions. So, is this meeting superseding the one on Wednesday, or do we have both?
A: So we'll never have both of them. The plan is that we will do this one once a month, on the second Thursday at this time, and we'll use the Wednesday time the rest of the month, because we weren't sure how much demand there was for this time. And the other issue is that I am quadruple-booked in this slot, so I can do this once a month, but I can't do it every week.
A: We won't have a Wednesday meeting: every second week of the month, every second Thursday of the month, we will have this time and not have the Wednesday meeting. There is a calendar invite that was sent out to the SIG Node test failures mailing list, so if you are subscribed to that list, then you will get the calendar invite, and you will know when the meetings are supposed to be, Sergey.
A: ...week, and we're basically at time, so we didn't have a chance to go through the rest of the PRs, but I think that's okay. Then we'll stop sharing. Anything else for our last two minutes today?
A: Bug smashing: so we're going to be working on the bug scrub. That will be in two weeks, two weeks from today, and my hope is that we can kick it off probably starting at about 1 UTC on the 24th, which for me will be, I think, the day before, kind of thing. So I'm happy to help kick that off.
A
We
need
volunteers,
so
I
haven't
checked
the
spreadsheet,
but
I
sent
out
a
spreadsheet
to
the
mailing
list
for
folks
to
sign
up
to
volunteer,
and
I
know
that
we
have
some
folks
on
this
call
in
various
time
zones.
So
I'm
hoping
that
I
will
see
you
there
and
that
we
can
have
some
folks
volunteer,
looks
like
some
people
have
signed
up.
A: So that's great. We need regional captains for Europe, the Middle East, and Africa (EMEA). We at least have lots of reviewers, approvers, and possible mentors signing up, but we need people to help coordinate, so I encourage you to sign up for that.
A
Oh
and
let
me
show
the
link
in
the
chat
it's
also
in
slack,
and
it's
also
sent
to
the
mailing
list
and
thank
you
adidi
and
paco
for
volunteering
for
apac.
That's
awesome
coordination.
Will
there
be
a
slacker
zoom?
I
am
trying
to
figure
that
out
right
now,
so
probably
we'll
just
use
the
oh
hello
bird.
I
have
a
bird
sitting
on
my
window.
A: Hopefully we will use the SIG Node Zoom, so this room, and we'll just be dropping in and out throughout the day. We don't have any other SIG Node meetings scheduled, so it's not like there's worry about a conflict. And then for Slack, I'm seeing if we're going to either use the main channel or get a special event channel; I've been talking to ContribEx and the Slack admins, and I haven't gotten a response from them yet. I'm also trying to get Triage Party set up, because apparently we have to run our own instance of it; it's not a hosted thing. So yeah, I'm still working on the details, but hopefully at some point we'll have that all figured out. Great, we're at time, so I will see everybody next week in a SIG Node meeting, perhaps, or on Slack sooner. Have a great rest of your day. Cheers everyone, cheers, bye.