From YouTube: Kubernetes SIG Node 20210324
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A: So we have a pretty empty agenda, and we have test freeze today. The suggestion is to go through all the GitHub issues and try to see which ones we need to, like, scream about and have to merge, and whether we can expedite the PRs that are under review as well.
D: Magic, one second.
C: Okay, so here we go, this is the list. We've got seven things, and I know that we're waiting on this one today, which is... I think there are some concerns about whether or not we could run these tests in conformance, so...
C: This one has also been sitting with Derek. I've been talking with Jordan about this one because he's also reviewed it. Basically, this fix is something we would like to backport to 1.18, but it looks like it adds a new race condition, which I had brought up and apparently others have concerns about. The difficulty with this one is that if we don't merge it now, we can't backport it to 1.18, because development won't reopen for 1.22 until after the last cherry-pick deadline and the last 1.18 patch release.
C: So there's a concern that if we don't merge this now, we can't backport it to 1.18; but there's also a concern that it will introduce more bugs. So we're kind of at a bit of a standstill on this one. And, I mean, I've talked to the release team, but they're not going to hold the release for this, as far as I can tell. So...
C: But yeah, there are folks that are kind of, like, screaming about backporting this to 1.18. The problem is the thing this tries to address... I mean, it fixes an actual, serious correctness bug that affects clusters when nodes go offline and come back online: they basically mess up all of their pod statuses, and pods get stuck in, I think, NodeAffinity, and it's not great.
C: I mean, basically, there are two options, right? We can merge this now, and then we'll have time to let it bake and do the backports, but that's relatively risky this late in the release; or we can wait until 1.22, and then it won't get backported to 1.18. I think those are basically the options.
C: Yeah, yeah. I mean, people can take the patch and build it themselves if they so wish, but you know, that can be very difficult. So yeah, I mean, not that there's, like, a poll here, but you know, some people are like, well, we should merge it now because we're worried, but it's like...
C: We are also worried about making it worse, because there's a new potential race condition we might be introducing, and it's just so hard to detect this sort of thing without CI soaking. I mean, the initial patch that sort of spawned this issue looked very innocuous at the time and has introduced all sorts of weird knock-on effects. So...
C: Yeah, I don't know. Do we want to, like, put a statement on here, on behalf of, I guess, SIG Node triage, that we reviewed this today? Sergey, do you want to do that?
C: I can give you a very quick background, which is basically... so, a patch went in. The issue that users would experience is they'd turn off a node and turn it back on, and all the pods on the node would get stuck in NodeAffinity, and they'd have to do all sorts of weird things to get those pods to actually, properly die, terminate, and get rescheduled.
C
That
kind
of
thing-
and
this
was
this-
was
happening
basically
because
nodes
would
come
online
and
they
would
mark
themselves
as
a
ready
before
they
even
had
a
chance
to
sync
with
the
api
servers,
so
they'd
be
working
with
like
outdated.
Like
view
of
like
what
the
api
state
currently
was
for
those
pods
and
as
a
result
like
you
know,
there
would
be
sort
of
a
race
and
a
mismatch,
and
so
the
the
fix
was
don't
mark
the
note
as
ready.
C
Until
you
know,
we
have
the
we've
synced
with
the
api
servers
and
we
can
get
that
state
and
make
sure
it's
in
sync.
So
we're
not
doing
anything
wonky.
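[Editor's note: a minimal sketch of the gating idea described above, for readers following along. This is illustrative only, not the actual kubelet patch; the channel plumbing, the waitForPodConfigSync helper, and the timeout are hypothetical stand-ins.]

```go
// Illustrative sketch: delay reporting Ready until the initial pod state
// has been synced from the API server. Not real kubelet code.
package main

import (
	"fmt"
	"time"
)

// waitForPodConfigSync blocks until the initial pod list from the API
// server has been observed (signaled by closing the channel), or times out.
func waitForPodConfigSync(synced <-chan struct{}, timeout time.Duration) error {
	select {
	case <-synced:
		return nil
	case <-time.After(timeout):
		return fmt.Errorf("timed out waiting for initial API server sync")
	}
}

func main() {
	synced := make(chan struct{})

	// Simulate the pod config source completing its initial list/watch.
	go func() {
		time.Sleep(500 * time.Millisecond)
		close(synced)
	}()

	// The correctness fix discussed above: only mark the node Ready after
	// the sync, so status updates never act on a stale view of the pods.
	if err := waitForPodConfigSync(synced, 30*time.Second); err != nil {
		fmt.Println("keeping node NotReady:", err)
		return
	}
	fmt.Println("initial sync complete; node can be marked Ready")
}
```

[This framing also makes the knock-on effect mentioned next easy to see: Ready now waits on a round trip to the API server.]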
C: So it's a correctness fix, but there have been a few knock-on effects from it. One is that nodes sometimes take longer to start if they're not running in standalone mode, because they take time to sync with the API server; so in some single-node instance cases, or with kubeadm or that kind of thing, it can take longer to bootstrap now.
C: Yeah, it's, you know, on the order of... it might take a minute; I think 40 seconds was what most people were seeing. It's not like it's going to take 30 minutes or something like that. So there was that. There have also been a couple of follow-up issues claiming that the node affinity issue was not fixed by this patch, but that has not been my experience in production yet, so basically there's more investigation that needs to happen.
C: I've also seen some weird consequences that seem to have started once we merged this in OpenShift, because when we do this, we also do a CRI-O wipe on reboot. So we wipe all of the previous container statuses, and that has caused some weirdness, with the node not really knowing... basically, it causes some accounting issues for pod lifecycle. But because that doesn't happen with containerd, or typically in kube CI, we haven't really seen that anywhere else.
C: So that's what's going on with this. And yeah, there's a concern that moving this into a goroutine, when this is not thread-safe, is going to cause concurrency issues if we merge it. So it might make this faster, and it might address the regression, but it might also add new, bigger problems. And because the kubelet has so many things that are liable to get into race conditions, it's very uncertain what merging this, especially this late in the cycle, will actually do. So...
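[Editor's note: to make the concern concrete, here is a small sketch, not the PR itself; the cache type and method names are invented. It shows why pushing a non-thread-safe update into a goroutine introduces a data race, and the locking the concurrent version would need.]

```go
// Sketch of the concurrency worry discussed above. statusCache stands in
// for shared kubelet state; it is not real kubelet code.
package main

import (
	"fmt"
	"sync"
)

type statusCache struct {
	mu       sync.Mutex
	statuses map[string]string
}

// setUnsafe mimics the risky pattern: once the slow work moves into a
// goroutine, this unsynchronized map write races with other accesses
// (`go run -race` flags it immediately).
func (c *statusCache) setUnsafe(pod, status string) {
	c.statuses[pod] = status
}

// setSafe shows the extra synchronization the concurrent version needs.
func (c *statusCache) setSafe(pod, status string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.statuses[pod] = status
}

func main() {
	c := &statusCache{statuses: make(map[string]string)}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(n int) { // the "make it faster" goroutine
			defer wg.Done()
			// Swap in c.setUnsafe here to reproduce the race.
			c.setSafe(fmt.Sprintf("pod-%d", n), "Running")
		}(i)
	}
	wg.Wait()
	fmt.Println(len(c.statuses), "statuses recorded")
}
```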
F: It seems to me that, from a project perspective, the benefits are not really clear, and the unknown unknowns are what worry me most as an engineer. There is a concrete risk, not just people brainstorming, of introducing more data races, which is scary because it's hard to detect in CI. So my take, with a disclaimer: unfortunately, it seems that the safest call is to not merge it yet, but I'm fully willing to defer the final call to Derek and to others. From an engineering perspective, it seems that the unknown unknowns outweigh the known benefits. So, this is...
C: I think I agree with you there, but unfortunately, it will not make those folks happy who are like, "but we desperately need this patched in 1.18." So...
F: But is this a release engineering issue? I mean, the PR per se has those problems, but can we... I'm not deep enough into the Kubernetes release engineering procedures: can we make an exception somehow, or is it now or never? Because those are two different things, in my opinion.
C: Yeah, I think part of the problem, too... I mean, this PR has been open for quite a while, and part of the problem is that nobody wanted to make a call on it earlier in the cycle. And now, a month later, we still haven't merged it, but now it's kind of like, oh, I don't know. And there have been a lot of iterations, so it's not like this was just sitting here or something like that.
C: I think there were changes pushed as late as yesterday, so it's not like it's been a static patch sitting here without being looked at for, you know, a month. So maybe, given that... yeah, I mean, I would lean toward... but I don't wanna, you know. Sergey, do you have thoughts on this? This is kind of, like, our one big flaming, you know, critical, urgent thing we probably need to decide today as SIG Node.
A: Yeah, I think the problem with the Kubernetes state overall is that there are different types of users: some users are just happy to take whatever, and they need features and more, like, innovation; and some people just need stability. So we need to decide which camp we are leaning towards. Talking from a stability perspective, and, like, being conservative, you don't take fixes that way.
C: Yeah, I think you raised a point earlier which I thought was very good, which was that the 1.18 cherry-picking thing isn't really our problem as SIG Node; that's kind of the release team's problem. And I think, as SIG Node, it's reasonable for us to say there are known unknowns that make us not super comfortable merging this now.
C: Oh, great. And yeah, I think other than that, there are only, like, two things here labeled as release-blocker, which are these two. And then we have... who marked this lifecycle/frozen? Don't do that.
C: I think these are all on the test enhancements board. So maybe, Sergey, do you want me to hand it over to you? Since these all have milestones attached, we should be able to... I think, if we go to the board, you can filter by milestone, like milestone v1.21. Does that work? Yeah? So we can just, like, go over them there.
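[Editor's note: for reference, GitHub's issue search supports a milestone qualifier, so a filter along these lines would list just the items with the release milestone; the exact milestone name here is assumed.]

```
is:open milestone:"v1.21"
```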
A: Good. So, looking at the release-blocking jobs: all of them except... let me see, master was also flaky yesterday; so, like, all of the older tests are flaky, and it's different tests.
A: So this is not even, like, a test thing; it's not the tests, it's something in the infrastructure. It's fishy. Oops, I didn't mean to quote it.
C: The thing is, I think, like, anything over 95% is probably pretty good.
A: Okay, so yeah, something is wrong with the infrastructure; this one is also flaking, along with all the others, like, this infrastructure thing. And then, what else was I looking at? Yeah.
A: Oh, this one also has, like, a lot of flakes, yeah. So I mean, I don't think... since the blocking and critical ones are not, like... Yesterday there were a couple of flakes that I don't see today, so it's likely just infrastructure timeout issues. And since we don't have anything, like, just jumping out at us, it should be fine. I mean, we should be ready for release.
C: Do we want to fix the... does it have sig/storage on there already?
A: And Jing Xu is in, from SIG Storage.
A: Yeah, I think we can just archive it out of there. The only thing is about the pause image, like, if you want to, like, help with that... but, got it. So...
C: So if it got merged, we should probably just close it. Oh yeah, they linked the issue, but they didn't put "Fixes", so it never got auto-closed.
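[Editor's note: for context, GitHub only auto-closes a linked issue on merge when the PR description uses a closing keyword such as "Fixes", "Closes", or "Resolves" before the issue reference; a bare mention links the two but closes nothing. For example, with a hypothetical issue number:]

```
Fixes #12345
```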
D: I'm sorry, I wasn't able to look at it last week. I might take a look at it this week.
C: Oh, is it the... yeah, 10.50, or whatever.
C: Yeah, so if that's the case, then I think we can close it.
C: Yes, that's what we were trying to do with this release, and we have a docs PR that's up as well. The only thing is we need to, like, you know, update the tests for conformance, and Derek said he wanted to go through them all to make sure we weren't accidentally including anything with unsafe syscalls, because those should not be in conformance. Okay.
A: It's not that we aren't allowed to; it's just that this is quite a recent practice, to, like...
C: But this is new, yeah. I understand. Okay, makes sense. Yeah, I'm just hoping, like... because we're currently holding the thing on the website on this PR, and I don't think that makes any sense, because I don't know... if the plan is that we revert the feature-flag change, like, that's not gonna happen; it was already defaulted. So I don't know, I'll check in with people. Hopefully this will just get merged and it'll be fine, with potential changes that are needed.
A: And I cleaned up, like, the triage issues, so, like, there are a few in review-in-progress, and I didn't see anything critical that we want to take in this release. So I wanted to go over all of them with you again, but it may not be worth the time.
E: I saw some commit, probably from you, if I'm not mistaken... my memory is corrupted.